Back to Learning Center
Learning Center

CRM Data Cleanup with AI (Before You Build Anything)

How to use AI to classify, deduplicate, and standardize CRM data. Austin PE firm approach.

CRM
Data Cleanup with AI (Before You Build Anything)

Your CRM

is a mess. Duplicate contacts with three different email addresses. Company names entered as "IBM", "I.B.M.", and "International Business Machines Corp." Opportunities from 2019 still marked "Negotiation - 90%". You know it, your team knows it, and every report you pull confirms it.

Here's what most firms do wrong: they buy a new integration, hire a CRM

consultant, or launch a "data quality initiative" that dies in six weeks. Then they wonder why their marketing automation sends three emails to the same person or why their pipeline reports are fiction.

The Austin PE firm approach is different. Before building workflows, before implementing AI assistants, before anything - you systematically clean the data using AI to do the heavy lifting. This is the exact process we used with portfolio companies managing 50,000+ CRM

records across Salesforce, HubSpot, and Dynamics.

This takes 2-4 weeks of focused work. The payoff is a CRM

you can actually trust.

Step 1: Classify Every Record Type in Your Database

You cannot clean what you cannot categorize. Most CRMs

contain 6-12 distinct object types, but firms treat them like one undifferentiated blob.

Run this audit first:

  1. Export a sample of 500 random records from your CRM
    (CSV format)
  2. Open Claude.ai or ChatGPT with GPT-4
  3. Upload the file and use this prompt:
Analyze this CRM export. Identify every distinct record type present (contacts, accounts, opportunities, leads, activities, custom objects). For each type, list:
- Defining characteristics
- Key fields that should be present
- Common data quality issues you observe
- Percentage of records that appear to be this type

Format as a table.

What you'll discover:

Your "Contacts" object contains actual people, but also generic emails like info@company.com, former employees, and vendor contacts that should live elsewhere. Your "Accounts" object mixes active clients, dead prospects from 2017, and referral partners.

Create a classification schema:

| Record Type | Must-Have Fields | Auto-Classification Rule | |-------------|------------------|--------------------------| | Active Contact | First name, last name, company email, valid phone | Has @company domain + phone format (XXX) XXX-XXXX + created/modified within 24 months | | Active Account | Company name, website, industry | Has valid URL + industry field populated + associated with 1+ active contact | | Dead Opportunity | Close date, stage, amount | Close date > 12 months ago + stage = "Closed Lost" or "Stalled" | | Vendor/Partner | Company name, relationship type | Tagged as "Vendor" or "Partner" OR domain matches known vendor list |

Use Claude Projects for bulk classification:

  1. Create a new Project in Claude.ai
  2. Upload your full CRM
    export (Claude handles files up to 100MB)
  3. Add this instruction to Project Knowledge: "You are a CRM
    data classifier. Apply the classification schema I provide to every record. Output a CSV with original record ID + assigned record type + confidence score (0-100)."
  4. Process in batches of 5,000 records

For Salesforce users: use a Python script with the Salesforce API and OpenAI's batch API

. Cost: approximately $8 per 100,000 records classified.

Step 2: Deduplicate Using Fuzzy Matching and AI Judgment

Deduplication is where most cleanup projects fail. Firms use CRM

native tools that only catch exact matches, missing 60-70% of actual duplicates.

Contact deduplication process:

  1. Export all contacts to CSV
  2. Use Python with the fuzzywuzzy library or the dedupe library (both free, open-source)
  3. Set matching thresholds:
    • Email exact match = 100% duplicate
    • Phone exact match = 95% duplicate (allows for formatting differences)
    • Name + company fuzzy match >85% = flag for AI review

AI review prompt for borderline matches:

I have two CRM contact records that may be duplicates:

Record A:
Name: John Smith
Email: jsmith@acmecorp.com
Phone: (512) 555-0100
Title: Partner
Company: Acme Corporation

Record B:
Name: Jonathan Smith
Email: john.smith@acme-corp.com
Phone: 512-555-0100
Title: Managing Partner
Company: ACME Corp

Are these the same person? If yes, which record has more complete/accurate data? Respond with: DUPLICATE - Keep Record [A/B] OR NOT_DUPLICATE.

Run this through Claude's API

for every flagged pair. Cost: $0.02 per comparison at current API
pricing.

Account deduplication is harder:

Company names are a nightmare. "PricewaterhouseCoopers LLP", "PwC", "PricewaterhouseCoopers", and "PWC US" are all the same firm.

Use Clearbit's Company API

or ZoomInfo's matching API
to resolve company names to a canonical form. Both offer free tiers for up to 1,000 lookups/month. For larger databases, expect $200-500/month.

Merge strategy:

  • Keep the record with the most recent activity date as the master
  • Append notes/custom fields from duplicate records to the master record's activity history
  • Reassign all related opportunities, cases, and activities to the master record
  • Mark duplicates as "Merged - Do Not Use" before deleting (keep for 90 days as backup)

Step 3: Standardize Formatting and Field Values

Standardization is not about making data "pretty". It's about making it queryable and reportable.

Phone number standardization:

Use the phonenumbers Python library (Google's libphonenumber). It handles international formats, extensions, and validation.

import phonenumbers

def standardize_phone(phone_string, default_region='US'):
    try:
        parsed = phonenumbers.parse(phone_string, default_region)
        return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    except:
        return None

Output format: +15125550100 (E.164 international standard)

Company name standardization:

  1. Remove legal suffixes: LLC, Inc., Corp., Ltd., LLP (use regex)
  2. Expand common abbreviations: "Intl" → "International", "Mfg" → "Manufacturing"
  3. Use OpenAI's API
    to resolve ambiguous cases:
Standardize this company name to its official form: "PWC Advisory Svcs LLC"

Return only the standardized name, no explanation.

Expected output: "PricewaterhouseCoopers"

Industry classification:

Most CRMs

have 200+ industry picklist values that nobody uses consistently. Collapse to 15-20 standard categories.

Use this Claude prompt for bulk reclassification:

Reclassify these industry values into one of these 15 standard categories: [list your categories]

Input industries: [paste 50-100 at a time]

Output format: Original Industry | Standard Category

Address standardization:

Use the Google Maps Geocoding API

or USPS Address Validation API
. Both are free for reasonable volumes (<10,000 addresses/month).

This gives you clean, standardized addresses plus latitude/longitude for territory mapping.

Step 4: Enrich with External Data Sources

Enrichment is not optional. Your CRM

contains 30% of the data you need to run effective outreach and reporting.

Contact enrichment sources:

  • Apollo.io: 10 free credits/month, then $49/month for 1,000 credits. Returns email, phone, title, LinkedIn URL.
  • Hunter.io: Email finder and verification. 25 free searches/month.
  • LinkedIn Sales Navigator: Manual enrichment for high-value contacts. $99/month per seat.

Account enrichment sources:

  • Clearbit Enrichment API
    : $99/month for 2,500 lookups. Returns employee count, revenue, tech stack, funding.
  • ZoomInfo: Enterprise pricing (expect $15K-30K/year). Most complete B2B database.
  • BuiltWith: Technology stack data. $295/month for API
    access.

Enrichment workflow:

  1. Export accounts missing key fields (employee count, revenue, industry)
  2. Run through Clearbit API
    first (cheapest, good coverage for US companies)
  3. For non-matches, try ZoomInfo or manual LinkedIn research
  4. Use Claude to extract data from company websites:
Visit this company website: [URL]

Extract and return in JSON format:
- Employee count (estimate if not stated)
- Primary industry
- Headquarters location
- Key products/services (max 3)

If information is not available, return null for that field.

Enrichment adds 15-20 hours of work but doubles the usability of your CRM

.

Step 5: Validate and Lock Down Data Quality

Cleanup is worthless if your team immediately re-corrupts the data.

Implement these controls:

  1. Required fields at creation: Make industry, phone, and company website mandatory for new accounts
  2. Picklist restrictions: Convert free-text fields to picklists wherever possible (industry, lead source, account type)
  3. Validation rules: Block saving a contact without a valid email format or phone number format
  4. Duplicate prevention: Enable Salesforce's native duplicate matching or HubSpot's duplicate management (post-cleanup, these tools work well)

Create a data quality dashboard:

Track these metrics weekly:

  • Percentage of contacts with valid email + phone
  • Percentage of accounts with industry + employee count
  • Number of duplicate records created (should be <5/week)
  • Percentage of opportunities with next step + close date

Assign one person as data quality owner. This is a 2-4 hour/week role, not a full-time job.

Tools and Costs Summary

Free/low-cost tools:

  • Claude.ai Projects: $20/month for bulk classification and enrichment prompts
  • Python libraries (fuzzywuzzy, phonenumbers, dedupe): Free
  • Google Sheets + Apps Script: Free, good for small datasets (<10K records)

Paid tools worth the investment:

  • Clearbit: $99/month (account enrichment)
  • Apollo.io: $49/month (contact enrichment)
  • Zapier or Make.com: $20-50/month (automation for ongoing data quality)

Total cost for a 20,000-record CRM

cleanup: $500-800 in tools + 60-80 hours of work.

What Happens After Cleanup

You now have a CRM

where:

  • Every contact has a valid email and phone number
  • Every account has industry, size, and location data
  • Duplicate records are eliminated
  • Field formatting is consistent and queryable

This is the foundation. Now you can build:

  • Marketing automation that doesn't embarrass you
  • Pipeline reports that match reality
  • AI assistants that pull accurate data
  • Territory assignments that make sense

Firms that skip cleanup spend 18 months fighting their CRM

. Firms that do cleanup spend 3 weeks, then build with confidence.

Clean your data first. Build everything else second.

Revenue Institute

Reviewed by Revenue Institute

This guide is actively maintained and reviewed by the implementation experts at Revenue Institute. As the creators of The AI Workforce Playbook, we test and deploy these exact frameworks for professional services firms scaling without new headcount.

Revenue Institute

Need help turning this guide into reality? Revenue Institute builds and implements the AI workforce for professional services firms.

RevenueInstitute.com