Back to Learning Center
Learning Center

CRM Data Cleanup with AI (Before You Build Anything)

How to use AI to classify, deduplicate, and standardize CRM data. Austin PE firm approach.

CRM Data Cleanup with AI (Before You Build Anything)

Your CRM is a mess. Duplicate contacts with three different email addresses. Company names entered as "IBM", "I.B.M.", and "International Business Machines Corp." Opportunities from 2019 still marked "Negotiation - 90%". You know it, your team knows it, and every report you pull confirms it.

Here's what most firms do wrong: they buy a new integration, hire a CRM consultant, or launch a "data quality initiative" that dies in six weeks. Then they wonder why their marketing automation sends three emails to the same person or why their pipeline reports are fiction.

The Austin PE firm approach is different. Before building workflows, before implementing AI assistants, before anything - you systematically clean the data using AI to do the heavy lifting. This is the exact process we used with portfolio companies managing 50,000+ CRM records across Salesforce, HubSpot, and Dynamics.

This takes 2-4 weeks of focused work. The payoff is a CRM you can actually trust.

Step 1: Classify Every Record Type in Your Database

You cannot clean what you cannot categorize. Most CRMs contain 6-12 distinct object types, but firms treat them like one undifferentiated blob.

Run this audit first:

  1. Export a sample of 500 random records from your CRM (CSV format)
  2. Open Claude.ai or ChatGPT with GPT-4
  3. Upload the file and use this prompt:
Analyze this CRM export. Identify every distinct record type present (contacts, accounts, opportunities, leads, activities, custom objects). For each type, list:
- Defining characteristics
- Key fields that should be present
- Common data quality issues you observe
- Percentage of records that appear to be this type

Format as a table.

What you'll discover:

Your "Contacts" object contains actual people, but also generic emails like info@company.com, former employees, and vendor contacts that should live elsewhere. Your "Accounts" object mixes active clients, dead prospects from 2017, and referral partners.

Create a classification schema:

| Record Type | Must-Have Fields | Auto-Classification Rule | |-------------|------------------|--------------------------| | Active Contact | First name, last name, company email, valid phone | Has @company domain + phone format (XXX) XXX-XXXX + created/modified within 24 months | | Active Account | Company name, website, industry | Has valid URL + industry field populated + associated with 1+ active contact | | Dead Opportunity | Close date, stage, amount | Close date > 12 months ago + stage = "Closed Lost" or "Stalled" | | Vendor/Partner | Company name, relationship type | Tagged as "Vendor" or "Partner" OR domain matches known vendor list |

Use Claude Projects for bulk classification:

  1. Create a new Project in Claude.ai
  2. Upload your full CRM export (Claude handles files up to 100MB)
  3. Add this instruction to Project Knowledge: "You are a CRM data classifier. Apply the classification schema I provide to every record. Output a CSV with original record ID + assigned record type + confidence score (0-100)."
  4. Process in batches of 5,000 records

For Salesforce users: use a Python script with the Salesforce API and OpenAI's batch API. Cost: approximately $8 per 100,000 records classified.

Step 2: Deduplicate Using Fuzzy Matching and AI Judgment

Deduplication is where most cleanup projects fail. Firms use CRM native tools that only catch exact matches, missing 60-70% of actual duplicates.

Contact deduplication process:

  1. Export all contacts to CSV
  2. Use Python with the fuzzywuzzy library or the dedupe library (both free, open-source)
  3. Set matching thresholds:
    • Email exact match = 100% duplicate
    • Phone exact match = 95% duplicate (allows for formatting differences)
    • Name + company fuzzy match >85% = flag for AI review

AI review prompt for borderline matches:

I have two CRM contact records that may be duplicates:

Record A:
Name: John Smith
Email: jsmith@acmecorp.com
Phone: (512) 555-0100
Title: Partner
Company: Acme Corporation

Record B:
Name: Jonathan Smith
Email: john.smith@acme-corp.com
Phone: 512-555-0100
Title: Managing Partner
Company: ACME Corp

Are these the same person? If yes, which record has more complete/accurate data? Respond with: DUPLICATE - Keep Record [A/B] OR NOT_DUPLICATE.

Run this through Claude's API for every flagged pair. Cost: $0.02 per comparison at current API pricing.

Account deduplication is harder:

Company names are a nightmare. "PricewaterhouseCoopers LLP", "PwC", "PricewaterhouseCoopers", and "PWC US" are all the same firm.

Use Clearbit's Company API or ZoomInfo's matching API to resolve company names to a canonical form. Both offer free tiers for up to 1,000 lookups/month. For larger databases, expect $200-500/month.

Merge strategy:

  • Keep the record with the most recent activity date as the master
  • Append notes/custom fields from duplicate records to the master record's activity history
  • Reassign all related opportunities, cases, and activities to the master record
  • Mark duplicates as "Merged - Do Not Use" before deleting (keep for 90 days as backup)

Step 3: Standardize Formatting and Field Values

Standardization is not about making data "pretty". It's about making it queryable and reportable.

Phone number standardization:

Use the phonenumbers Python library (Google's libphonenumber). It handles international formats, extensions, and validation.

import phonenumbers

def standardize_phone(phone_string, default_region='US'):
    try:
        parsed = phonenumbers.parse(phone_string, default_region)
        return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    except:
        return None

Output format: +15125550100 (E.164 international standard)

Company name standardization:

  1. Remove legal suffixes: LLC, Inc., Corp., Ltd., LLP (use regex)
  2. Expand common abbreviations: "Intl" → "International", "Mfg" → "Manufacturing"
  3. Use OpenAI's API to resolve ambiguous cases:
Standardize this company name to its official form: "PWC Advisory Svcs LLC"

Return only the standardized name, no explanation.

Expected output: "PricewaterhouseCoopers"

Industry classification:

Most CRMs have 200+ industry picklist values that nobody uses consistently. Collapse to 15-20 standard categories.

Use this Claude prompt for bulk reclassification:

Reclassify these industry values into one of these 15 standard categories: [list your categories]

Input industries: [paste 50-100 at a time]

Output format: Original Industry | Standard Category

Address standardization:

Use the Google Maps Geocoding API or USPS Address Validation API. Both are free for reasonable volumes (<10,000 addresses/month).

This gives you clean, standardized addresses plus latitude/longitude for territory mapping.

Step 4: Enrich with External Data Sources

Enrichment is not optional. Your CRM contains 30% of the data you need to run effective outreach and reporting.

Contact enrichment sources:

  • Apollo.io: 10 free credits/month, then $49/month for 1,000 credits. Returns email, phone, title, LinkedIn URL.
  • Hunter.io: Email finder and verification. 25 free searches/month.
  • LinkedIn Sales Navigator: Manual enrichment for high-value contacts. $99/month per seat.

Account enrichment sources:

  • Clearbit Enrichment API: $99/month for 2,500 lookups. Returns employee count, revenue, tech stack, funding.
  • ZoomInfo: Enterprise pricing (expect $15K-30K/year). Most complete B2B database.
  • BuiltWith: Technology stack data. $295/month for API access.

Enrichment workflow:

  1. Export accounts missing key fields (employee count, revenue, industry)
  2. Run through Clearbit API first (cheapest, good coverage for US companies)
  3. For non-matches, try ZoomInfo or manual LinkedIn research
  4. Use Claude to extract data from company websites:
Visit this company website: [URL]

Extract and return in JSON format:
- Employee count (estimate if not stated)
- Primary industry
- Headquarters location
- Key products/services (max 3)

If information is not available, return null for that field.

Enrichment adds 15-20 hours of work but doubles the usability of your CRM.

Step 5: Validate and Lock Down Data Quality

Cleanup is worthless if your team immediately re-corrupts the data.

Implement these controls:

  1. Required fields at creation: Make industry, phone, and company website mandatory for new accounts
  2. Picklist restrictions: Convert free-text fields to picklists wherever possible (industry, lead source, account type)
  3. Validation rules: Block saving a contact without a valid email format or phone number format
  4. Duplicate prevention: Enable Salesforce's native duplicate matching or HubSpot's duplicate management (post-cleanup, these tools work well)

Create a data quality dashboard:

Track these metrics weekly:

  • Percentage of contacts with valid email + phone
  • Percentage of accounts with industry + employee count
  • Number of duplicate records created (should be <5/week)
  • Percentage of opportunities with next step + close date

Assign one person as data quality owner. This is a 2-4 hour/week role, not a full-time job.

Tools and Costs Summary

Free/low-cost tools:

  • Claude.ai Projects: $20/month for bulk classification and enrichment prompts
  • Python libraries (fuzzywuzzy, phonenumbers, dedupe): Free
  • Google Sheets + Apps Script: Free, good for small datasets (<10K records)

Paid tools worth the investment:

  • Clearbit: $99/month (account enrichment)
  • Apollo.io: $49/month (contact enrichment)
  • Zapier or Make.com: $20-50/month (automation for ongoing data quality)

Total cost for a 20,000-record CRM cleanup: $500-800 in tools + 60-80 hours of work.

What Happens After Cleanup

You now have a CRM where:

  • Every contact has a valid email and phone number
  • Every account has industry, size, and location data
  • Duplicate records are eliminated
  • Field formatting is consistent and queryable

This is the foundation. Now you can build:

  • Marketing automation that doesn't embarrass you
  • Pipeline reports that match reality
  • AI assistants that pull accurate data
  • Territory assignments that make sense

Firms that skip cleanup spend 18 months fighting their CRM. Firms that do cleanup spend 3 weeks, then build with confidence.

Clean your data first. Build everything else second.

Frequently Asked Questions

How do I clean my CRM data using AI? Five-step process: (1) Classify record types using Claude or GPT-4 on a 500-record export. (2) Deduplicate using Python's fuzzywuzzy library + AI review of borderline matches ($0.02/comparison via API). (3) Standardize formatting - phonenumbers library for phones, AI prompts for company name normalization. (4) Enrich with Apollo.io ($49/month for contacts) and Clearbit ($99/month for accounts). (5) Lock down quality with required fields, picklist restrictions, and duplicate prevention rules.

How much does CRM data cleanup cost? For a 20,000-record CRM: $500-800 in tools + 60-80 hours of work. Tool costs: Claude.ai ($20/month), Clearbit ($99/month), Apollo.io ($49/month). Python libraries are free. The 60-80 hours is the actual bottleneck - plan 2-4 weeks of focused work. Payoff: pipeline reports that match reality and AI workflows that can trust their input data.

What should I fix in my CRM before building AI workflows? Five must-fix items: (1) Duplicate contacts and accounts. (2) Inconsistent company names ('IBM' vs 'I.B.M.' vs 'International Business Machines'). (3) Dead opportunities skewing AI predictions. (4) Missing required fields (email, phone, industry, size). (5) Free-text chaos in fields that should be structured - time entries, industry classification, lead source.

How long does CRM data cleanup take? Budget 2-4 weeks for a 20,000-record CRM: Week 1 (audit and classify), Week 2 (deduplicate), Week 3 (standardize and enrich), Week 4 (implement quality controls and train team). For 100,000+ record CRMs, plan 6-8 weeks and budget additional engineering time for API-based batch processing.

Get the Book

The full system, end to end.

Looking to build your AI workforce? Get the comprehensive guide for professional services - the 12 plays, the frameworks, and the field-tested playbooks.

Buy on Amazon
Revenue Institute

Reviewed by Revenue Institute

This guide is actively maintained and reviewed by the implementation experts at Revenue Institute. As the creators of The AI Workforce Playbook, we test and deploy these exact frameworks for professional services firms scaling without new headcount.

Done-For-You Implementation

Need help turning this guide into reality?

Revenue Institute builds and implements the AI workforce for professional services firms.

Work with Revenue Institute