CRM Data Cleanup with AI (Before You Build Anything)
How to use AI to classify, deduplicate, and standardize CRM data. Austin PE firm approach.
Your CRM
Here's what most firms do wrong: they buy a new integration, hire a CRM
The Austin PE firm approach is different. Before building workflows, before implementing AI assistants, before anything else, you systematically clean the data, using AI to do the heavy lifting. This is the exact process we used with portfolio companies managing 50,000+ CRM
This takes 2-4 weeks of focused work. The payoff is a CRM
Step 1: Classify Every Record Type in Your Database
You cannot clean what you cannot categorize. Most CRMs
Run this audit first:
- Export a sample of 500 random records from your CRM (CSV format)
- Open Claude.ai or ChatGPT with GPT-4
- Upload the file and use this prompt:
Analyze this CRM export. Identify every distinct record type present (contacts, accounts, opportunities, leads, activities, custom objects). For each type, list:
- Defining characteristics
- Key fields that should be present
- Common data quality issues you observe
- Percentage of records that appear to be this type
Format as a table.
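If your CRM only exports the full table, the 500-record random sample used for this audit can be drawn locally before uploading. The following is a minimal sketch using only the Python standard library; the function names `sample_records` and `to_csv` are illustrative, and it assumes the export fits in memory.

```python
import csv
import io
import random

def sample_records(csv_text, n=500, seed=None):
    """Return up to n randomly chosen rows (as dicts) from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) <= n:
        return rows  # export is smaller than the sample size
    # A fixed seed makes the sample reproducible between runs.
    return random.Random(seed).sample(rows, n)

def to_csv(rows):
    """Serialize sampled rows back to CSV text, ready to upload."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Sampling without replacement keeps every record ID unique in the sample, so the percentages the model reports are not skewed by repeats.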
What you'll discover:
Your "Contacts" object contains actual people, but also generic emails like info@company.com, former employees, and vendor contacts that should live elsewhere. Your "Accounts" object mixes active clients, dead prospects from 2017, and referral partners.
Create a classification schema:
| Record Type | Must-Have Fields | Auto-Classification Rule |
|-------------|------------------|--------------------------|
| Active Contact | First name, last name, company email, valid phone | Has @company domain + phone format (XXX) XXX-XXXX + created/modified within 24 months |
| Active Account | Company name, website, industry | Has valid URL + industry field populated + associated with 1+ active contact |
| Dead Opportunity | Close date, stage, amount | Close date > 12 months ago + stage = "Closed Lost" or "Stalled" |
| Vendor/Partner | Company name, relationship type | Tagged as "Vendor" or "Partner" OR domain matches known vendor list |
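Some of the auto-classification rules in the schema are simple enough to apply in code before any record reaches an AI model. The sketch below covers the contact and opportunity rules only. The field names (`email`, `phone`, `tags`, `last_modified`, `close_date`, `stage`), the free-mail list, and the vendor domain are all assumptions for illustration, and "24 months" is approximated as 730 days.

```python
import re
from datetime import date, timedelta

PHONE_RE = re.compile(r"^\(\d{3}\) \d{3}-\d{4}$")       # (XXX) XXX-XXXX
FREE_MAIL = {"gmail.com", "yahoo.com", "hotmail.com"}    # assumed non-company domains
KNOWN_VENDOR_DOMAINS = {"examplevendor.com"}             # hypothetical vendor list

def classify_contact(record, today):
    """Apply the Vendor/Partner and Active Contact rules to one contact."""
    email = (record.get("email") or "").lower()
    domain = email.rpartition("@")[2] if "@" in email else ""
    tags = set(record.get("tags") or [])
    if tags & {"Vendor", "Partner"} or domain in KNOWN_VENDOR_DOMAINS:
        return "Vendor/Partner"
    modified = record.get("last_modified")
    recent = modified is not None and today - modified <= timedelta(days=730)
    company_domain = bool(domain) and domain not in FREE_MAIL
    if company_domain and PHONE_RE.match(record.get("phone") or "") and recent:
        return "Active Contact"
    return "Needs Review"  # anything the rules cannot decide goes to AI or a human

def classify_opportunity(record, today):
    """Apply the Dead Opportunity rule to one opportunity."""
    close = record.get("close_date")
    stale = close is not None and today - close > timedelta(days=365)
    if stale and record.get("stage") in {"Closed Lost", "Stalled"}:
        return "Dead Opportunity"
    return "Needs Review"

# Example call with an explicit reference date
label = classify_contact(
    {"email": "jsmith@acmecorp.com", "phone": "(512) 555-0100",
     "last_modified": date(2024, 6, 1)},
    today=date(2025, 1, 1),
)
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the rules reproducible when you re-run the classification later.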
Use Claude Projects for bulk classification:
- Create a new Project in Claude.ai
- Upload your full CRM export (Claude handles files up to 100MB)
- Add this instruction to Project Knowledge: "You are a CRM data classifier. Apply the classification schema I provide to every record. Output a CSV with original record ID + assigned record type + confidence score (0-100)."
- Process in batches of 5,000 records
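Splitting the export into 5,000-record batches is straightforward to script. This is a small sketch, assuming the records are already loaded into a list; the helper name `batches` is illustrative.

```python
def batches(records, size=5000):
    """Yield consecutive slices of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(records), size):
        yield records[start:start + size]

# Example: 12,000 records become batches of 5,000, 5,000, and 2,000
batch_sizes = [len(b) for b in batches(list(range(12000)))]
```

Keeping the record ID in every batch lets you join the classified output back to the original export afterwards.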
For Salesforce users: use a Python script with the Salesforce API and OpenAI's batch API
Step 2: Deduplicate Using Fuzzy Matching and AI Judgment
Deduplication is where most cleanup projects fail. Firms use CRM
Contact deduplication process:
- Export all contacts to CSV
- Use Python with the fuzzywuzzy library or the dedupe library (both free, open-source)
- Set matching thresholds:
  - Email exact match = 100% duplicate
  - Phone match after stripping formatting = 95% duplicate
  - Name + company fuzzy match >85% = flag for AI review
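The three thresholds can be expressed as one scoring function. The sketch below substitutes `difflib.SequenceMatcher` from the standard library for fuzzywuzzy so it runs without extra installs; its similarity scores will differ somewhat from fuzzywuzzy's, so treat the 85 cutoff as a starting point to tune. The field names and the choice to compare the last ten digits of phone numbers (a US-centric assumption) are illustrative.

```python
import re
from difflib import SequenceMatcher

def digits(phone):
    """Strip formatting and keep the last 10 digits (US-style numbers assumed)."""
    return re.sub(r"\D", "", phone or "")[-10:]

def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_contacts(a, b):
    """Return (status, score) for two contact dicts."""
    email_a = (a.get("email") or "").lower()
    email_b = (b.get("email") or "").lower()
    if email_a and email_a == email_b:
        return ("duplicate", 100)
    if digits(a.get("phone")) and digits(a.get("phone")) == digits(b.get("phone")):
        return ("duplicate", 95)
    score = similarity(
        f"{a.get('name', '')} {a.get('company', '')}",
        f"{b.get('name', '')} {b.get('company', '')}",
    )
    if score > 85:
        return ("ai_review", round(score))  # borderline: hand to the AI review step
    return ("distinct", round(score))
```

Checking email and phone before the fuzzy comparison means the expensive, less reliable name matching only runs on pairs the exact rules could not settle.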
AI review prompt for borderline matches:
I have two CRM contact records that may be duplicates:
Record A:
Name: John Smith
Email: jsmith@acmecorp.com
Phone: (512) 555-0100
Title: Partner
Company: Acme Corporation
Record B:
Name: Jonathan Smith
Email: john.smith@acme-corp.com
Phone: 512-555-0100
Title: Managing Partner
Company: ACME Corp
Are these the same person? If yes, which record has more complete/accurate data? Respond with: DUPLICATE - Keep Record [A/B] OR NOT_DUPLICATE.
Run this through Claude's API
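Whichever API client you use, the model's reply should be parsed strictly rather than trusted as free text. The sketch below handles only the response format requested in the prompt above; the function name `parse_verdict` is illustrative, and anything that does not match is returned as `None` so it can be routed to manual review.

```python
import re

# Accepts "DUPLICATE - Keep Record A", "DUPLICATE - Keep Record [B]", or "NOT_DUPLICATE"
VERDICT_RE = re.compile(
    r"^\s*(?:DUPLICATE\s*-\s*Keep Record\s*\[?([AB])\]?|(NOT_DUPLICATE))\s*\.?\s*$",
    re.IGNORECASE,
)

def parse_verdict(reply):
    """Return ("DUPLICATE", "A"/"B"), ("NOT_DUPLICATE", None), or None if unparseable."""
    match = VERDICT_RE.match(reply)
    if match is None:
        return None
    if match.group(2):
        return ("NOT_DUPLICATE", None)
    return ("DUPLICATE", match.group(1).upper())
```

Rejecting anything outside the expected format matters here because a merge is hard to undo; an unparseable reply should never default to "duplicate".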
Account deduplication is harder:
Company names are a nightmare. "PricewaterhouseCoopers LLP", "PwC", "PricewaterhouseCoopers", and "PWC US" are all the same firm.
Use Clearbit's Company API
Merge strategy:
- Keep the record with the most recent activity date as the master
- Append notes/custom fields from duplicate records to the master record's activity history
- Reassign all related opportunities, cases, and activities to the master record
- Mark duplicates as "Merged - Do Not Use" before deleting (keep for 90 days as backup)
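The merge strategy above can be sketched as a pure function that decides what should change without touching the CRM. The record keys used here (`id`, `last_activity`, `notes`, `history`, `related_ids`) are hypothetical; map them to your own schema before applying the result through your CRM's API.

```python
from datetime import date

def merge_duplicates(records):
    """Pick the most recently active record as master and plan the merge.

    Returns (merged_master, retired), where retired lists the updates to
    apply to each duplicate instead of deleting it right away.
    """
    master = max(records, key=lambda r: r["last_activity"])
    merged = dict(master)
    merged["history"] = list(master.get("history", []))
    merged["related_ids"] = list(master.get("related_ids", []))
    retired = []
    for dup in records:
        if dup is master:
            continue
        if dup.get("notes"):
            # Append notes to the master's activity history
            merged["history"].append(f"[merged from {dup['id']}] {dup['notes']}")
        # Reassign related opportunities, cases, and activities
        merged["related_ids"].extend(dup.get("related_ids", []))
        retired.append(
            {"id": dup["id"], "status": "Merged - Do Not Use", "master_id": master["id"]}
        )
    return merged, retired
```

Because the function returns a plan instead of writing changes, you can review the output for a batch before anything in the CRM is modified.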
Step 3: Standardize Formatting and Field Values
Standardization is not about making data "pretty". It's about making it queryable and reportable.
Phone number standardization:
Use the phonenumbers Python library (Google's libphonenumber). It handles international formats, extensions, and validation.
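    import phonenumbers

    def standardize_phone(phone_string, default_region='US'):
        """Return the number in E.164 format, or None if it cannot be parsed."""
        try:
            parsed = phonenumbers.parse(phone_string, default_region)
        except phonenumbers.NumberParseException:
            # Unparseable input (empty string, letters only, etc.)
            return None
        return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)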
Output format: +15125550100 (E.164 international standard)
Company name standardization:
- Remove legal suffixes: LLC, Inc., Corp., Ltd., LLP (use regex)
- Expand common abbreviations: "Intl" → "International", "Mfg" → "Manufacturing"
- Use OpenAI's API to resolve ambiguous cases:
Standardize this company name to its official form: "PWC Advisory Svcs LLC"
Return only the standardized name, no explanation.
Expected output: "PricewaterhouseCoopers"
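The first two steps, suffix removal and abbreviation expansion, are deterministic and can run before any API call. This is a minimal sketch covering only the suffixes and abbreviations listed above; the function name `normalize_company` is illustrative, and names it cannot resolve (such as "PWC Advisory Svcs") still need the AI step.

```python
import re

# Trailing legal suffix, optionally preceded by a comma and followed by a period
LEGAL_SUFFIX_RE = re.compile(r"[,\s]+(?:LLC|Inc|Corp|Ltd|LLP)\.?$", re.IGNORECASE)
ABBREVIATIONS = {"intl": "International", "mfg": "Manufacturing"}

def normalize_company(name):
    """Collapse whitespace, strip trailing legal suffixes, expand abbreviations."""
    cleaned = " ".join(name.split())
    while True:
        # Repeat so stacked suffixes like "Corp. Ltd." are all removed
        stripped = LEGAL_SUFFIX_RE.sub("", cleaned)
        if stripped == cleaned:
            break
        cleaned = stripped
    words = [ABBREVIATIONS.get(w.rstrip(".").lower(), w) for w in cleaned.split()]
    return " ".join(words)
```

The suffix pattern requires a separator before the suffix, so a name such as "Acme Zinc" is left untouched even though it ends in "inc".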
Industry classification:
Most CRMs
Use this Claude prompt for bulk reclassification:
Reclassify these industry values into one of these 15 standard categories: [list your categories]
Input industries: [paste 50-100 at a time]
Output format: Original Industry | Standard Category
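The pipe-separated output requested above is easy to parse and check against your category list before writing anything back to the CRM. The sketch below is one way to do that; `parse_industry_mapping` is an illustrative name, and any value the model assigns to a category outside your list is collected separately for review.

```python
def parse_industry_mapping(reply, allowed):
    """Parse 'Original Industry | Standard Category' lines.

    Returns (mapping, rejected): mapping holds rows whose category is in
    `allowed`; rejected lists originals mapped to an unknown category.
    """
    mapping, rejected = {}, []
    for line in reply.splitlines():
        if "|" not in line:
            continue
        # Tolerate markdown-style rows with leading/trailing pipes
        original, _, category = (
            part.strip() for part in line.strip().strip("|").partition("|")
        )
        if not original.strip("-: ") or original.lower() == "original industry":
            continue  # skip table separator and header rows
        if category in allowed:
            mapping[original] = category
        else:
            rejected.append(original)
    return mapping, rejected
```

Validating against the allowed list catches the common failure where a model invents a plausible sixteenth category instead of choosing from your fifteen.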
Address standardization:
Use the Google Maps Geocoding API
This gives you clean, standardized addresses plus latitude/longitude for territory mapping.
Step 4: Enrich with External Data Sources
Enrichment is not optional. Your CRM
Contact enrichment sources:
- Apollo.io: 10 free credits/month, then $49/month for 1,000 credits. Returns email, phone, title, LinkedIn URL.
- Hunter.io: Email finder and verification. 25 free searches/month.
- LinkedIn Sales Navigator: Manual enrichment for high-value contacts. $99/month per seat.
Account enrichment sources:
- Clearbit Enrichment API: $99/month for 2,500 lookups. Returns employee count, revenue, tech stack, funding.
- ZoomInfo: Enterprise pricing (expect $15K-30K/year). Most complete B2B database.
- BuiltWith: Technology stack data. $295/month for API access.
Enrichment workflow:
- Export accounts missing key fields (employee count, revenue, industry)
- Run through Clearbit API first (cheapest, good coverage for US companies)
- For non-matches, try ZoomInfo or manual LinkedIn research
- Use Claude to extract data from company websites:
Visit this company website: [URL]
Extract and return in JSON format:
- Employee count (estimate if not stated)
- Primary industry
- Headquarters location
- Key products/services (max 3)
If information is not available, return null for that field.
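The JSON returned by this prompt should be validated before it is written to an account record. The prompt does not fix the key names, so the ones below (`employee_count`, `primary_industry`, `headquarters`, `key_products`) are assumptions; state them explicitly in your prompt if you adopt this sketch.

```python
import json

# Assumed key names; the prompt should be updated to request exactly these
FIELDS = ("employee_count", "primary_industry", "headquarters", "key_products")

def parse_enrichment(reply):
    """Return a dict with every expected field, or None if the reply is not a JSON object."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    result = {field: data.get(field) for field in FIELDS}
    count = result["employee_count"]
    if isinstance(count, bool) or not isinstance(count, int) or count < 0:
        result["employee_count"] = None  # reject strings like "about 50"
    products = result["key_products"]
    result["key_products"] = products[:3] if isinstance(products, list) else None
    return result
```

Missing or malformed fields come back as `None`, which matches the prompt's instruction to return null when information is unavailable and keeps guesses out of the CRM.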
Enrichment adds 15-20 hours of work but doubles the usability of your CRM
Step 5: Validate and Lock Down Data Quality
Cleanup is worthless if your team immediately re-corrupts the data.
Implement these controls:
- Required fields at creation: Make industry, phone, and company website mandatory for new accounts
- Picklist restrictions: Convert free-text fields to picklists wherever possible (industry, lead source, account type)
- Validation rules: Block saving a contact without a valid email format or phone number format
- Duplicate prevention: Enable Salesforce's native duplicate matching or HubSpot's duplicate management (post-cleanup, these tools work well)
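In Salesforce or HubSpot these controls are configured in the admin UI, but the same checks are useful in any script or integration that creates records. The sketch below assumes phones are stored in E.164 format (as produced in Step 3) and uses a deliberately loose email pattern; the field names and function names are illustrative.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # loose format check only
E164_RE = re.compile(r"^\+[1-9]\d{1,14}$")             # E.164: "+" then up to 15 digits
REQUIRED_ACCOUNT_FIELDS = ("industry", "phone", "website")

def contact_errors(contact):
    """Return a list of reasons a contact should be blocked from saving."""
    errors = []
    if not EMAIL_RE.match(contact.get("email") or ""):
        errors.append("invalid email format")
    if not E164_RE.match(contact.get("phone") or ""):
        errors.append("invalid phone format")
    return errors

def account_errors(account):
    """Return a 'missing <field>' entry for each empty required account field."""
    return [
        f"missing {field}"
        for field in REQUIRED_ACCOUNT_FIELDS
        if not (account.get(field) or "").strip()
    ]
```

An empty list means the record passes; returning every failure at once, rather than stopping at the first, gives the person entering data one complete list to fix.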
Create a data quality dashboard:
Track these metrics weekly:
- Percentage of contacts with valid email + phone
- Percentage of accounts with industry + employee count
- Number of duplicate records created (should be <5/week)
- Percentage of opportunities with next step + close date
Assign one person as data quality owner. This is a 2-4 hour/week role, not a full-time job.
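The weekly metrics can be computed from a plain export if your CRM's dashboard tooling is limited. This sketch checks only that fields are populated, on the assumption that format validation already happens at entry; all field names (`email`, `phone`, `industry`, `employee_count`, `next_step`, `close_date`) are illustrative.

```python
def pct(records, predicate):
    """Percentage of records satisfying predicate, rounded to one decimal."""
    if not records:
        return 0.0
    return round(100 * sum(1 for r in records if predicate(r)) / len(records), 1)

def quality_metrics(contacts, accounts, opportunities, duplicates_this_week):
    """Compute the four weekly data quality metrics."""
    return {
        "contacts_email_and_phone_pct": pct(
            contacts, lambda c: bool(c.get("email")) and bool(c.get("phone"))
        ),
        "accounts_industry_and_size_pct": pct(
            accounts, lambda a: bool(a.get("industry")) and a.get("employee_count") is not None
        ),
        "opps_next_step_and_close_pct": pct(
            opportunities, lambda o: bool(o.get("next_step")) and bool(o.get("close_date"))
        ),
        "duplicates_created": duplicates_this_week,
        "duplicates_within_target": duplicates_this_week < 5,  # target: <5/week
    }
```

Running this against the same export each week gives the data quality owner a trend line rather than a one-off snapshot.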
Tools and Costs Summary
Free/low-cost tools:
- Claude.ai Projects: $20/month for bulk classification and enrichment prompts
- Python libraries (fuzzywuzzy, phonenumbers, dedupe): Free
- Google Sheets + Apps Script: Free, good for small datasets (<10K records)
Paid tools worth the investment:
- Clearbit: $99/month (account enrichment)
- Apollo.io: $49/month (contact enrichment)
- Zapier or Make.com: $20-50/month (automation for ongoing data quality)
Total cost for a 20,000-record CRM
What Happens After Cleanup
You now have a CRM
- Every contact has a valid email and phone number
- Every account has industry, size, and location data
- Duplicate records are eliminated
- Field formatting is consistent and queryable
This is the foundation. Now you can build:
- Marketing automation that doesn't embarrass you
- Pipeline reports that match reality
- AI assistants that pull accurate data
- Territory assignments that make sense
Firms that skip cleanup spend 18 months fighting their CRM
Clean your data first. Build everything else second.

Reviewed by Revenue Institute
This guide is actively maintained and reviewed by the implementation experts at Revenue Institute. As the creators of The AI Workforce Playbook, we test and deploy these exact frameworks for professional services firms scaling without new headcount.