
PII Scrubbing Guide for AI Workflows

How to use AI to redact PII before sending data to tools that may train on it.


You cannot send client data to ChatGPT, Claude, or most AI tools without scrubbing PII first. Period. Most AI vendors explicitly state in their terms that they may use your inputs for model training. For law firms, accounting practices, and consulting shops handling confidential client information, this creates immediate compliance exposure under GDPR, CCPA, HIPAA, and attorney-client privilege rules.

This guide shows you how to build a PII scrubbing pipeline that runs before any data touches an AI system. You'll learn which tools actually work, how to configure them for professional services data, and how to validate that scrubbing worked.

What Counts as PII in Professional Services

Before you scrub anything, know what you're looking for. Professional services firms handle PII that goes beyond the obvious names and emails.

Client Identifiers:

  • Full legal names (individuals and entities)
  • Email addresses and phone numbers
  • Physical addresses
  • Tax IDs (SSN, EIN, VAT numbers)
  • Client matter numbers
  • Account numbers

Financial Data:

  • Bank account and routing numbers
  • Credit card numbers (full or partial)
  • Wire transfer details
  • Invoice amounts tied to specific clients
  • Salary and compensation figures

Legal and Health Information:

  • Case numbers and docket references
  • Medical record numbers
  • Insurance policy numbers
  • Biometric data (rare but present in some cases)

Digital Identifiers:

  • IP addresses
  • Device IDs
  • Session tokens
  • API keys embedded in logs

Make a spreadsheet. List every PII type your firm handles. Note which systems contain each type. This becomes your detection configuration map.
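That inventory can also live next to your code as a simple mapping, so your detection configuration stays in sync with it. A minimal sketch (the type names, systems, and patterns below are placeholders, not a complete inventory):

```python
import re

# Hypothetical detection map: PII type -> (systems that hold it, local regex or None)
PII_MAP = {
    "US_SOCIAL_SECURITY_NUMBER": (["billing", "HR"], r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL_ADDRESS": (["CRM", "email archive"], r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "MATTER_NUMBER": (["document management"], r"[A-Z]{2,4}-\d{4,6}"),
    "BANK_ACCOUNT": (["billing"], None),  # no reliable regex; lean on the DLP detector
}

def systems_holding(pii_type):
    """Which systems store a given PII type (drives the scope of the pipeline)."""
    return PII_MAP[pii_type][0]

def compiled_patterns():
    """Compile every regex in the map for quick local sweeps."""
    return {t: re.compile(p) for t, (_, p) in PII_MAP.items() if p}
```

Types without a usable regex still belong in the map: they document which detector you depend on for that type.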

Choose Your PII Detection Stack

You need two layers: automated detection and validation. Here are the tools that actually perform at production scale.

Google Cloud DLP API (best for multi-format data)

Handles 150+ PII types out of the box. Supports structured data (CSV, JSON), unstructured text, and images. Pricing: $1 per GB for inspection, $0.30 per GB for de-identification.

Configuration for law firms:

  • Enable custom info types for matter numbers (regex: [A-Z]{2,4}-\d{4,6})
  • Set likelihood threshold to "POSSIBLE" (not just "LIKELY") to catch edge cases
  • Use context-aware detection for names (reduces false positives on common words)
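It pays to sanity-check the matter-number regex against known-good and known-bad identifiers before promoting it to a custom info type. A quick local check with Python's `re` (the sample values are made up):

```python
import re

# The same pattern used for the matter-number custom info type:
# 2-4 uppercase letters, a hyphen, then 4-6 digits
MATTER_NUMBER = re.compile(r"[A-Z]{2,4}-\d{4,6}")

cases = {
    "LIT-2024": True,    # 3 letters, 4 digits
    "AB-123456": True,   # 2 letters, 6 digits
    "A-1234": False,     # only 1 letter
    "LIT-202": False,    # only 3 digits
}

for value, should_match in cases.items():
    assert bool(MATTER_NUMBER.fullmatch(value)) == should_match, value
```

If your firm's matter numbers deviate from this shape, adjust the quantifiers here first, then copy the pattern into the DLP template.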

Microsoft Presidio (best for on-premise deployments)

Open-source PII detection and anonymization. Runs locally, so no data leaves your infrastructure. Supports 20+ languages.

Use case: Firms with strict data residency requirements or those processing data in EU/UK jurisdictions where cloud transfer creates compliance friction.

AWS Comprehend PII (best for AWS-native workflows)

Detects PII in real-time with sub-second latency. Integrates directly with S3, Lambda, and Textract. Pricing: $0.0001 per unit (100 characters).

Limitation: Only supports English, Spanish, French, German, Italian, Portuguese, and Japanese. If you handle documents in other languages, use Google DLP or Presidio.

Nightfall AI (best for SaaS integrations)

Pre-built connectors for Slack, Google Drive, Salesforce, and email. Useful if you're scrubbing PII from collaboration tools before feeding conversation data to AI assistants.

Pricing starts at $500/month for 10 users. Expensive for small firms, but faster to deploy than building custom integrations.

Build Your Scrubbing Pipeline

Here's a production-ready workflow using Google Cloud DLP. Adapt the logic for other tools.

Step 1: Set Up Detection Templates

Create a DLP inspection template that defines what to find.

from google.cloud import dlp_v2

def create_inspection_template(project_id):
    dlp = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"
    
    # Define info types to detect
    info_types = [
        {"name": "PERSON_NAME"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "IBAN_CODE"},
        {"name": "IP_ADDRESS"},
    ]
    
    # Add custom detector for matter numbers
    custom_info_types = [
        {
            "info_type": {"name": "MATTER_NUMBER"},
            "regex": {"pattern": r"[A-Z]{2,4}-\d{4,6}"},
            "likelihood": dlp_v2.Likelihood.POSSIBLE,
        }
    ]
    
    inspect_config = {
        "info_types": info_types,
        "custom_info_types": custom_info_types,
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
        "include_quote": True,  # Return actual text found
    }
    
    template = {
        "inspect_config": inspect_config,
        "display_name": "Professional Services PII Template",
    }
    
    response = dlp.create_inspect_template(
        request={"parent": parent, "inspect_template": template}
    )
    
    return response.name

Step 2: Inspect and Redact in One Pass

Use de-identification transformations to replace PII with placeholders or hashed values.

def scrub_pii_from_text(project_id, text_content, template_name):
    dlp = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"
    
    # Define de-identification config
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "replace_with_info_type_config": {}  # Replace with [PERSON_NAME], [EMAIL_ADDRESS], etc.
                    }
                }
            ]
        }
    }
    
    # Construct the item to inspect
    item = {"value": text_content}
    
    # Call the API
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_template_name": template_name,
            "item": item,
        }
    )
    
    return response.item.value

Step 3: Process Files in Batch

For large document sets (discovery materials, email archives), use batch processing.

def batch_scrub_gcs_files(project_id, bucket_name, template_name):
    dlp = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"
    
    # Input: GCS bucket with the original files
    storage_config = {
        "cloud_storage_options": {
            "file_set": {"url": f"gs://{bucket_name}/*"}
        }
    }
    
    # Findings are written to a BigQuery table for auditing
    output_config = {
        "output_schema": dlp_v2.OutputStorageConfig.OutputSchema.ALL_COLUMNS,
        "table": {
            "project_id": project_id,
            "dataset_id": "scrubbed_data",
            "table_id": f"findings_{bucket_name}",
        },
    }
    
    # A storage job takes no inline deidentify_config. Instead, the
    # Deidentify action references a stored de-identify template (create it
    # once from the config in Step 2; the template name below is a
    # placeholder) and writes scrubbed copies to an output bucket.
    job_config = {
        "inspect_template_name": template_name,
        "storage_config": storage_config,
        "actions": [
            {"save_findings": {"output_config": output_config}},
            {
                "deidentify": {
                    "transformation_config": {
                        "deidentify_template": (
                            f"projects/{project_id}/locations/global/"
                            "deidentifyTemplates/professional-services-pii"
                        )
                    },
                    "cloud_storage_output": f"gs://{bucket_name}-scrubbed",
                }
            },
        ],
    }
    
    response = dlp.create_dlp_job(
        request={"parent": parent, "inspect_job": job_config}
    )
    
    return response.name

Step 4: Validate Scrubbing Results

Never trust automation alone. Run validation checks.

def validate_scrubbing(original_text, scrubbed_text, expected_pii_types, project_id=None):
    """
    Re-inspect scrubbed text and confirm no PII remains.
    original_text is accepted for audit logging; only scrubbed_text is inspected.
    Returns a list of validation failures (an empty list means the text passed).
    """
    failures = []
    
    if project_id is None:
        # Fall back to the project associated with ambient credentials
        import google.auth
        _, project_id = google.auth.default()
    
    # Re-inspect the scrubbed text
    dlp = dlp_v2.DlpServiceClient()
    inspect_config = {
        "info_types": [{"name": pii_type} for pii_type in expected_pii_types],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
        "include_quote": True,  # needed so finding.quote is populated below
    }
    
    item = {"value": scrubbed_text}
    response = dlp.inspect_content(
        request={
            "parent": f"projects/{project_id}/locations/global",
            "inspect_config": inspect_config,
            "item": item,
        }
    )
    
    # If any findings remain, scrubbing failed
    if response.result.findings:
        for finding in response.result.findings:
            failures.append({
                "type": finding.info_type.name,
                "quote": finding.quote,
                "likelihood": finding.likelihood.name,
            })
    
    return failures
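Re-inspecting with DLP is necessary but not sufficient: a detector can miss the same thing twice. A second, independent sweep with plain regexes over the scrubbed text is cheap insurance (a minimal sketch; extend the patterns to match your own PII inventory):

```python
import re

# Deliberately simple, independent patterns (not the DLP detectors)
RESIDUAL_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def regex_sweep(scrubbed_text):
    """Return {pii_type: [matches]} for anything the scrubber may have missed."""
    hits = {}
    for pii_type, pattern in RESIDUAL_PATTERNS.items():
        found = pattern.findall(scrubbed_text)
        if found:
            hits[pii_type] = found
    return hits
```

An empty dict means the sweep found nothing; anything else should block the document from leaving the pipeline.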

Step 5: Integrate with AI Workflow

Only send scrubbed data to AI tools. Here's a complete example for processing client emails before summarization.

def process_email_for_ai_summary(email_text, project_id, template_name):
    # Step 1: Scrub PII
    scrubbed_email = scrub_pii_from_text(project_id, email_text, template_name)
    
    # Step 2: Validate
    failures = validate_scrubbing(
        email_text, 
        scrubbed_email, 
        ["PERSON_NAME", "EMAIL_ADDRESS", "PHONE_NUMBER"]
    )
    
    if failures:
        raise ValueError(f"PII scrubbing failed: {failures}")
    
    # Step 3: Send to AI (example with the OpenAI Python SDK, v1.0+)
    from openai import OpenAI
    
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Summarize this email in 3 bullet points."},
            {"role": "user", "content": scrubbed_email}
        ]
    )
    
    return response.choices[0].message.content

Handle Edge Cases

Partial Redaction for Context Preservation

Sometimes you need to keep partial information for the AI to understand context. Use character masking instead of full replacement.

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "EMAIL_ADDRESS"}],
                "primitive_transformation": {
                    "character_mask_config": {
                        "masking_character": "*",
                        "number_to_mask": 0,  # Mask all characters
                        "reverse_order": False,
                        "characters_to_ignore": [
                            {"characters_to_skip": "@."}  # Leave "@" and "." unmasked
                        ],
                    }
                },
            }
        ]
    }
}

Result: john.doe@lawfirm.com becomes ****.***@*******.*** (the @ and periods survive, so the text still reads as an email address even though every letter is masked).
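A small stdlib mirror of that mask is useful in unit tests, so you can assert what the DLP config should produce without calling the API (a simplified sketch of character masking with "@" and "." skipped, not the DLP implementation):

```python
def mask_like_dlp(text, mask_char="*", chars_to_skip="@."):
    """Mask every character except those in chars_to_skip, mirroring
    character_mask_config with number_to_mask unset (mask all)."""
    return "".join(c if c in chars_to_skip else mask_char for c in text)
```

Running it on the example above yields ****.***@*******.***, which is the output your DLP transformation tests should expect.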

Pseudonymization for Consistent References

If the AI needs to track the same person across multiple documents, use crypto-based pseudonymization.

# `wrapped_key_bytes` is the raw output of wrapping a data key with
# Cloud KMS (see the DLP docs on KMS-wrapped crypto keys)
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "PERSON_NAME"}],
                "primitive_transformation": {
                    "crypto_hash_config": {
                        "crypto_key": {
                            "kms_wrapped": {
                                "wrapped_key": wrapped_key_bytes,
                                "crypto_key_name": f"projects/{project_id}/locations/global/keyRings/dlp/cryptoKeys/pii",
                            }
                        }
                    }
                },
            }
        ]
    }
}

Result: "John Doe" is always replaced by the same hash value, so the AI can track repeated references without knowing the real name. (DLP emits a base64-encoded hash; map it to a friendlier surrogate such as CLIENT_a3f8b9c2 in post-processing if you want readable tokens.)
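The keyed-hash behavior is easy to demonstrate locally with stdlib HMAC: the same name and key always produce the same surrogate, and different names diverge. A simplified sketch (DLP performs this server-side with the KMS-held key, and the CLIENT_ prefix here is a local readability convention, not DLP output):

```python
import hashlib
import hmac

def pseudonymize(name: str, key: bytes) -> str:
    """Deterministic surrogate: the same (name, key) always yields the same token."""
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"CLIENT_{digest[:8]}"  # CLIENT_ prefix is a local convention
```

Because the mapping is keyed, an attacker without the key cannot simply hash candidate names and match them against the tokens, which is the weakness of a bare unsalted hash.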

Monitor and Audit

Set up logging to track every scrubbing operation. You need this for compliance audits.

from datetime import datetime

from google.cloud import logging as cloud_logging

def log_scrubbing_operation(original_hash, scrubbed_hash, pii_types_found):
    client = cloud_logging.Client()
    logger = client.logger("pii-scrubbing")
    
    logger.log_struct({
        "operation": "pii_scrubbing",
        "original_data_hash": original_hash,
        "scrubbed_data_hash": scrubbed_hash,
        "pii_types_detected": pii_types_found,
        "timestamp": datetime.utcnow().isoformat(),
    })

Create a dashboard that shows:

  • Total documents processed
  • PII types detected (frequency distribution)
  • Validation failures
  • Processing time per document

Review this monthly. If you see new PII types appearing frequently, update your detection templates.
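The frequency distribution on that dashboard can be computed straight from the structured log records (a minimal sketch over hypothetical entries shaped like the log_struct payload above):

```python
from collections import Counter

def pii_frequency(log_entries):
    """Tally how often each PII type appears across scrubbing-log entries."""
    counts = Counter()
    for entry in log_entries:
        counts.update(entry.get("pii_types_detected", []))
    return counts
```

A type that climbs in this tally month over month is a candidate for a new custom info type in your detection template.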

Cost and Performance Benchmarks

Based on real-world professional services deployments:

Google Cloud DLP:

  • 1,000 emails (avg 2KB each, ~2MB total): roughly $0.0026 at the combined $1.30/GB list rate, effectively negligible at this volume
  • Processing time: 0.3 seconds per email
  • Monthly cost for 50,000 emails (~100MB): about $0.13 at list rates

AWS Comprehend:

  • 1,000 emails (avg 2KB each): $2 (2MB total at $0.0001 per 100 chars)
  • Processing time: 0.1 seconds per email
  • Monthly cost for 50,000 emails: $100

Self-hosted Presidio:

  • Infrastructure: $200/month (2 vCPU, 8GB RAM instance)
  • Processing time: 0.5 seconds per email
  • No per-document fees
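The Comprehend math above reduces to a short estimator (a sketch using the list price quoted in this guide and Comprehend's 3-unit minimum charge per request; re-check current pricing before budgeting):

```python
import math

PRICE_PER_UNIT = 0.0001  # USD per 100-character unit (list price quoted above)

def comprehend_pii_cost(num_docs: int, avg_chars: int) -> float:
    """Estimated AWS Comprehend PII cost: billed per 100-character unit,
    with a 3-unit minimum per request."""
    units_per_doc = max(3, math.ceil(avg_chars / 100))
    return num_docs * units_per_doc * PRICE_PER_UNIT
```

For 1,000 emails at roughly 2,000 characters each this reproduces the $2 figure above, and 50,000 emails land at $100.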

For firms processing under 10,000 documents monthly, use cloud APIs. For higher volumes or strict data residency needs, self-host Presidio.

Pre-Flight Checklist

Before you deploy PII scrubbing to production:

  1. Test with 100 real client documents. Manually review every scrubbed output.
  2. Confirm your validation step catches at least 95% of residual PII (run it on intentionally under-scrubbed test data).
  3. Document which AI tools receive scrubbed data and verify their data retention policies.
  4. Add scrubbing logs to your firm's compliance monitoring dashboard.
  5. Train staff to never bypass the scrubbing pipeline, even for "quick tests."

PII scrubbing is not optional. It's the technical control that makes AI usable in professional services. Build it once, validate it thoroughly, and enforce it everywhere.


Reviewed by Revenue Institute

This guide is actively maintained and reviewed by the implementation experts at Revenue Institute. As the creators of The AI Workforce Playbook, we test and deploy these exact frameworks for professional services firms scaling without new headcount.


Need help turning this guide into reality? Revenue Institute builds and implements the AI workforce for professional services firms.

RevenueInstitute.com