
PII Scrubbing Guide for AI Workflows

How to use AI to redact PII before sending data to tools that may train on it.


You cannot send client data to ChatGPT, Claude, or most AI tools without scrubbing PII first. Period. Most AI vendors explicitly state in their terms that they may use your inputs for model training. For law firms, accounting practices, and consulting shops handling confidential client information, this creates immediate compliance exposure under GDPR, CCPA, HIPAA, and attorney-client privilege rules.

This guide shows you how to build a PII scrubbing pipeline that runs before any data touches an AI system. You'll learn which tools actually work, how to configure them for professional services data, and how to validate that scrubbing worked.

What Counts as PII in Professional Services

Before you scrub anything, know what you're looking for. Professional services firms handle PII that goes beyond the obvious names and emails.

Client Identifiers:

  • Full legal names (individuals and entities)
  • Email addresses and phone numbers
  • Physical addresses
  • Tax IDs (SSN, EIN, VAT numbers)
  • Client matter numbers
  • Account numbers

Financial Data:

  • Bank account and routing numbers
  • Credit card numbers (full or partial)
  • Wire transfer details
  • Invoice amounts tied to specific clients
  • Salary and compensation figures

Legal and Health Information:

  • Case numbers and docket references
  • Medical record numbers
  • Insurance policy numbers
  • Biometric data (rare but present in some cases)

Digital Identifiers:

  • IP addresses
  • Device IDs
  • Session tokens
  • API keys embedded in logs

Make a spreadsheet. List every PII type your firm handles. Note which systems contain each type. This becomes your detection configuration map.
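That inventory can also live next to your code as a simple mapping, so your detection configuration stays in sync with it. A minimal sketch (the type names, systems, and patterns below are placeholders, not a complete inventory):

```python
import re

# Hypothetical detection map: PII type -> (systems that hold it, local regex or None)
PII_MAP = {
    "US_SOCIAL_SECURITY_NUMBER": (["billing", "HR"], r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL_ADDRESS": (["CRM", "email archive"], r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "MATTER_NUMBER": (["document management"], r"[A-Z]{2,4}-\d{4,6}"),
    "BANK_ACCOUNT": (["billing"], None),  # no reliable regex; lean on the DLP detector
}

def systems_holding(pii_type):
    """Which systems store a given PII type (drives the scope of the pipeline)."""
    return PII_MAP[pii_type][0]

def compiled_patterns():
    """Compile every regex in the map for quick local sweeps."""
    return {t: re.compile(p) for t, (_, p) in PII_MAP.items() if p}
```

Types without a usable regex still belong in the map: they document which detector you depend on for that type.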

Choose Your PII Detection Stack

You need two layers: automated detection and validation. Here are the tools that actually perform at production scale.

Google Cloud DLP API (best for multi-format data)

Handles 150+ PII types out of the box. Supports structured data (CSV, JSON), unstructured text, and images. Pricing: $1 per GB for inspection, $0.30 per GB for de-identification.

Configuration for law firms:

  • Enable custom info types for matter numbers (regex: [A-Z]{2,4}-\d{4,6})
  • Set likelihood threshold to "POSSIBLE" (not just "LIKELY") to catch edge cases
  • Use context-aware detection for names (reduces false positives on common words)
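It pays to sanity-check the matter-number regex against known-good and known-bad identifiers before promoting it to a custom info type. A quick local check with Python's `re` (the sample values are made up):

```python
import re

# The same pattern used for the matter-number custom info type:
# 2-4 uppercase letters, a hyphen, then 4-6 digits
MATTER_NUMBER = re.compile(r"[A-Z]{2,4}-\d{4,6}")

cases = {
    "LIT-2024": True,    # 3 letters, 4 digits
    "AB-123456": True,   # 2 letters, 6 digits
    "A-1234": False,     # only 1 letter
    "LIT-202": False,    # only 3 digits
}

for value, should_match in cases.items():
    assert bool(MATTER_NUMBER.fullmatch(value)) == should_match, value
```

If your firm's matter numbers deviate from this shape, adjust the quantifiers here first, then copy the pattern into the DLP template.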

Microsoft Presidio (best for on-premise deployments)

Open-source PII detection and anonymization. Runs locally, so no data leaves your infrastructure. Supports 20+ languages.

Use case: Firms with strict data residency requirements or those processing data in EU/UK jurisdictions where cloud transfer creates compliance friction.

AWS Comprehend PII (best for AWS-native workflows)

Detects PII in real-time with sub-second latency. Integrates directly with S3, Lambda, and Textract. Pricing: $0.0001 per unit (100 characters).

Limitation: Only supports English, Spanish, French, German, Italian, Portuguese, and Japanese. If you handle documents in other languages, use Google DLP or Presidio.

Nightfall AI (best for SaaS integrations)

Pre-built connectors for Slack, Google Drive, Salesforce, and email. Useful if you're scrubbing PII from collaboration tools before feeding conversation data to AI assistants.

Pricing starts at $500/month for 10 users. Expensive for small firms, but faster to deploy than building custom integrations.

Build Your Scrubbing Pipeline

Here's a production-ready workflow using Google Cloud DLP. Adapt the logic for other tools.

Step 1: Set Up Detection Templates

Create a DLP inspection template that defines what to find.

from google.cloud import dlp_v2

def create_inspection_template(project_id):
    dlp = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"
    
    # Define info types to detect
    info_types = [
        {"name": "PERSON_NAME"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "PHONE_NUMBER"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "IBAN_CODE"},
        {"name": "IP_ADDRESS"},
    ]
    
    # Add custom detector for matter numbers
    custom_info_types = [
        {
            "info_type": {"name": "MATTER_NUMBER"},
            "regex": {"pattern": r"[A-Z]{2,4}-\d{4,6}"},
            "likelihood": dlp_v2.Likelihood.POSSIBLE,
        }
    ]
    
    inspect_config = {
        "info_types": info_types,
        "custom_info_types": custom_info_types,
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
        "include_quote": True,  # Return actual text found
    }
    
    template = {
        "inspect_config": inspect_config,
        "display_name": "Professional Services PII Template",
    }
    
    response = dlp.create_inspect_template(
        request={"parent": parent, "inspect_template": template}
    )
    
    return response.name

Step 2: Inspect and Redact in One Pass

Use de-identification transformations to replace PII with placeholders or hashed values.

def scrub_pii_from_text(project_id, text_content, template_name):
    dlp = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"
    
    # Define de-identification config
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "replace_with_info_type_config": {}  # Replace with [PERSON_NAME], [EMAIL_ADDRESS], etc.
                    }
                }
            ]
        }
    }
    
    # Construct the item to inspect
    item = {"value": text_content}
    
    # Call the API
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_template_name": template_name,
            "item": item,
        }
    )
    
    return response.item.value

Step 3: Process Files in Batch

For large document sets (discovery materials, email archives), use batch processing.

def batch_scrub_gcs_files(project_id, bucket_name, template_name):
    dlp = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"
    
    # Input: GCS bucket with the original files
    storage_config = {
        "cloud_storage_options": {
            "file_set": {"url": f"gs://{bucket_name}/*"}
        }
    }
    
    # Findings are written to a BigQuery table for auditing
    output_config = {
        "output_schema": dlp_v2.OutputStorageConfig.OutputSchema.ALL_COLUMNS,
        "table": {
            "project_id": project_id,
            "dataset_id": "scrubbed_data",
            "table_id": f"findings_{bucket_name}",
        },
    }
    
    # A storage job takes no inline deidentify_config. Instead, the
    # Deidentify action references a stored de-identify template (create it
    # once from the config in Step 2; the template name below is a
    # placeholder) and writes scrubbed copies to an output bucket.
    job_config = {
        "inspect_template_name": template_name,
        "storage_config": storage_config,
        "actions": [
            {"save_findings": {"output_config": output_config}},
            {
                "deidentify": {
                    "transformation_config": {
                        "deidentify_template": (
                            f"projects/{project_id}/locations/global/"
                            "deidentifyTemplates/professional-services-pii"
                        )
                    },
                    "cloud_storage_output": f"gs://{bucket_name}-scrubbed",
                }
            },
        ],
    }
    
    response = dlp.create_dlp_job(
        request={"parent": parent, "inspect_job": job_config}
    )
    
    return response.name

Step 4: Validate Scrubbing Results

Never trust automation alone. Run validation checks.

def validate_scrubbing(original_text, scrubbed_text, expected_pii_types, project_id=None):
    """
    Re-inspect scrubbed text and confirm no PII remains.
    original_text is accepted for audit logging; only scrubbed_text is inspected.
    Returns a list of validation failures (an empty list means the text passed).
    """
    failures = []
    
    if project_id is None:
        # Fall back to the project associated with ambient credentials
        import google.auth
        _, project_id = google.auth.default()
    
    # Re-inspect the scrubbed text
    dlp = dlp_v2.DlpServiceClient()
    inspect_config = {
        "info_types": [{"name": pii_type} for pii_type in expected_pii_types],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
        "include_quote": True,  # needed so finding.quote is populated below
    }
    
    item = {"value": scrubbed_text}
    response = dlp.inspect_content(
        request={
            "parent": f"projects/{project_id}/locations/global",
            "inspect_config": inspect_config,
            "item": item,
        }
    )
    
    # If any findings remain, scrubbing failed
    if response.result.findings:
        for finding in response.result.findings:
            failures.append({
                "type": finding.info_type.name,
                "quote": finding.quote,
                "likelihood": finding.likelihood.name,
            })
    
    return failures
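Re-inspecting with DLP is necessary but not sufficient: a detector can miss the same thing twice. A second, independent sweep with plain regexes over the scrubbed text is cheap insurance (a minimal sketch; extend the patterns to match your own PII inventory):

```python
import re

# Deliberately simple, independent patterns (not the DLP detectors)
RESIDUAL_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def regex_sweep(scrubbed_text):
    """Return {pii_type: [matches]} for anything the scrubber may have missed."""
    hits = {}
    for pii_type, pattern in RESIDUAL_PATTERNS.items():
        found = pattern.findall(scrubbed_text)
        if found:
            hits[pii_type] = found
    return hits
```

An empty dict means the sweep found nothing; anything else should block the document from leaving the pipeline.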

Step 5: Integrate with AI Workflow

Only send scrubbed data to AI tools. Here's a complete example for processing client emails before summarization.

def process_email_for_ai_summary(email_text, project_id, template_name):
    # Step 1: Scrub PII
    scrubbed_email = scrub_pii_from_text(project_id, email_text, template_name)
    
    # Step 2: Validate
    failures = validate_scrubbing(
        email_text, 
        scrubbed_email, 
        ["PERSON_NAME", "EMAIL_ADDRESS", "PHONE_NUMBER"]
    )
    
    if failures:
        raise ValueError(f"PII scrubbing failed: {failures}")
    
    # Step 3: Send to AI (example with the OpenAI Python SDK, v1.0+)
    from openai import OpenAI
    
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Summarize this email in 3 bullet points."},
            {"role": "user", "content": scrubbed_email}
        ]
    )
    
    return response.choices[0].message.content

Handle Edge Cases

Partial Redaction for Context Preservation

Sometimes you need to keep partial information for the AI to understand context. Use character masking instead of full replacement.

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "EMAIL_ADDRESS"}],
                "primitive_transformation": {
                    "character_mask_config": {
                        "masking_character": "*",
                        "number_to_mask": 0,  # Mask all characters
                        "reverse_order": False,
                        "characters_to_ignore": [
                            {"characters_to_skip": "@."}  # Leave "@" and "." unmasked
                        ],
                    }
                },
            }
        ]
    }
}

Result: john.doe@lawfirm.com becomes ****.***@*******.*** (the @ and periods survive, so the text still reads as an email address even though every letter is masked).
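A small stdlib mirror of that mask is useful in unit tests, so you can assert what the DLP config should produce without calling the API (a simplified sketch of character masking with "@" and "." skipped, not the DLP implementation):

```python
def mask_like_dlp(text, mask_char="*", chars_to_skip="@."):
    """Mask every character except those in chars_to_skip, mirroring
    character_mask_config with number_to_mask unset (mask all)."""
    return "".join(c if c in chars_to_skip else mask_char for c in text)
```

Running it on the example above yields ****.***@*******.***, which is the output your DLP transformation tests should expect.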

Pseudonymization for Consistent References

If the AI needs to track the same person across multiple documents, use crypto-based pseudonymization.

# `wrapped_key_bytes` is the raw output of wrapping a data key with
# Cloud KMS (see the DLP docs on KMS-wrapped crypto keys)
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "PERSON_NAME"}],
                "primitive_transformation": {
                    "crypto_hash_config": {
                        "crypto_key": {
                            "kms_wrapped": {
                                "wrapped_key": wrapped_key_bytes,
                                "crypto_key_name": f"projects/{project_id}/locations/global/keyRings/dlp/cryptoKeys/pii",
                            }
                        }
                    }
                },
            }
        ]
    }
}

Result: "John Doe" is always replaced by the same hash value, so the AI can track repeated references without knowing the real name. (DLP emits a base64-encoded hash; map it to a friendlier surrogate such as CLIENT_a3f8b9c2 in post-processing if you want readable tokens.)
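The keyed-hash behavior is easy to demonstrate locally with stdlib HMAC: the same name and key always produce the same surrogate, and different names diverge. A simplified sketch (DLP performs this server-side with the KMS-held key, and the CLIENT_ prefix here is a local readability convention, not DLP output):

```python
import hashlib
import hmac

def pseudonymize(name: str, key: bytes) -> str:
    """Deterministic surrogate: the same (name, key) always yields the same token."""
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"CLIENT_{digest[:8]}"  # CLIENT_ prefix is a local convention
```

Because the mapping is keyed, an attacker without the key cannot simply hash candidate names and match them against the tokens, which is the weakness of a bare unsalted hash.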

Monitor and Audit

Set up logging to track every scrubbing operation. You need this for compliance audits.

from datetime import datetime

from google.cloud import logging as cloud_logging

def log_scrubbing_operation(original_hash, scrubbed_hash, pii_types_found):
    client = cloud_logging.Client()
    logger = client.logger("pii-scrubbing")
    
    logger.log_struct({
        "operation": "pii_scrubbing",
        "original_data_hash": original_hash,
        "scrubbed_data_hash": scrubbed_hash,
        "pii_types_detected": pii_types_found,
        "timestamp": datetime.utcnow().isoformat(),
    })

Create a dashboard that shows:

  • Total documents processed
  • PII types detected (frequency distribution)
  • Validation failures
  • Processing time per document

Review this monthly. If you see new PII types appearing frequently, update your detection templates.
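The frequency distribution on that dashboard can be computed straight from the structured log records (a minimal sketch over hypothetical entries shaped like the log_struct payload above):

```python
from collections import Counter

def pii_frequency(log_entries):
    """Tally how often each PII type appears across scrubbing-log entries."""
    counts = Counter()
    for entry in log_entries:
        counts.update(entry.get("pii_types_detected", []))
    return counts
```

A type that climbs in this tally month over month is a candidate for a new custom info type in your detection template.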

Cost and Performance Benchmarks

Based on real-world professional services deployments:

Google Cloud DLP:

  • 1,000 emails (avg 2KB each, ~2MB total): roughly $0.0026 at the combined $1.30/GB list rate, effectively negligible at this volume
  • Processing time: 0.3 seconds per email
  • Monthly cost for 50,000 emails (~100MB): about $0.13 at list rates

AWS Comprehend:

  • 1,000 emails (avg 2KB each): $2 (2MB total at $0.0001 per 100 chars)
  • Processing time: 0.1 seconds per email
  • Monthly cost for 50,000 emails: $100

Self-hosted Presidio:

  • Infrastructure: $200/month (2 vCPU, 8GB RAM instance)
  • Processing time: 0.5 seconds per email
  • No per-document fees
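The Comprehend math above reduces to a short estimator (a sketch using the list price quoted in this guide and Comprehend's 3-unit minimum charge per request; re-check current pricing before budgeting):

```python
import math

PRICE_PER_UNIT = 0.0001  # USD per 100-character unit (list price quoted above)

def comprehend_pii_cost(num_docs: int, avg_chars: int) -> float:
    """Estimated AWS Comprehend PII cost: billed per 100-character unit,
    with a 3-unit minimum per request."""
    units_per_doc = max(3, math.ceil(avg_chars / 100))
    return num_docs * units_per_doc * PRICE_PER_UNIT
```

For 1,000 emails at roughly 2,000 characters each this reproduces the $2 figure above, and 50,000 emails land at $100.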

For firms processing under 10,000 documents monthly, use cloud APIs. For higher volumes or strict data residency needs, self-host Presidio.

Pre-Flight Checklist

Before you deploy PII scrubbing to production:

  1. Test with 100 real client documents. Manually review every scrubbed output.
  2. Confirm your validation step catches at least 95% of residual PII (run it on intentionally under-scrubbed test data).
  3. Document which AI tools receive scrubbed data and verify their data retention policies.
  4. Add scrubbing logs to your firm's compliance monitoring dashboard.
  5. Train staff to never bypass the scrubbing pipeline, even for "quick tests."

PII scrubbing is not optional. It's the technical control that makes AI usable in professional services. Build it once, validate it thoroughly, and enforce it everywhere.


Reviewed by Revenue Institute

This guide is actively maintained and reviewed by the implementation experts at Revenue Institute. As the creators of The AI Workforce Playbook, we test and deploy these exact frameworks for professional services firms scaling without new headcount.


Need help turning this guide into reality? Revenue Institute builds and implements the AI workforce for professional services firms.

RevenueInstitute.com