Play 11: Knowledge Base Q&A

Neo4j Knowledge Graph Setup Guide (Optional Advanced)

For firms wanting relational understanding on top of vector search.


Vector search finds similar documents. Knowledge graphs answer "who worked with whom on what" and "which clients have overlapping needs." If your firm needs to surface relationships between clients, matters, expertise areas, and precedents, Neo4j adds a relational layer that vector embeddings can't provide.

This guide walks you through deploying Neo4j, modeling your firm's data as a graph, and querying it alongside your vector store. You'll build a working knowledge graph in 2-3 hours.

When You Actually Need This

Skip this if you're just building document Q&A. Add Neo4j when you need to answer:

  • "Which partners have worked on SEC compliance matters for fintech clients in the last 18 months?"
  • "Show me all engagements where Sarah Chen and Michael Torres collaborated."
  • "Which clients share the same industry, revenue band, and regulatory challenges?"

If your queries are purely content-based ("What does our M&A playbook say about due diligence?"), stick with vector search alone.

Prerequisites

Neo4j Instance
Use Neo4j Aura (managed cloud) for production. The free tier supports 200k nodes and 400k relationships. Sign up at console.neo4j.io. For local testing, run docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/your-password neo4j:5.15.0 (the NEO4J_AUTH variable sets the initial credentials; without it, the container forces a password change on first login).

Python Environment
Install the Neo4j driver: pip install neo4j pandas. You'll use Python to load data and run queries.

Data Sources
Export CSVs from your practice management system (Clio, PracticePanther) or CRM (Salesforce, HubSpot). You need: client records, matter/engagement data, timekeeper assignments, and practice area tags.

Cypher Basics
Neo4j's query language. You'll learn enough in this guide, but skim Neo4j's official Cypher cheat sheet first.

Step 1: Design Your Graph Schema

Map your firm's data to nodes (entities) and relationships (connections). Start simple. You can always add complexity later.

Core Node Types

  1. Client
    Properties: clientId, name, industry, revenue, location, riskProfile

  2. Matter
    Properties: matterId, name, practiceArea, startDate, endDate, status, billedAmount

  3. Timekeeper
    Properties: timekeeperId, name, title, office, barAdmissions[], practiceAreas[]

  4. Document
    Properties: docId, title, docType, createdDate, vectorId (links to your Pinecone/Weaviate record)

Core Relationship Types

  • (Client)-[:RETAINED_FOR]->(Matter)
  • (Timekeeper)-[:WORKED_ON {hours: 45.5, role: "Lead Counsel"}]->(Matter)
  • (Timekeeper)-[:SPECIALIZES_IN]->(PracticeArea)
  • (Matter)-[:PRODUCED]->(Document)
  • (Client)-[:REFERRED_BY]->(Client)

Draw this on paper or use arrows.app before writing any code.
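Once the design settles, it can also help to capture the schema in code: one source of truth that doubles as the plain-text schema description you'll feed to an LLM for Cypher generation in Step 6. A minimal sketch using the node and relationship names from this guide:

```python
# Schema captured as data: node labels with their properties, and
# relationship triples (start label, type, end label).
NODES = {
    "Client": ["clientId", "name", "industry", "revenue", "location", "riskProfile"],
    "Matter": ["matterId", "name", "practiceArea", "startDate", "endDate", "status", "billedAmount"],
    "Timekeeper": ["timekeeperId", "name", "title", "office", "barAdmissions", "practiceAreas"],
    "Document": ["docId", "title", "docType", "createdDate", "vectorId"],
}

RELATIONSHIPS = [
    ("Client", "RETAINED_FOR", "Matter"),
    ("Timekeeper", "WORKED_ON", "Matter"),
    ("Timekeeper", "SPECIALIZES_IN", "PracticeArea"),
    ("Matter", "PRODUCED", "Document"),
    ("Client", "REFERRED_BY", "Client"),
]

def render_schema() -> str:
    """Render the schema as plain text suitable for an LLM prompt."""
    lines = ["Nodes:"]
    for label, props in NODES.items():
        lines.append(f"  {label}: {', '.join(props)}")
    lines.append("Relationships:")
    for start, rel, end in RELATIONSHIPS:
        lines.append(f"  ({start})-[:{rel}]->({end})")
    return "\n".join(lines)
```

Keeping the schema in one dict means the LLM prompt never drifts out of sync with what you actually loaded.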

Step 2: Set Up Neo4j Connection

Create neo4j_setup.py:

from neo4j import GraphDatabase
import os

class Neo4jConnection:
    def __init__(self):
        uri = os.getenv("NEO4J_URI", "neo4j+s://xxxxx.databases.neo4j.io")
        user = os.getenv("NEO4J_USER", "neo4j")
        password = os.getenv("NEO4J_PASSWORD")
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
    
    def close(self):
        self.driver.close()
    
    def run_query(self, query, parameters=None):
        with self.driver.session() as session:
            result = session.run(query, parameters)
            return [record.data() for record in result]

# Test connection
conn = Neo4jConnection()
result = conn.run_query("RETURN 'Connection successful' AS message")
print(result)
conn.close()

Set environment variables in .env:

NEO4J_URI=neo4j+s://your-instance.databases.neo4j.io
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password

Run python neo4j_setup.py. You should see [{'message': 'Connection successful'}].

Step 3: Create Constraints and Indexes

Constraints prevent duplicate nodes. Indexes speed up lookups. Run these in Neo4j Browser (browser tab at your Aura instance URL) or via Python:

CREATE CONSTRAINT client_id IF NOT EXISTS
FOR (c:Client) REQUIRE c.clientId IS UNIQUE;

CREATE CONSTRAINT matter_id IF NOT EXISTS
FOR (m:Matter) REQUIRE m.matterId IS UNIQUE;

CREATE CONSTRAINT timekeeper_id IF NOT EXISTS
FOR (t:Timekeeper) REQUIRE t.timekeeperId IS UNIQUE;

CREATE CONSTRAINT document_id IF NOT EXISTS
FOR (d:Document) REQUIRE d.docId IS UNIQUE;

CREATE INDEX matter_practice_area IF NOT EXISTS
FOR (m:Matter) ON (m.practiceArea);

CREATE INDEX timekeeper_name IF NOT EXISTS
FOR (t:Timekeeper) ON (t.name);

These take 5-10 seconds to create. Verify with SHOW CONSTRAINTS and SHOW INDEXES.
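If you prefer to keep schema setup in version control rather than pasted into the Browser, the same statements can be applied from Python. A sketch, assuming any connection object with a run_query method like the Neo4jConnection class from Step 2; because every statement uses IF NOT EXISTS, the whole list is safe to re-run:

```python
# Idempotent schema setup: re-running is a no-op for existing
# constraints and indexes thanks to IF NOT EXISTS.
SCHEMA_STATEMENTS = [
    "CREATE CONSTRAINT client_id IF NOT EXISTS "
    "FOR (c:Client) REQUIRE c.clientId IS UNIQUE",
    "CREATE CONSTRAINT matter_id IF NOT EXISTS "
    "FOR (m:Matter) REQUIRE m.matterId IS UNIQUE",
    "CREATE CONSTRAINT timekeeper_id IF NOT EXISTS "
    "FOR (t:Timekeeper) REQUIRE t.timekeeperId IS UNIQUE",
    "CREATE CONSTRAINT document_id IF NOT EXISTS "
    "FOR (d:Document) REQUIRE d.docId IS UNIQUE",
    "CREATE INDEX matter_practice_area IF NOT EXISTS "
    "FOR (m:Matter) ON (m.practiceArea)",
    "CREATE INDEX timekeeper_name IF NOT EXISTS "
    "FOR (t:Timekeeper) ON (t.name)",
]

def apply_schema(conn):
    """Run every schema statement; safe to call repeatedly."""
    for statement in SCHEMA_STATEMENTS:
        conn.run_query(statement)
    return len(SCHEMA_STATEMENTS)
```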

Step 4: Load Data from CSVs

Export your data to CSVs. Place them in a data/ folder. Example structure:

clients.csv

clientId,name,industry,revenue,location
C001,Acme Manufacturing,Manufacturing,50000000,San Francisco
C002,TechStart Inc,Technology,5000000,Austin

matters.csv

matterId,clientId,name,practiceArea,startDate,endDate,billedAmount
M001,C001,Supply Chain Dispute,Litigation,2023-01-15,2023-09-30,125000
M002,C002,Series A Financing,Corporate,2023-03-01,2023-04-15,45000

timekeepers.csv

timekeeperId,name,title,office,practiceAreas
T001,Sarah Chen,Partner,San Francisco,Litigation;Employment
T002,Michael Torres,Senior Associate,Austin,Corporate;Securities

matter_assignments.csv

matterId,timekeeperId,hours,role
M001,T001,87.5,Lead Counsel
M001,T002,12.0,Research Support
M002,T002,34.5,Lead Counsel

Load clients:

import pandas as pd

def load_clients(conn, csv_path):
    df = pd.read_csv(csv_path)
    query = """
    UNWIND $rows AS row
    MERGE (c:Client {clientId: row.clientId})
    SET c.name = row.name,
        c.industry = row.industry,
        c.revenue = toInteger(row.revenue),
        c.location = row.location
    """
    conn.run_query(query, {"rows": df.to_dict('records')})
    print(f"Loaded {len(df)} clients")

load_clients(conn, "data/clients.csv")

Load matters and create client relationships:

def load_matters(conn, csv_path):
    df = pd.read_csv(csv_path)
    query = """
    UNWIND $rows AS row
    MERGE (m:Matter {matterId: row.matterId})
    SET m.name = row.name,
        m.practiceArea = row.practiceArea,
        m.startDate = date(row.startDate),
        m.endDate = date(row.endDate),
        m.billedAmount = toFloat(row.billedAmount)
    WITH m, row
    MATCH (c:Client {clientId: row.clientId})
    MERGE (c)-[:RETAINED_FOR]->(m)
    """
    conn.run_query(query, {"rows": df.to_dict('records')})
    print(f"Loaded {len(df)} matters")

load_matters(conn, "data/matters.csv")

Load timekeepers:

def load_timekeepers(conn, csv_path):
    df = pd.read_csv(csv_path)
    # Split practiceAreas string into array
    df['practiceAreas'] = df['practiceAreas'].str.split(';')
    query = """
    UNWIND $rows AS row
    MERGE (t:Timekeeper {timekeeperId: row.timekeeperId})
    SET t.name = row.name,
        t.title = row.title,
        t.office = row.office,
        t.practiceAreas = row.practiceAreas
    """
    conn.run_query(query, {"rows": df.to_dict('records')})
    print(f"Loaded {len(df)} timekeepers")

load_timekeepers(conn, "data/timekeepers.csv")

Load matter assignments:

def load_assignments(conn, csv_path):
    df = pd.read_csv(csv_path)
    query = """
    UNWIND $rows AS row
    MATCH (t:Timekeeper {timekeeperId: row.timekeeperId})
    MATCH (m:Matter {matterId: row.matterId})
    MERGE (t)-[w:WORKED_ON]->(m)
    SET w.hours = toFloat(row.hours),
        w.role = row.role
    """
    conn.run_query(query, {"rows": df.to_dict('records')})
    print(f"Loaded {len(df)} assignments")

load_assignments(conn, "data/matter_assignments.csv")

Run all loaders in sequence. Check Neo4j Browser: MATCH (n) RETURN count(n) should show your total node count.
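The UNWIND pattern above sends the entire CSV in a single transaction, which is fine for a few thousand rows. For larger exports, batching keeps transactions small and memory steady. A sketch that works with any connection object exposing run_query as in Step 2; the default batch size is an arbitrary starting point to tune against your data:

```python
def load_in_batches(conn, query, rows, batch_size=1000):
    """Send rows to Neo4j in fixed-size batches via an UNWIND $rows query.

    rows is a list of dicts, e.g. df.to_dict('records').
    Returns the number of batches executed.
    """
    batches = 0
    for start in range(0, len(rows), batch_size):
        conn.run_query(query, {"rows": rows[start:start + batch_size]})
        batches += 1
    return batches
```

Any of the loader queries above can be passed in unchanged, since they all take a single $rows parameter.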

Step 5: Query Your Knowledge Graph

Open Neo4j Browser and run these queries to verify your data.

Find all matters for a specific client:

MATCH (c:Client {name: "Acme Manufacturing"})-[:RETAINED_FOR]->(m:Matter)
RETURN m.name, m.practiceArea, m.billedAmount
ORDER BY m.startDate DESC

Find timekeepers who worked together on multiple matters:

MATCH (t1:Timekeeper)-[:WORKED_ON]->(m:Matter)<-[:WORKED_ON]-(t2:Timekeeper)
WHERE t1.timekeeperId < t2.timekeeperId
WITH t1, t2, count(DISTINCT m) AS sharedMatters
WHERE sharedMatters >= 2
RETURN t1.name, t2.name, sharedMatters
ORDER BY sharedMatters DESC

Find clients in the same industry with similar matter types:

MATCH (c1:Client)-[:RETAINED_FOR]->(m1:Matter)
MATCH (c2:Client)-[:RETAINED_FOR]->(m2:Matter)
WHERE c1.industry = c2.industry 
  AND m1.practiceArea = m2.practiceArea
  AND c1.clientId < c2.clientId
RETURN c1.name, c2.name, c1.industry, m1.practiceArea, count(*) AS overlap
ORDER BY overlap DESC
LIMIT 10

Find the most experienced timekeeper in a practice area:

MATCH (t:Timekeeper)-[w:WORKED_ON]->(m:Matter {practiceArea: "Litigation"})
WITH t, sum(w.hours) AS totalHours, count(m) AS matterCount
RETURN t.name, t.title, totalHours, matterCount
ORDER BY totalHours DESC
LIMIT 5

Step 6: Integrate with Your Q&A System

Your RAG pipeline now has two retrieval paths: vector search for content, graph queries for relationships.

Hybrid Retrieval Pattern

When a user asks "Who has M&A experience with healthcare clients?", your system should:

  1. Detect this is a relationship query (not a content query)
  2. Generate a Cypher query using an LLM
  3. Execute the query against Neo4j
  4. Format results as context for the final answer

Example Integration Code

from openai import OpenAI

def generate_cypher_query(user_question, schema_description):
    client = OpenAI()
    prompt = f"""You are a Cypher query generator for a law firm knowledge graph.

Schema:
{schema_description}

User question: {user_question}

Generate a Cypher query to answer this question. Return only the query, no explanation.
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content.strip()

def answer_with_graph(user_question, neo4j_conn):
    schema = """
    Nodes: Client, Matter, Timekeeper, Document
    Relationships: 
    - (Client)-[:RETAINED_FOR]->(Matter)
    - (Timekeeper)-[:WORKED_ON {hours, role}]->(Matter)
    - (Matter)-[:PRODUCED]->(Document)
    """
    
    cypher_query = generate_cypher_query(user_question, schema)
    print(f"Generated query: {cypher_query}")
    
    results = neo4j_conn.run_query(cypher_query)
    
    # Format results for LLM
    context = f"Query results:\n{results}"
    
    # Generate final answer
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer based on the query results provided."},
            {"role": "user", "content": f"Question: {user_question}\n\n{context}"}
        ]
    )
    return response.choices[0].message.content
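One caution before wiring this up: never execute model-generated Cypher without a guard, since a hallucinated MERGE or DELETE would write to your graph. A minimal read-only check; this is a blocklist sketch, not a full Cypher parser, and the clause list is an assumption you should extend for your deployment:

```python
import re

# Cypher clauses that modify data; reject any generated query containing them.
WRITE_CLAUSES = (
    "CREATE", "MERGE", "DELETE", "DETACH", "SET",
    "REMOVE", "DROP", "FOREACH", "LOAD",
)

def is_read_only(cypher: str) -> bool:
    """True if the query contains no known write clause (word-boundary match)."""
    upper = cypher.upper()
    return not any(
        re.search(r"\b" + re.escape(clause) + r"\b", upper)
        for clause in WRITE_CLAUSES
    )
```

Call is_read_only(cypher_query) before neo4j_conn.run_query and refuse (or fall back to vector search) when it returns False. For defense in depth, also run generated queries through a Neo4j user granted only read permissions.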

Query Router

Add logic to decide when to use graph vs. vector search:

def route_query(user_question):
    relationship_keywords = [
        "who worked", "which clients", "find partners", 
        "collaborated", "experience with", "similar to"
    ]
    
    if any(kw in user_question.lower() for kw in relationship_keywords):
        return "graph"
    else:
        return "vector"
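Keyword routing is deliberately crude, and many real questions need both paths ("Summarize the briefs from matters Sarah Chen led" is a relationship query and a content query). A sketch of a three-way router under the same keyword approach; both cue lists are assumptions to tune against your firm's actual query logs:

```python
RELATIONSHIP_CUES = [
    "who worked", "which clients", "find partners",
    "collaborated", "experience with", "similar to",
]
CONTENT_CUES = ["summarize", "what does", "explain", "draft", "playbook"]

def route_query_v2(user_question: str) -> str:
    """Return 'graph', 'vector', or 'hybrid' based on keyword cues."""
    q = user_question.lower()
    has_rel = any(cue in q for cue in RELATIONSHIP_CUES)
    has_content = any(cue in q for cue in CONTENT_CUES)
    if has_rel and has_content:
        return "hybrid"
    if has_rel:
        return "graph"
    return "vector"
```

For the hybrid path, run the graph query first to identify entities, then scope the vector search to documents linked to those entities.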

Step 7: Add Document Nodes

Link your vector store records to the graph. When you ingest a document into Pinecone/Weaviate, also create a Document node in Neo4j:

def link_document_to_matter(neo4j_conn, doc_id, matter_id, title, doc_type, vector_id):
    query = """
    MERGE (d:Document {docId: $docId})
    SET d.title = $title,
        d.docType = $docType,
        d.vectorId = $vectorId,
        d.createdDate = date()
    WITH d
    MATCH (m:Matter {matterId: $matterId})
    MERGE (m)-[:PRODUCED]->(d)
    """
    neo4j_conn.run_query(query, {
        "docId": doc_id,
        "matterId": matter_id,
        "title": title,
        "docType": doc_type,
        "vectorId": vector_id
    })

Now you can query: "Show me all briefs filed in matters where Sarah Chen was lead counsel."

MATCH (t:Timekeeper {name: "Sarah Chen"})-[w:WORKED_ON {role: "Lead Counsel"}]->(m:Matter)-[:PRODUCED]->(d:Document)
WHERE d.docType = "Brief"
RETURN d.title, m.name, d.createdDate
ORDER BY d.createdDate DESC
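Because each Document node carries the vectorId of its embedding record, graph results can be joined back to your vector store to pull full text into the answer context (add d.vectorId to the RETURN clause when you need this). A sketch with the vector store behind a duck-typed fetch method; that interface is a placeholder, since the exact call depends on whether you use Pinecone, Weaviate, or something else:

```python
def expand_with_vector_text(graph_rows, vector_store, id_key="d.vectorId"):
    """Attach stored chunk text to each graph result row.

    graph_rows:   list of dicts as returned by run_query
    vector_store: any object with fetch(vector_id) -> dict containing 'text'
                  (placeholder interface; adapt to your vector DB client)
    """
    expanded = []
    for row in graph_rows:
        vector_id = row.get(id_key)
        record = vector_store.fetch(vector_id) if vector_id else None
        expanded.append({**row, "text": (record or {}).get("text", "")})
    return expanded
```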

Maintenance and Scaling

Weekly Data Sync
Schedule a cron job to re-export CSVs from your practice management system and re-run the load scripts. Use MERGE instead of CREATE to avoid duplicates.



Reviewed by Revenue Institute

This guide is actively maintained and reviewed by the implementation experts at Revenue Institute. As the creators of The AI Workforce Playbook, we test and deploy these exact frameworks for professional services firms scaling without new headcount.


Need help turning this guide into reality? Revenue Institute builds and implements the AI workforce for professional services firms.

RevenueInstitute.com