Neo4j Knowledge Graph Setup Guide (Optional Advanced)
For firms wanting relational understanding on top of vector search.
Vector search finds similar documents. Knowledge graphs answer "who worked with whom on what" and "which clients have overlapping needs." If your firm needs to surface relationships between clients, matters, expertise areas, and precedents, Neo4j adds a relational layer that vector embeddings can't provide.
This guide walks you through deploying Neo4j, modeling your firm's data as a graph, and querying it alongside your vector store. You'll build a working knowledge graph in 2-3 hours.
When You Actually Need This
Skip this if you're just building document Q&A. Add Neo4j when you need to answer:
- "Which partners have worked on SEC compliance matters for fintech clients in the last 18 months?"
- "Show me all engagements where Sarah Chen and Michael Torres collaborated."
- "Which clients share the same industry, revenue band, and regulatory challenges?"
If your queries are purely content-based ("What does our M&A playbook say about due diligence?"), stick with vector search alone.
Prerequisites
Neo4j Instance
Use Neo4j Aura (managed cloud) for production. The free tier supports 200k nodes and 400k relationships. Sign up at console.neo4j.io. For local testing, run docker run -p 7474:7474 -p 7687:7687 neo4j:5.15.0 and connect to bolt://localhost:7687 instead of the Aura URI.
Python Environment
Install the Neo4j driver and pandas: pip install neo4j pandas. You'll use Python to load data and run queries. The Step 6 example also needs the openai package (pip install openai).
Data Sources
Export CSVs from your practice management system (Clio, PracticePanther) or CRM. Step 4 expects four files: clients, matters, timekeepers, and matter assignments.
Cypher Basics
Neo4j's query language. You'll learn enough in this guide, but skim the Cypher cheat sheet first.
Step 1: Design Your Graph Schema
Map your firm's data to nodes (entities) and relationships (connections). Start simple. You can always add complexity later.
Core Node Types
Client
Properties: clientId, name, industry, revenue, location, riskProfile
Matter
Properties: matterId, name, practiceArea, startDate, endDate, status, billedAmount
Timekeeper
Properties: timekeeperId, name, title, office, barAdmissions[], practiceAreas[]
Document
Properties: docId, title, docType, createdDate, vectorId (links to your Pinecone/Weaviate record)
Core Relationship Types
(Client)-[:RETAINED_FOR]->(Matter)
(Timekeeper)-[:WORKED_ON {hours: 45.5, role: "Lead Counsel"}]->(Matter)
(Timekeeper)-[:SPECIALIZES_IN]->(PracticeArea)
(Matter)-[:PRODUCED]->(Document)
(Client)-[:REFERRED_BY]->(Client)
Draw this on paper or use arrows.app before writing any code.
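If you prefer to keep the schema next to your code as well, a minimal sketch like the one below records it as plain Python data. The names NodeType, RelType, undefined_labels, and describe_schema are illustrative, not part of Neo4j or its driver. The check flags any label a relationship points at that has no node definition, and the description it prints has the same shape as the schema text passed to the LLM in Step 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    label: str
    key: str  # property that uniquely identifies the node
    properties: tuple = ()


@dataclass(frozen=True)
class RelType:
    start: str
    name: str
    end: str
    properties: tuple = ()


# Node and relationship types copied from the lists above.
NODES = (
    NodeType("Client", "clientId", ("name", "industry", "revenue", "location", "riskProfile")),
    NodeType("Matter", "matterId", ("name", "practiceArea", "startDate", "endDate", "status", "billedAmount")),
    NodeType("Timekeeper", "timekeeperId", ("name", "title", "office", "barAdmissions", "practiceAreas")),
    NodeType("Document", "docId", ("title", "docType", "createdDate", "vectorId")),
)
RELS = (
    RelType("Client", "RETAINED_FOR", "Matter"),
    RelType("Timekeeper", "WORKED_ON", "Matter", ("hours", "role")),
    RelType("Timekeeper", "SPECIALIZES_IN", "PracticeArea"),
    RelType("Matter", "PRODUCED", "Document"),
    RelType("Client", "REFERRED_BY", "Client"),
)


def undefined_labels(nodes, rels):
    """Return labels used by relationships that have no node definition."""
    known = {n.label for n in nodes}
    used = {label for r in rels for label in (r.start, r.end)}
    return sorted(used - known)


def describe_schema(nodes, rels):
    """Render the schema as text, one line per node or relationship type."""
    lines = ["Nodes:"]
    for n in nodes:
        props = ", ".join((n.key,) + n.properties)
        lines.append("- (:" + n.label + " {" + props + "})")
    lines.append("Relationships:")
    for r in rels:
        props = " {" + ", ".join(r.properties) + "}" if r.properties else ""
        lines.append("- (" + r.start + ")-[:" + r.name + props + "]->(" + r.end + ")")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_schema(NODES, RELS))
    print("Labels without a node definition:", undefined_labels(NODES, RELS))
```

Run against the lists above, the check reports PracticeArea: the SPECIALIZES_IN relationship targets it, but it is not among the four core node types. Decide whether to add it as a node or keep practice areas as a property before you load data.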
Step 2: Set Up Neo4j Connection
Create neo4j_setup.py:
from neo4j import GraphDatabase
import os


class Neo4jConnection:
    def __init__(self):
        # Connection settings come from the environment; the URI default is a placeholder.
        uri = os.getenv("NEO4J_URI", "neo4j+s://xxxxx.databases.neo4j.io")
        user = os.getenv("NEO4J_USER", "neo4j")
        password = os.getenv("NEO4J_PASSWORD")
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()

    def run_query(self, query, parameters=None):
        # Run one Cypher statement and return its records as plain dicts.
        with self.driver.session() as session:
            result = session.run(query, parameters)
            return [record.data() for record in result]


# Test connection
conn = Neo4jConnection()
result = conn.run_query("RETURN 'Connection successful' AS message")
print(result)
conn.close()
Set these variables in your shell environment, or keep them in .env and load that file at startup:
NEO4J_URI=neo4j+s://your-instance.databases.neo4j.io
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password
Run python neo4j_setup.py. You should see [{'message': 'Connection successful'}].
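One caveat: os.getenv reads the process environment only and never opens a .env file by itself. The python-dotenv package is the usual way to bridge that gap. If you would rather avoid another dependency, the sketch below is a deliberately small stand-in; parse_env_lines and load_env_file are hypothetical helper names, and the parser handles only simple KEY=VALUE lines with optional quotes.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding single or double quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path=".env"):
    """Copy values from the file into os.environ without overriding existing ones."""
    with open(path, encoding="utf-8") as fh:
        values = parse_env_lines(fh)
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return sorted(values)
```

Call load_env_file() before creating Neo4jConnection. Variables already exported in your shell win over the file, which keeps production secrets out of .env.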
Step 3: Create Constraints and Indexes
Constraints prevent duplicate nodes. Indexes speed up lookups. Run these in Neo4j Browser (browser tab at your Aura instance URL) or via Python:
CREATE CONSTRAINT client_id IF NOT EXISTS
FOR (c:Client) REQUIRE c.clientId IS UNIQUE;
CREATE CONSTRAINT matter_id IF NOT EXISTS
FOR (m:Matter) REQUIRE m.matterId IS UNIQUE;
CREATE CONSTRAINT timekeeper_id IF NOT EXISTS
FOR (t:Timekeeper) REQUIRE t.timekeeperId IS UNIQUE;
CREATE CONSTRAINT document_id IF NOT EXISTS
FOR (d:Document) REQUIRE d.docId IS UNIQUE;
CREATE INDEX matter_practice_area IF NOT EXISTS
FOR (m:Matter) ON (m.practiceArea);
CREATE INDEX timekeeper_name IF NOT EXISTS
FOR (t:Timekeeper) ON (t.name);
These take 5-10 seconds to create. Verify with SHOW CONSTRAINTS and SHOW INDEXES.
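If you go the Python route, keep in mind that, as far as I know, the driver runs one Cypher statement per call, whereas Neo4j Browser splits a script on semicolons for you. A minimal sketch of doing that split yourself follows. split_statements and apply_schema are illustrative names, the script is shortened to two of the statements above, and the splitter assumes no semicolons appear inside string literals.

```python
SCHEMA_SCRIPT = """
CREATE CONSTRAINT client_id IF NOT EXISTS
FOR (c:Client) REQUIRE c.clientId IS UNIQUE;
CREATE INDEX matter_practice_area IF NOT EXISTS
FOR (m:Matter) ON (m.practiceArea);
"""


def split_statements(script):
    """Split a script on ';' and collapse whitespace in each statement."""
    return [" ".join(part.split()) for part in script.split(";") if part.strip()]


def apply_schema(run, script):
    """Send each statement to `run` (any callable taking a query string)."""
    statements = split_statements(script)
    for statement in statements:
        run(statement)
    return len(statements)
```

With the connection class from Step 2, the call would be apply_schema(conn.run_query, SCHEMA_SCRIPT). Because every statement uses IF NOT EXISTS, running it again is harmless.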
Step 4: Load Data from CSVs
Export your data to CSVs. Place them in a data/ folder. Example structure:
clients.csv
clientId,name,industry,revenue,location
C001,Acme Manufacturing,Manufacturing,50000000,San Francisco
C002,TechStart Inc,Technology,5000000,Austin
matters.csv
matterId,clientId,name,practiceArea,startDate,endDate,billedAmount
M001,C001,Supply Chain Dispute,Litigation,2023-01-15,2023-09-30,125000
M002,C002,Series A Financing,Corporate,2023-03-01,2023-04-15,45000
timekeepers.csv
timekeeperId,name,title,office,practiceAreas
T001,Sarah Chen,Partner,San Francisco,Litigation;Employment
T002,Michael Torres,Senior Associate,Austin,Corporate;Securities
matter_assignments.csv
matterId,timekeeperId,hours,role
M001,T001,87.5,Lead Counsel
M001,T002,12.0,Research Support
M002,T002,34.5,Lead Counsel
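Before loading, it is worth checking that the files agree with each other. The loaders in this step use MATCH to find existing nodes, so a row that references a missing clientId, matterId, or timekeeperId produces no relationship and no error. The sketch below uses only the standard library. read_rows and find_orphans are hypothetical helpers, and the M003/C999 row is invented to show a failure.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def find_orphans(child_rows, parent_rows, key):
    """Return child rows whose `key` value does not appear in parent_rows."""
    known = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in known]


clients = read_rows(
    "clientId,name\n"
    "C001,Acme Manufacturing\n"
    "C002,TechStart Inc\n"
)
matters = read_rows(
    "matterId,clientId,name\n"
    "M001,C001,Supply Chain Dispute\n"
    "M003,C999,Hypothetical Orphan\n"  # invented row: C999 is not in clients
)

if __name__ == "__main__":
    for row in find_orphans(matters, clients, "clientId"):
        print("No matching client for matter", row["matterId"])
```

For real files, swap read_rows for csv.DictReader over an open file, and run the same check for matter_assignments.csv against both matters and timekeepers.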
Load clients:
import pandas as pd


def load_clients(conn, csv_path):
    df = pd.read_csv(csv_path)
    # MERGE on clientId so re-running the loader updates clients instead of duplicating them.
    query = """
    UNWIND $rows AS row
    MERGE (c:Client {clientId: row.clientId})
    SET c.name = row.name,
        c.industry = row.industry,
        c.revenue = toInteger(row.revenue),
        c.location = row.location
    """
    conn.run_query(query, {"rows": df.to_dict('records')})
    print(f"Loaded {len(df)} clients")


load_clients(conn, "data/clients.csv")
Load matters and create client relationships:
def load_matters(conn, csv_path):
    df = pd.read_csv(csv_path)
    # Create or update each matter, then link it to its client.
    query = """
    UNWIND $rows AS row
    MERGE (m:Matter {matterId: row.matterId})
    SET m.name = row.name,
        m.practiceArea = row.practiceArea,
        m.startDate = date(row.startDate),
        m.endDate = date(row.endDate),
        m.billedAmount = toFloat(row.billedAmount)
    WITH m, row
    MATCH (c:Client {clientId: row.clientId})
    MERGE (c)-[:RETAINED_FOR]->(m)
    """
    conn.run_query(query, {"rows": df.to_dict('records')})
    print(f"Loaded {len(df)} matters")


load_matters(conn, "data/matters.csv")
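One thing to watch with real exports: open matters usually have an empty endDate. pandas reads an empty cell as the float NaN, and passing NaN to date() is likely to be rejected by Neo4j. A minimal sketch of one way around it follows, assuming you convert NaN to None before sending the rows and guard the date conversion in Cypher. clean_rows and END_DATE_CLAUSE are hypothetical names, and the M003 row is invented.

```python
import math


def clean_rows(rows):
    """Replace float NaN values (pandas' marker for empty cells) with None."""
    return [
        {
            key: None if isinstance(value, float) and math.isnan(value) else value
            for key, value in row.items()
        }
        for row in rows
    ]


# Drop-in replacement for the `m.endDate = date(row.endDate)` line of the query.
END_DATE_CLAUSE = (
    "m.endDate = CASE WHEN row.endDate IS NULL "
    "THEN null ELSE date(row.endDate) END"
)

rows = [
    {"matterId": "M002", "endDate": "2023-04-15"},
    {"matterId": "M003", "endDate": float("nan")},  # invented open matter
]

if __name__ == "__main__":
    print(clean_rows(rows))
```

In load_matters this would mean passing clean_rows(df.to_dict('records')) as the rows parameter. Setting a property to null in Cypher leaves it unset, so open matters simply have no endDate.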
Load timekeepers:
def load_timekeepers(conn, csv_path):
    df = pd.read_csv(csv_path)
    # Split practiceAreas string into array
    df['practiceAreas'] = df['practiceAreas'].str.split(';')
    query = """
    UNWIND $rows AS row
    MERGE (t:Timekeeper {timekeeperId: row.timekeeperId})
    SET t.name = row.name,
        t.title = row.title,
        t.office = row.office,
        t.practiceAreas = row.practiceAreas
    """
    conn.run_query(query, {"rows": df.to_dict('records')})
    print(f"Loaded {len(df)} timekeepers")


load_timekeepers(conn, "data/timekeepers.csv")
Load matter assignments:
def load_assignments(conn, csv_path):
    df = pd.read_csv(csv_path)
    # Both nodes must already exist; rows with an unknown ID are skipped silently.
    query = """
    UNWIND $rows AS row
    MATCH (t:Timekeeper {timekeeperId: row.timekeeperId})
    MATCH (m:Matter {matterId: row.matterId})
    MERGE (t)-[w:WORKED_ON]->(m)
    SET w.hours = toFloat(row.hours),
        w.role = row.role
    """
    conn.run_query(query, {"rows": df.to_dict('records')})
    print(f"Loaded {len(df)} assignments")


load_assignments(conn, "data/matter_assignments.csv")
Run all loaders in sequence. Check Neo4j Browser: MATCH (n) RETURN count(n) should show your total node count.
Step 5: Query Your Knowledge Graph
Open Neo4j Browser and run these queries to verify your data.
Find all matters for a specific client:
MATCH (c:Client {name: "Acme Manufacturing"})-[:RETAINED_FOR]->(m:Matter)
RETURN m.name, m.practiceArea, m.billedAmount
ORDER BY m.startDate DESC
Find timekeepers who worked together on multiple matters:
MATCH (t1:Timekeeper)-[:WORKED_ON]->(m:Matter)<-[:WORKED_ON]-(t2:Timekeeper)
WHERE t1.timekeeperId < t2.timekeeperId
WITH t1, t2, count(DISTINCT m) AS sharedMatters
WHERE sharedMatters >= 2
RETURN t1.name, t2.name, sharedMatters
ORDER BY sharedMatters DESC
Find clients in the same industry with similar matter types:
MATCH (c1:Client)-[:RETAINED_FOR]->(m1:Matter)
MATCH (c2:Client)-[:RETAINED_FOR]->(m2:Matter)
WHERE c1.industry = c2.industry
AND m1.practiceArea = m2.practiceArea
AND c1.clientId < c2.clientId
RETURN c1.name, c2.name, c1.industry, m1.practiceArea, count(*) AS overlap
ORDER BY overlap DESC
LIMIT 10
Find the most experienced timekeeper in a practice area:
MATCH (t:Timekeeper)-[w:WORKED_ON]->(m:Matter {practiceArea: "Litigation"})
WITH t, sum(w.hours) AS totalHours, count(m) AS matterCount
RETURN t.name, t.title, totalHours, matterCount
ORDER BY totalHours DESC
LIMIT 5
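Once a query works in the Browser, you will probably want to call it from Python with the literal values swapped for parameters. Here is a minimal sketch for the last query. The function name experienced_timekeepers is illustrative, and run_query stands for any callable with the same signature as Neo4jConnection.run_query from Step 2.

```python
def experienced_timekeepers(run_query, practice_area, limit=5):
    """Rank timekeepers by hours logged on matters in one practice area."""
    query = """
    MATCH (t:Timekeeper)-[w:WORKED_ON]->(m:Matter {practiceArea: $practiceArea})
    WITH t, sum(w.hours) AS totalHours, count(m) AS matterCount
    RETURN t.name AS name, t.title AS title, totalHours, matterCount
    ORDER BY totalHours DESC
    LIMIT $limit
    """
    # Values travel as parameters, so user input is never pasted into the query text.
    return run_query(query, {"practiceArea": practice_area, "limit": int(limit)})
```

A call would look like experienced_timekeepers(conn.run_query, "Litigation"). Passing values as parameters also lets Neo4j reuse the query plan across practice areas.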
Step 6: Integrate with Your Q&A System
Your RAG pipeline can now send relationship questions to Neo4j and content questions to your vector store.
Hybrid Retrieval Pattern
When a user asks "Who has M&A experience with healthcare clients?", your system should:
- Detect this is a relationship query (not a content query)
- Generate a Cypher query using an LLM
- Execute the query against Neo4j
- Format results as context for the final answer
Example Integration Code
from openai import OpenAI


def generate_cypher_query(user_question, schema_description):
    # Ask the model to translate the question into Cypher using the schema text.
    client = OpenAI()
    prompt = f"""You are a Cypher query generator for a law firm knowledge graph.
    Schema:
    {schema_description}
    User question: {user_question}
    Generate a Cypher query to answer this question. Return only the query, no explanation.
    """
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content.strip()


def answer_with_graph(user_question, neo4j_conn):
    schema = """
    Nodes: Client, Matter, Timekeeper, Document
    Relationships:
    - (Client)-[:RETAINED_FOR]->(Matter)
    - (Timekeeper)-[:WORKED_ON {hours, role}]->(Matter)
    - (Matter)-[:PRODUCED]->(Document)
    """
    cypher_query = generate_cypher_query(user_question, schema)
    print(f"Generated query: {cypher_query}")
    results = neo4j_conn.run_query(cypher_query)
    # Format results for LLM
    context = f"Query results:\n{results}"
    # Generate final answer
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer based on the query results provided."},
            {"role": "user", "content": f"Question: {user_question}\n\n{context}"}
        ]
    )
    return response.choices[0].message.content
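The code above runs whatever the model returns. Models sometimes wrap the query in a Markdown code fence, and nothing stops a generated query from writing to the graph. The sketch below shows one conservative guard, under the assumption that Q&A queries should only ever read. extract_cypher, is_read_only, and safe_cypher are hypothetical helpers, and the keyword list is a rough filter that will also reject harmless queries containing those words.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker, built here to keep this file fence-free

# Clauses that modify data or call procedures; checked as whole words.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|CALL)\b",
    re.IGNORECASE,
)


def extract_cypher(llm_text):
    """Strip a surrounding Markdown code fence if the model added one."""
    text = llm_text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        body = text[len(FENCE):-len(FENCE)]
        # The first line of a fenced block may be empty or a language tag.
        first_line, _, rest = body.partition("\n")
        tag = first_line.strip()
        text = rest if (tag.isalpha() or not tag) else body
    return text.strip()


def is_read_only(cypher):
    """True when no write or procedure clause appears in the query."""
    return WRITE_CLAUSES.search(cypher) is None


def safe_cypher(llm_text):
    """Return the cleaned query, or raise if it could modify the graph."""
    query = extract_cypher(llm_text)
    if not is_read_only(query):
        raise ValueError("generated query contains a write or procedure clause")
    return query
```

In answer_with_graph you would wrap the generated text, for example cypher_query = safe_cypher(generate_cypher_query(user_question, schema)). A text filter is only a first line of defense; connecting with a database user that has read-only permissions is the stronger control.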
Query Router
Add logic to decide when to use graph vs. vector search:
def route_query(user_question):
    # Phrases that suggest the user is asking about relationships, not content.
    relationship_keywords = [
        "who worked", "which clients", "find partners",
        "collaborated", "experience with", "similar to"
    ]
    if any(kw in user_question.lower() for kw in relationship_keywords):
        return "graph"
    else:
        return "vector"
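Keyword routing will misfire now and then, so it helps to give the graph path a way out. The sketch below wires the router to two search callables and falls back to vector search when the graph returns nothing. The answer function and its graph_search and vector_search arguments are assumptions for illustration; plug in your own functions.

```python
RELATIONSHIP_KEYWORDS = (
    "who worked", "which clients", "find partners",
    "collaborated", "experience with", "similar to",
)


def route_query(user_question):
    """Same keyword rule as above, repeated so this sketch runs on its own."""
    question = user_question.lower()
    return "graph" if any(kw in question for kw in RELATIONSHIP_KEYWORDS) else "vector"


def answer(user_question, graph_search, vector_search):
    """Return (backend_used, results), preferring the graph when it has rows."""
    if route_query(user_question) == "graph":
        results = graph_search(user_question)
        if results:
            return "graph", results
    # Either a content question, or the graph came back empty.
    return "vector", vector_search(user_question)
```

Returning the backend name alongside the results makes it easy to log how often each path is used and to spot questions the keyword list routes badly.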
Step 7: Add Document Nodes
Link your vector store records to the graph. When you ingest a document into Pinecone/Weaviate, also create a Document node in Neo4j:
def link_document_to_matter(neo4j_conn, doc_id, matter_id, title, doc_type, vector_id):
    # Create or update the Document node, then attach it to its matter.
    query = """
    MERGE (d:Document {docId: $docId})
    SET d.title = $title,
        d.docType = $docType,
        d.vectorId = $vectorId,
        d.createdDate = date()
    WITH d
    MATCH (m:Matter {matterId: $matterId})
    MERGE (m)-[:PRODUCED]->(d)
    """
    neo4j_conn.run_query(query, {
        "docId": doc_id,
        "matterId": matter_id,
        "title": title,
        "docType": doc_type,
        "vectorId": vector_id
    })
Now you can query: "Show me all briefs filed in matters where Sarah Chen was lead counsel."
MATCH (t:Timekeeper {name: "Sarah Chen"})-[w:WORKED_ON {role: "Lead Counsel"}]->(m:Matter)-[:PRODUCED]->(d:Document)
WHERE d.docType = "Brief"
RETURN d.title, m.name, d.createdDate
ORDER BY d.createdDate DESC
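The vectorId property is what lets you go from a graph result back to document content. A minimal sketch of that hop follows: query the graph for matching documents, collect their vector IDs, and hand them to your vector store. vector_ids_from_rows and fetch_documents are illustrative names, and fetch_vectors stands in for whatever fetch-by-ID call your vector store client provides.

```python
def vector_ids_from_rows(rows, key="vectorId"):
    """Collect distinct, non-empty vector IDs in first-seen order."""
    seen = {}
    for row in rows:
        vector_id = row.get(key)
        if vector_id:
            seen.setdefault(vector_id, None)
    return list(seen)


def fetch_documents(run_query, fetch_vectors, timekeeper_name, doc_type):
    """Find a timekeeper's documents in the graph, then load them by vector ID."""
    query = """
    MATCH (t:Timekeeper {name: $name})-[:WORKED_ON]->(m:Matter)-[:PRODUCED]->(d:Document)
    WHERE d.docType = $docType
    RETURN d.title AS title, d.vectorId AS vectorId
    """
    rows = run_query(query, {"name": timekeeper_name, "docType": doc_type})
    return fetch_vectors(vector_ids_from_rows(rows))
```

The graph narrows the candidate set by relationship, and the vector store supplies the text, so the LLM sees only documents that are actually tied to the right people and matters.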
Maintenance and Scaling
Weekly Data Sync
Schedule a cron job to re-export CSVs from your practice management system and re-run the load scripts. Use MERGE instead of CREATE to avoid duplicates.
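A sync script mostly needs to run the loaders in dependency order (clients, matters, timekeepers, assignments) and stop if one fails, so later steps do not run against missing nodes. Here is a minimal sketch. run_sync is a hypothetical helper, and the crontab line and paths in the comment are placeholders for your own.

```python
# Example crontab entry (placeholder paths): every Sunday at 02:00
# 0 2 * * 0 /usr/bin/python3 /path/to/sync.py


def run_sync(steps):
    """Run (name, callable) steps in order; stop at the first failure.

    Returns (completed_step_names, error_message_or_None).
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None
```

With the loaders from Step 4, the steps list would look like [("clients", lambda: load_clients(conn, "data/clients.csv")), ...] in the order given above. Have the script exit with a non-zero status when an error message comes back, so cron can alert you.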
Performance

Reviewed by Revenue Institute
This guide is actively maintained and reviewed by the implementation experts at Revenue Institute. As the creators of The AI Workforce Playbook, we test and deploy these exact frameworks for professional services firms scaling without new headcount.