Retrieval-Augmented Generation: Augmenting AI with External Knowledge
Computational Analysis of Social Complexity
Fall 2025, Spencer Lyon
Prerequisites
- L.A1.01 (LLMs and API calls)
- Basic understanding of vector spaces
- Networks concepts (Weeks 3-5)
Required Julia Packages
Install the following packages if you haven’t already:
using Pkg
Pkg.add(["HTTP", "JSON3", "DotEnv", "LinearAlgebra", "DataFrames"])Outcomes
- Understand the limitations of parametric knowledge in LLMs
- Implement a complete Retrieval-Augmented Generation (RAG) system
- Work with embeddings and perform vector similarity search
- Design knowledge bases for domain-specific AI applications
- Analyze trade-offs between parametric and retrieval-based knowledge
References
- Lewis et al. (2020) “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” arXiv:2005.11401
- Gao et al. (2023) “Retrieval-Augmented Generation for Large Language Models: A Survey” arXiv:2312.10997
- OpenAI Embeddings Documentation: https://platform.openai.com/docs/guides/embeddings
- OpenAI Responses API Quickstart: https://platform.openai.com/docs/quickstart?api-mode=responses
- Anthropic Context Window Documentation: https://docs.anthropic.com/claude/docs/models-overview
The Knowledge Problem¶
- In the previous lecture we learned how to interact with LLMs via API calls
- We saw that these models can be remarkably helpful for various tasks
- However, LLMs have a fundamental limitation: they only “know” what was in their training data
- This creates several problems that we’ll explore now
Problem 1: Knowledge Cutoffs¶
- LLMs are trained on data collected up to a specific date
- For example, GPT-4 (original) had a knowledge cutoff of September 2021
- Claude 3.5 Sonnet has a knowledge cutoff of April 2024
- Any events, research, or information after the cutoff date is unknown to the model
- This means:
- Can’t answer questions about recent events
- Doesn’t know about new research papers or discoveries
- Unaware of updated statistics or data
Example: If you ask GPT-4 (original) about a paper published in 2023, it will either say it doesn’t know or, worse, hallucinate details about a paper it has never seen.
Problem 2: Hallucination¶
- LLMs are trained to generate plausible-sounding text
- When asked about something they don’t know, they often hallucinate: generate confident-sounding but factually incorrect information
- This is particularly dangerous because:
- The model sounds authoritative
- Users may not realize the information is false
- The hallucinated content can include fake citations, statistics, or facts
Example: Ask an LLM about a research paper that doesn’t exist. It might generate a plausible abstract, author names, and even “key findings” - all completely fabricated.
Problem 3: Domain-Specific Knowledge¶
- Even within their training data, LLMs have uneven knowledge
- Some domains are well-represented in training data (e.g., popular programming languages)
- Others are sparse (e.g., specialized academic subfields, proprietary company data)
- For your organization’s specific documents, data, or knowledge base, the LLM knows nothing
Example: You want an AI assistant to help with your company’s internal policies, product documentation, or research papers. The LLM has never seen these documents.
Connection to Network Information Flow¶
- Recall from Weeks 3-4 our study of information diffusion in networks
- We saw how information flows through network structures
- Key insight: who you’re connected to determines what information you can access
- LLMs face a similar problem:
- Their “connections” are frozen at training time
- They can’t access new information sources
- They can’t form new “edges” to external knowledge
- RAG solves this by creating a dynamic connection between the LLM and external knowledge sources
- Think of RAG as giving the LLM the ability to form new edges in the information network on-demand
The Solution: Retrieval-Augmented Generation¶
- Instead of relying solely on parametric knowledge (what’s encoded in model weights), we can augment the LLM with retrieval
- Basic idea:
- User asks a question
- Search a knowledge base for relevant information
- Provide that information to the LLM as context
- LLM generates answer based on retrieved information
- This approach:
- Eliminates knowledge cutoff problem (use up-to-date sources)
- Reduces hallucination (model works from provided facts)
- Enables domain-specific applications (use your own documents)
- Provides citations (you know which documents were used)
Embeddings: Representing Meaning as Vectors¶
- To implement retrieval, we need to search for relevant documents
- But how do we determine if a document is relevant to a query?
- The key innovation: embeddings - representations of text as high-dimensional vectors
- Embeddings capture semantic meaning: similar meanings → similar vectors
What are Embeddings?¶
- An embedding is a function that maps text to a d-dimensional vector
- The dimension d typically ranges from 384 to 3072
- Key property: texts with similar meanings have similar embeddings (vectors close together in vector space)
Intuition: Think of each dimension as capturing some aspect of meaning
- One dimension might capture “technical vs casual”
- Another might capture “positive vs negative sentiment”
- Another might capture “about networks vs about game theory”
- With hundreds or thousands of dimensions, we can capture nuanced semantic relationships
Example: Word Analogies¶
One famous property of embeddings is that they capture relational meaning:
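A minimal sketch of how you could check the classic king - man + woman ≈ queen analogy yourself is below. It relies on the get_embedding and cosine_similarity helpers defined later in this lecture (so run it after those cells), the word list is just an illustrative choice, and with sentence-level embedding models such as OpenAI's the analogy only holds approximately:
# Sketch: test the king - man + woman ≈ queen analogy (assumes get_embedding and
# cosine_similarity, defined later in this lecture). Results are approximate because
# these embedding models are tuned for sentences rather than single words.
words = ["king", "man", "woman", "queen", "apple"]
embs = Dict(w => get_embedding(w) for w in words)
analogy = embs["king"] .- embs["man"] .+ embs["woman"]
for w in ["queen", "apple"]
    println(w, ": ", round(cosine_similarity(analogy, embs[w]), digits=3))
end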
This works because the embedding space organizes concepts by their relationships:
- The vector from “man” to “king” represents “royalty”
- Adding that same vector to “woman” gives you something close to “queen”
Similar patterns work for other relationships:
- Paris - France + Germany ≈ Berlin
- Walking - Walked + Swam ≈ Swimming
Measuring Similarity: Cosine Similarity¶
- Once we have embeddings, we need a way to measure similarity between vectors
- The most common metric is cosine similarity
- For two vectors a and b: cos_sim(a, b) = (a · b) / (‖a‖ ‖b‖), i.e., the dot product divided by the product of the vector norms
- Returns a value between -1 and 1
- 1: vectors point in same direction (very similar)
- 0: vectors are orthogonal (unrelated)
- -1: vectors point in opposite directions (opposite meaning)
- Why cosine and not Euclidean distance?
- Cosine measures angle, not magnitude
- Makes sense for text: a longer document isn’t necessarily “farther” in meaning
- Embedding models are typically optimized for cosine similarity
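As a quick worked example with tiny made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):
using LinearAlgebra
# Two small example vectors: dot(a, b) = 2 + 2 + 4 = 8 and both norms equal 3,
# so the cosine similarity is 8/9 ≈ 0.889
a = [1.0, 2.0, 2.0]
b = [2.0, 1.0, 2.0]
println(dot(a, b) / (norm(a) * norm(b)))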
Creating Embeddings with OpenAI¶
Let’s see how to create embeddings using OpenAI’s API.
Setup: Before running the code below, make sure you have a .env file in your working directory with your OpenAI API key:
OPENAI_API_KEY=sk-your-api-key-here
The DotEnv package will automatically load these environment variables from the .env file.
using HTTP, JSON3, DotEnv
# Load environment variables from .env file
DotEnv.load!()
# Load API key from environment
OPENAI_API_KEY = ENV["OPENAI_API_KEY"]
function get_embedding(text::String; model="text-embedding-3-small")
"""
Get embedding vector for input text using OpenAI's API
Args:
text: Input text to embed
model: Embedding model to use (default: text-embedding-3-small)
Returns:
Vector of floats representing the embedding
"""
url = "https://api.openai.com/v1/embeddings"
headers = [
"Authorization" => "Bearer $(OPENAI_API_KEY)",
"Content-Type" => "application/json"
]
body = JSON3.write(Dict(
"input" => text,
"model" => model
))
response = HTTP.post(url, headers, body)
result = JSON3.read(String(response.body))
return Vector(result.data[1].embedding)
end
get_embedding (generic function with 1 method)
Let’s test this with some network-related concepts:
# Embed some network-related texts
texts = [
"Graph theory studies the mathematical structure of networks",
"The clustering coefficient measures the tendency of nodes to form triangles",
"Preferential attachment leads to power law degree distributions",
"I enjoy eating pizza with extra cheese"
]
embeddings = [get_embedding(text) for text in texts]
println("Embedding dimension: ", length(embeddings[1]))
println("First few values of first embedding: ", embeddings[1][1:5])Embedding dimension: 1536
First few values of first embedding: [-0.059553657, -0.029068848, 0.023821464, -0.023550766, -0.0076992884]
Now let’s compute cosine similarity between these texts:
using LinearAlgebra
function cosine_similarity(a::Vector{Float64}, b::Vector{Float64})
"""
Compute cosine similarity between two vectors
"""
return dot(a, b) / (norm(a) * norm(b))
end
# Compare first text (graph theory) to all others
println("Similarities to: '", texts[1], "'\n")
for (i, text) in enumerate(texts)
sim = cosine_similarity(embeddings[1], embeddings[i])
println("Text $i ($(text[1:min(40, end)])...): $(round(sim, digits=4))")
end
Similarities to: 'Graph theory studies the mathematical structure of networks'
Text 1 (Graph theory studies the mathematical st...): 1.0
Text 2 (The clustering coefficient measures the ...): 0.3744
Text 3 (Preferential attachment leads to power l...): 0.4689
Text 4 (I enjoy eating pizza with extra cheese...): 0.0422
What do we see?
- The first three texts (all about networks) are noticeably more similar to each other (roughly 0.37-0.47 in this run)
- The pizza text has very low similarity to the network texts (about 0.04)
- This demonstrates that embeddings capture semantic meaning
- Texts about similar topics cluster together in the embedding space
Exercise 1: Exploring Embeddings¶
Create embeddings for the following texts and compute their pairwise similarities:
- “Nash equilibrium is a solution concept in game theory”
- “Players choose strategies to maximize their payoffs”
- “Agent-based models simulate individual decision makers”
- “The weather today is sunny and warm”
Questions:
- Which pairs of texts have the highest similarity?
- Does the game theory text have higher similarity to the ABM text or the weather text? Why?
- What does this tell you about how embeddings capture domain knowledge?
# TODO: your code here
Building a RAG Pipeline¶
- Now that we understand embeddings, we can build a complete RAG system
- A RAG pipeline has four main stages:
- Chunking: Break documents into manageable pieces
- Embedding: Convert chunks to vectors and store them
- Retrieval: Find most relevant chunks for a query
- Generation: Use LLM to answer query with retrieved context
Let’s build each component step by step.
Stage 1: Chunking¶
- Most documents are too long to embed as a single unit
- We need to break them into chunks - smaller, semantically meaningful pieces
- Typical chunk sizes: 100-500 words or 400-2000 characters
Why chunk?
- Embeddings work best on focused, coherent pieces of text
- Enables fine-grained retrieval (find specific relevant paragraphs, not whole documents)
- Fits within LLM context limits (we can only inject so much context)
Chunking strategies:
- Fixed size: Every N characters or words
- Natural boundaries: By paragraph, section, or sentence
- Semantic: Using embeddings to identify topic shifts
For this lecture, we’ll use a simple paragraph-based chunking:
function chunk_text(text::String; max_chunk_size::Int=1000, overlap::Int=100)
"""
Split text into overlapping chunks
Args:
text: Input text to chunk
max_chunk_size: Maximum characters per chunk
overlap: Number of characters to overlap between chunks
Returns:
Vector of text chunks
"""
# Split by paragraphs first (double newline)
paragraphs = split(text, "\n\n")
chunks = String[]
current_chunk = ""
for para in paragraphs
# If adding this paragraph would exceed max size, save current chunk
if length(current_chunk) + length(para) > max_chunk_size && !isempty(current_chunk)
push!(chunks, strip(current_chunk))
# Start new chunk with overlap from previous
current_chunk = current_chunk[max(1, end-overlap):end]
end
current_chunk *= para * "\n\n"
end
# Don't forget last chunk
if !isempty(strip(current_chunk))
push!(chunks, strip(current_chunk))
end
return chunks
end
chunk_text (generic function with 1 method)
Stage 2: Vector Database¶
- Once we have chunks and embeddings, we need to store them efficiently
- A vector database stores embeddings and enables fast similarity search
- For this lecture, we’ll build a simple in-memory vector store
- Production systems typically use specialized vector databases like:
- Pinecone
- Weaviate
- Qdrant
- Chroma
- pgvector (PostgreSQL extension)
using DataFrames
mutable struct VectorStore
documents::Vector{String} # Original text chunks
embeddings::Vector{Vector{Float64}} # Corresponding embeddings
metadata::Vector{Dict{String, Any}} # Optional metadata (source, page, etc.)
end
# Constructor for empty store
VectorStore() = VectorStore(String[], Vector{Float64}[], Dict{String, Any}[])
function add_documents!(store::VectorStore, docs::Vector{String}; metadata=nothing)
"""
Add documents to the vector store by computing and storing their embeddings
"""
println("Embedding $(length(docs)) documents...")
for (i, doc) in enumerate(docs)
# Get embedding
emb = get_embedding(doc)
# Store document, embedding, and metadata
push!(store.documents, doc)
push!(store.embeddings, emb)
if metadata !== nothing && i <= length(metadata)
push!(store.metadata, metadata[i])
else
push!(store.metadata, Dict("index" => i))
end
if i % 10 == 0
println(" Processed $i/$(length(docs))...")
end
end
println("Done! Store now contains $(length(store.documents)) documents.")
end
function search(store::VectorStore, query::String; top_k::Int=5)
"""
Search for most similar documents to query
Args:
store: Vector store to search
query: Search query text
top_k: Number of results to return
Returns:
DataFrame with columns: rank, document, similarity, metadata
"""
# Embed the query
query_emb = get_embedding(query)
# Compute similarities to all documents
similarities = [cosine_similarity(query_emb, doc_emb) for doc_emb in store.embeddings]
# Get top k indices
top_indices = partialsortperm(similarities, 1:min(top_k, length(similarities)), rev=true)
# Build result dataframe
results = DataFrame(
rank = 1:length(top_indices),
document = store.documents[top_indices],
similarity = similarities[top_indices],
metadata = store.metadata[top_indices]
)
return results
end
search (generic function with 1 method)
Stage 3 & 4: Retrieval and Generation¶
- Now we can combine retrieval with LLM generation
- The key is to inject retrieved documents into the LLM’s context
- We’ll create a function that:
- Takes a user query
- Retrieves relevant chunks from vector store
- Constructs instructions with retrieved context
- Sends to LLM via OpenAI’s Responses API for final answer
Note: We’re using OpenAI’s Responses API (/v1/responses), which is the modern successor to the Chat Completions API. The Responses API uses instructions for system-level guidance and input for user queries, providing a cleaner interface for stateful interactions.
function rag_query(store::VectorStore, query::String; top_k::Int=3, model::String="gpt-5")
"""
Answer a query using RAG: retrieve relevant docs, then generate answer
Args:
store: Vector store containing knowledge base
query: User's question
top_k: Number of documents to retrieve
model: OpenAI model to use for generation (default: gpt-5)
Returns:
Generated answer and retrieved documents
"""
# Step 1: Retrieve relevant documents
results = search(store, query; top_k=top_k)
# Step 2: Construct context from retrieved documents
context = ""
for i in 1:nrow(results)
context *= "Document $i:\n$(results.document[i])\n\n"
end
# Step 3: Build instructions with context
instructions = """
You are a helpful assistant that answers questions based on provided context.
Use the context below to answer the user's question. If the context doesn't
contain enough information to answer the question, say so - do not make up information.
Context:
$(context)
"""
# Step 4: Call LLM using Responses API
response = call_gpt(
query;
instructions=instructions,
model=model
)
return (answer=response, sources=results)
end
function call_gpt(input::String; instructions::String="", model::String="gpt-5-nano", reasoning_effort::String="low")
url = "https://api.openai.com/v1/responses"
# OPENAI_API_KEY should be loaded from ENV via DotEnv
headers = [
"Authorization" => "Bearer $(ENV["OPENAI_API_KEY"])",
"Content-Type" => "application/json"
]
# Build request body with Responses API format
request_body = Dict(
"model" => model,
"input" => input,
"reasoning" => Dict("effort" => reasoning_effort)
)
# Add instructions if provided
if !isempty(instructions)
request_body["instructions"] = instructions
end
body = JSON3.write(request_body)
response = HTTP.post(url, headers, body)
result = JSON3.read(String(response.body))
# NOTE: result.output is an array that can have multiple items. We are assuming a simple case here...
return result.output[end].content[1].text
end
call_gpt (generic function with 1 method)
Application: Network Analysis Assistant¶
- Let’s build a practical RAG application: an AI assistant for graph theory and network analysis
- We’ll create a knowledge base from summaries of key network concepts
- This assistant will be able to answer questions about network theory using retrieved information
- In a real application, you would load actual research papers, textbooks, or documentation
Building the Knowledge Base¶
Let’s create a knowledge base with information about graph theory and network analysis:
# Sample documents about network analysis
# In practice, you would load these from files, scrape papers, etc.
network_docs = [
"""
Clustering Coefficient
The clustering coefficient measures the degree to which nodes in a graph tend to
cluster together. For a given node, it is defined as the proportion of connections
between the node's neighbors which are also connected to each other.
Formally, for node i with degree k_i, if there are E_i edges between its neighbors,
the local clustering coefficient is C_i = 2*E_i / (k_i * (k_i - 1)).
The global clustering coefficient averages this measure across all nodes. High
clustering coefficients are characteristic of social networks and indicate the
presence of tightly-knit communities.
""",
"""
Preferential Attachment and Scale-Free Networks
Preferential attachment is a mechanism by which networks grow over time. When a new
node joins the network, it forms connections preferentially to nodes that already
have many connections - the "rich get richer" phenomenon.
The Barabási-Albert model (1999) demonstrated that preferential attachment leads to
scale-free networks with power law degree distributions. In such networks, P(k) ~ k^(-γ)
where γ is typically between 2 and 3.
Scale-free networks are characterized by the presence of hubs - nodes with
disproportionately high degree. Examples include the Internet, citation networks,
and social media platforms.
""",
"""
Small-World Networks
Small-world networks, introduced by Watts and Strogatz (1998), exhibit two key properties:
high clustering (like regular lattices) and short path lengths (like random graphs).
The small-world phenomenon, popularly known as "six degrees of separation," refers to
the observation that most pairs of nodes in many real networks are connected by short
paths, typically of length 6 or less.
The Watts-Strogatz model generates small-world networks by starting with a ring lattice
and randomly rewiring edges with some probability p. For intermediate values of p,
the network maintains high clustering while gaining shortcuts that reduce path length.
""",
"""
Centrality Measures
Centrality measures identify the most important nodes in a network. Different
centrality measures capture different notions of importance:
1. Degree centrality: Simply counts the number of connections. High degree nodes
are "hubs" that directly connect to many others.
2. Betweenness centrality: Counts how often a node lies on shortest paths between
other nodes. High betweenness nodes are "bridges" that control information flow.
3. Closeness centrality: Measures the average distance from a node to all others.
High closeness nodes can reach the network quickly.
4. Eigenvector centrality: A node is important if it connects to other important
nodes. This is the principle behind Google's PageRank algorithm.
""",
"""
Community Detection
Community detection aims to identify groups of nodes that are more densely connected
internally than to the rest of the network. Communities represent functional modules
or social groups.
The modularity measure quantifies the quality of a partition. For partition C,
modularity Q = (1/2m) * Σ[A_ij - (k_i*k_j/2m)] * δ(c_i, c_j), where m is the total
number of edges, A_ij is the adjacency matrix, and δ(c_i, c_j) = 1 if nodes i and j
are in the same community.
Popular algorithms include the Louvain method, which greedily optimizes modularity,
and spectral methods based on the graph Laplacian. Community structure appears in
social networks (friend groups), biological networks (protein complexes), and
information networks (topic clusters).
""",
"""
Network Resilience and Robustness
Network resilience refers to a network's ability to maintain functionality when nodes
or edges are removed. This is crucial for infrastructure networks, communication systems,
and biological systems.
Scale-free networks exhibit a peculiar property: they are robust to random failures
but vulnerable to targeted attacks. Removing random nodes has little effect because
most nodes have low degree. However, removing hubs quickly fragments the network.
The resilience of random graphs is more uniform - they handle both random and targeted
removal similarly. Measuring resilience involves computing the size of the largest
connected component as nodes are progressively removed.
""",
"""
Network Formation and Strategic Link Formation
In many contexts, network connections form through strategic decisions by self-interested
agents. Game-theoretic models of network formation consider the costs and benefits of
forming links.
Jackson and Wolinsky (1996) introduced the concept of pairwise stability: a network is
pairwise stable if no agent wants to sever a link and no pair of agents wants to form
a new link.
The tension between individual incentives and social welfare is central to network
formation theory. The star network, where one central node connects to all others,
often minimizes total distance (socially optimal) but may not be pairwise stable
without appropriate transfer payments.
""",
"""
Information Diffusion and Cascades
Information, behaviors, and innovations spread through networks via social influence.
The threshold model (Granovetter, 1978) posits that individuals adopt a behavior when
a threshold fraction of their neighbors have adopted it.
A cascade occurs when an initial adoption by a few nodes triggers a chain reaction
of adoptions throughout the network. Whether cascades occur depends on the distribution
of thresholds and the network structure.
Weak ties play a crucial role in diffusion (Granovetter, 1973). While strong ties
create dense, clustered groups, weak ties serve as bridges between communities and
enable information to reach new populations. This explains why job seekers often find
opportunities through acquaintances rather than close friends.
"""
]
println("Created knowledge base with $(length(network_docs)) documents")Created knowledge base with 8 documents
Now let’s create our vector store and add these documents:
# Initialize vector store
network_store = VectorStore()
# Add documents to store (this will take a moment as we embed each document)
add_documents!(network_store, network_docs)
Embedding 8 documents...
Done! Store now contains 8 documents.
Testing Our Network Analysis Assistant¶
Let’s test our RAG system with some questions about networks:
# Question 1: About clustering
q1 = "What is the clustering coefficient and how is it calculated?"
result1 = rag_query(network_store, q1)
println("Question: ", q1)
println("\nAnswer:\n", result1.answer)
println("\n" * "="^80)
println("Top sources used:")
for i in 1:min(2, nrow(result1.sources))
println("\n[$i] (similarity: $(round(result1.sources.similarity[i], digits=3)))")
println(result1.sources.document[i][1:min(200, end)] * "...")
end
Question: What is the clustering coefficient and how is it calculated?
Answer:
The clustering coefficient quantifies how much a node’s neighbors are connected to each other (how “clustered” they are).
- Local clustering coefficient for node i: C_i = 2*E_i / (k_i * (k_i - 1))
- k_i: degree of node i (number of neighbors)
- E_i: number of edges that actually exist between those neighbors
- It measures the fraction of possible neighbor-to-neighbor links that are present.
- Global clustering coefficient: the average of the local clustering coefficients over all nodes.
High values indicate tightly knit groups, common in social networks.
================================================================================
Top sources used:
[1] (similarity: 0.765)
Clustering Coefficient
The clustering coefficient measures the degree to which nodes in a graph tend to
cluster together. For a given node, it is defined as the proportion of connections
between the ...
[2] (similarity: 0.416)
Community Detection
Community detection aims to identify groups of nodes that are more densely connected
internally than to the rest of the network. Communities represent functional modules
or social...
# Question 2: About scale-free networks
q2 = "Why are scale-free networks vulnerable to targeted attacks?"
result2 = rag_query(network_store, q2)
println("Question: ", q2)
println("\nAnswer:\n", result2.answer)Question: Why are scale-free networks vulnerable to targeted attacks?
Answer:
Because scale-free networks rely on a few high-degree hubs to hold the network together. Targeted attacks that remove these hubs quickly fragment the network and shrink the largest connected component, causing a rapid loss of connectivity, even though random failures (which mostly hit low-degree nodes) have little effect.
# Question 3: About weak ties
q3 = "What role do weak ties play in information diffusion?"
result3 = rag_query(network_store, q3)
println("Question: ", q3)
println("\nAnswer:\n", result3.answer)Question: What role do weak ties play in information diffusion?
Answer:
Weak ties act as bridges between otherwise separate, densely clustered groups. By connecting different communities, they let information travel beyond local circles, reaching new populations and enabling broader cascades of adoption. This is why people often hear about job opportunities through acquaintances rather than close friends.
Comparing RAG vs Vanilla LLM¶
Let’s demonstrate the value of RAG by comparing answers with and without retrieved context:
function vanilla_llm_query(query::String; model::String="gpt-5")
"""
Query LLM without RAG (no retrieved context)
"""
return call_gpt(
query;
instructions="You are a helpful assistant.",
model=model
)
end
# Ask a specific question that requires precise information
specific_q = "What is the formula for local clustering coefficient in a graph?"
println("Question: ", specific_q)
println("\n" * "="^80)
println("Vanilla LLM Answer:")
println(vanilla_llm_query(specific_q))
println("\n" * "="^80)
println("RAG Answer:")
println(rag_query(network_store, specific_q).answer)
Question: What is the formula for local clustering coefficient in a graph?
================================================================================
Vanilla LLM Answer:
For an undirected, unweighted graph, the local clustering coefficient of a node v is:
C(v) = 2 e_v / [k_v (k_v − 1)]
- k_v = degree of v (number of neighbors)
- e_v = number of edges between neighbors of v (equivalently, number of triangles involving v)
By convention, C(v) = 0 if k_v < 2.
Note (directed variant): For directed graphs, the maximal number of possible edges among neighbors is k_v (k_v − 1), so a common definition is C(v) = e_v / [k_v (k_v − 1)], where e_v counts directed edges among neighbors.
================================================================================
RAG Answer:
C_i = 2 E_i / [k_i (k_i − 1)], where k_i is the degree of node i and E_i is the number of edges between its neighbors.
Observations:
Both answers are likely correct (this is a well-known formula in the LLM’s training data). However:
- With RAG, we know exactly where the answer came from (we can cite sources)
- With RAG, if we had proprietary or recent information, the LLM couldn’t answer without it
- With RAG, we can update knowledge without retraining the model
Let’s test with a question the LLM definitely doesn’t know:
# Add a "research paper" about a fictional concept
fictional_doc = ["""
Recent work by Spencer Lyon (2025) introduced the concept of "resonance centrality"
for dynamic networks. Unlike traditional centrality measures which focus on static
network structure, resonance centrality captures how a node's importance oscillates
over time due to feedback loops.
The resonance centrality R_i(t) for node i at time t is computed as:
R_i(t) = β * Σ_j A_ij(t) * R_j(t-1) + (1-β) * d_i(t)
where A_ij(t) is the time-varying adjacency matrix, d_i(t) is the degree at time t,
and β ∈ [0,1] controls the memory of past centrality.
Lyon showed that in social networks, nodes with high resonance centrality are often
"trend catalysts" - they don't just spread information, but trigger cascades that
feed back to amplify their influence in subsequent time periods.
"""]
add_documents!(network_store, fictional_doc)
# Now ask about it
fictional_q = "What is resonance centrality and how is it calculated?"
println("Question: ", fictional_q)
println("\n" * "="^80)
println("Vanilla LLM Answer:")
println(vanilla_llm_query(fictional_q))
println("\n" * "="^80)
println("RAG Answer:")
println(rag_query(network_store, fictional_q).answer)
Embedding 1 documents...
Done! Store now contains 9 documents.
Question: What is resonance centrality and how is it calculated?
================================================================================
Vanilla LLM Answer:
Resonance centrality is a frequency-dependent notion of node importance for networks with dynamical processes. It ranks nodes by how strongly they respond to an oscillatory (periodic) input that propagates through the network dynamics. In other words, a node is “resonantly central” if the dynamics make it amplify signals at some frequency more than other nodes.
How it is defined (general idea)
- Choose a linear dynamical model on the network (first- or second-order), and a vector b describing which nodes are externally driven.
- Drive the system with a sinusoidal input u(t) = Re{u0 e^{iωt}}.
- Compute the steady-state frequency response (transfer function) from the input to each node. The magnitude of that response at node i is its frequency-dependent centrality gi(ω).
- Resonance centrality of node i is then taken as, e.g., the peak response maxω gi(ω) (and the corresponding ω is the node’s resonant frequency). Some variants use the squared magnitude or the area under the curve across ω.
Common formulations
1) First-order (diffusive/consensus-like) dynamics
- Dynamics: x′(t) = A x(t) + b u(t) with A stable (often A = −L for a Laplacian L of an undirected graph).
- Frequency response: H(ω) = (iωI − A)^{-1} b.
- Node-i response magnitude: gi(ω) = |ei^T H(ω)| = |[(iωI − A)^{-1} b]_i|.
- Resonance centrality (peak gain): Ri = maxω gi(ω).
Notes:
- If A = −L with L ≥ 0 (diffusion), the response typically peaks at low frequency (ω → 0). Then Ri ≈ |[L^† b]_i| (L^† is the Moore–Penrose pseudoinverse), so “resonance” reduces to a low-frequency (quasi-static) sensitivity, closely related to resistance-based measures.
2) Second-order (oscillator/inertial) dynamics
- Dynamics: M x″(t) + C x′(t) + K x(t) = b u(t), with M, C, K ≽ 0 (e.g., coupled oscillators, power grids).
- Dynamic stiffness: Z(ω) = K − ω^2 M + iω C.
- Frequency response: H(ω) = Z(ω)^{-1} b.
- Node-i response magnitude: gi(ω) = |ei^T H(ω)|.
- Resonance centrality: Ri = maxω gi(ω). True resonance peaks occur near the system’s natural frequencies (eigenvalues of (M, K) with damping from C).
Spectral form (useful for computation and insight)
- For first-order A diagonalizable: A = VΛV^{-1}, H(ω) = V (iωI − Λ)^{-1} V^{-1} b, so
gi(ω) = |∑k (vik wk)/(iω − λk)|, where wk are components of V^{-1}b.
- Peaks occur where |iω − λk| is small and the mode k is excited (wk large) and observed at node i (vik large).
Practical computation
- Choose dynamics (A, or M, C, K) and the input profile b (uniform, single node, or application-specific).
- Pick a frequency grid ω ∈ [ωmin, ωmax] covering relevant dynamics (near natural frequencies if known).
- For each ω, solve a linear system:
- First-order: y(ω) = (iωI − A)^{-1} b.
- Second-order: y(ω) = (K − ω^2 M + iω C)^{-1} b.
- Take gi(ω) = |y_i(ω)|. Define Ri = maxω gi(ω) (and optionally record the argmax for the resonant frequency).
- Variants: use squared magnitude, L2 norm over frequencies ∫ gi(ω)^2 dω, or multiple input profiles and average.
Connections to other centralities
- Setting ω = 0 in first-order systems recovers static sensitivity measures linked to resistance distance and current-flow centralities.
- Evaluating (sI − A)^{-1} at complex s relates resonance centrality to resolvent/Katz-type centralities; resonance centrality is the frequency-response version (s on the imaginary axis).
In short: resonance centrality is the node-wise peak (or frequency-shaped) gain of the network’s transfer function under oscillatory driving, computed via inverses of (iωI − A) for first-order or of dynamic stiffness matrices for second-order dynamics.
================================================================================
RAG Answer:
Resonance centrality is a dynamic-network centrality measure that captures how a node’s importance oscillates over time due to feedback loops. It emphasizes not just current connections but how past influence reverberates and amplifies across time.
Calculation:
For node i at time t:
R_i(t) = β * Σ_j A_ij(t) * R_j(t−1) + (1 − β) * d_i(t)
- A_ij(t): time-varying adjacency (connection from j to i at time t)
- R_j(t−1): resonance centrality of node j at the previous time step
- d_i(t): degree of node i at time t
- β ∈ [0,1]: memory parameter controlling how much past centrality influences the present
Interpretation:
- Higher β gives more weight to past influence propagating through the network (feedback).
- Lower β leans more on the node’s current degree.
In social networks, nodes with high resonance centrality tend to be “trend catalysts” that trigger cascades which amplify their future influence.
What happened?
- The vanilla LLM either says it doesn’t know, or hallucinates a plausible-sounding definition
- The RAG system correctly retrieves and uses our fictional paper
- This demonstrates how RAG enables working with proprietary, recent, or domain-specific knowledge
Trade-offs and Design Considerations¶
Now that we’ve built a working RAG system, let’s discuss key design decisions and trade-offs.
Parametric vs Retrieval-Based Knowledge¶
Parametric Knowledge (in model weights):
- Pros:
- Fast: no need to search external sources
- Synthesizes information from vast training corpus
- Good for general knowledge and reasoning
- Cons:
- Fixed at training time
- Can’t update without retraining
- May hallucinate when uncertain
- No citations or provenance
Retrieval-Based Knowledge (via RAG):
- Pros:
- Can use up-to-date information
- Access to proprietary/specialized knowledge
- Provides citations and sources
- Reduces hallucination
- Updates without retraining
- Cons:
- Slower (requires retrieval step)
- Quality depends on retrieval accuracy
- Requires maintaining vector database
- May miss information requiring synthesis across many sources
Key Design Decisions¶
1. Chunk Size (a small sketch comparing chunk sizes follows at the end of this section)
- Smaller chunks (100-200 words):
- More precise retrieval
- Better for question answering
- May lose context across chunk boundaries
- Larger chunks (500-1000 words):
- Preserve more context
- Better for complex reasoning
- May include irrelevant information
2. Number of Retrieved Documents (top_k)
- Few documents (k=1-3):
- Focuses on most relevant information
- Faster, uses less context window
- May miss important information
- Many documents (k=5-10):
- More comprehensive context
- Better for complex queries
- May include noise, uses more tokens
3. Embedding Model
- Smaller models (text-embedding-3-small: 512-1536 dims):
- Faster embedding and search
- Lower cost
- Slightly lower quality
- Larger models (text-embedding-3-large: 3072 dims):
- Better semantic understanding
- More accurate retrieval
- Higher cost and latency
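To get a feel for the chunk-size trade-off mentioned above, here is a small sketch that runs the chunk_text function from earlier over the network_docs knowledge base at two maximum sizes; the specific sizes (300 and 1000 characters) are illustrative choices, not recommendations:
# Sketch: compare how many chunks chunk_text produces at two illustrative sizes,
# using the network_docs knowledge base defined above.
combined_text = join(network_docs, "\n\n")
for size in (300, 1000)
    chunks = chunk_text(combined_text; max_chunk_size=size)
    avg_len = round(sum(length, chunks) / length(chunks), digits=1)
    println("max_chunk_size=$size -> $(length(chunks)) chunks, mean length ≈ $avg_len chars")
end
Smaller maximum sizes yield more, shorter chunks, which retrieve more precisely but carry less surrounding context.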
Advanced RAG Techniques¶
Our implementation is a basic RAG pipeline. Production systems often use:
1. Hybrid Search
- Combine vector similarity with keyword search (BM25)
- Leverages both semantic and lexical matching
- More robust to different query types (a minimal sketch appears after this list)
2. Re-ranking
- Use a separate model to re-rank retrieved results
- Can consider query-document interaction more deeply
- Improves precision of top results
3. Query Expansion
- Generate multiple variations of user query
- Search with all variations
- Helps with vocabulary mismatch between query and documents
4. Hierarchical Retrieval
- First retrieve at document level
- Then retrieve specific passages within those documents
- Balances context and precision
5. Iterative RAG
- Let LLM decide what to retrieve next
- Multiple retrieval rounds
- Better for complex, multi-hop reasoning
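As one concrete illustration, here is the minimal hybrid-search sketch referenced in the hybrid search item above. It combines the cosine similarity search we already built with a crude word-overlap score standing in for BM25; the hybrid_search name, the alpha weighting, and the overlap scoring are all illustrative choices, not a production design:
# Sketch: hybrid search over our VectorStore, mixing semantic (cosine) and lexical
# (word-overlap) scores. A real system would use BM25 or a dedicated lexical index.
function hybrid_search(store::VectorStore, query::String; top_k::Int=5, alpha::Float64=0.7)
    query_emb = get_embedding(query)
    query_words = Set(split(lowercase(query)))
    scores = map(1:length(store.documents)) do i
        sem = cosine_similarity(query_emb, store.embeddings[i])
        doc_words = Set(split(lowercase(store.documents[i])))
        lex = length(intersect(query_words, doc_words)) / max(length(query_words), 1)
        alpha * sem + (1 - alpha) * lex
    end
    top = partialsortperm(scores, 1:min(top_k, length(scores)), rev=true)
    return store.documents[top], scores[top]
end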
Connection to Course Themes¶
RAG systems connect to several course concepts:
Networks and Information Flow
- RAG creates edges from LLM to knowledge sources
- Quality of answer depends on network structure (which documents are connected via similarity)
- Similar to how social network structure affects information diffusion
Game Theory and Mechanism Design
- In multi-agent RAG systems, agents must decide what information to share
- Strategic information retrieval: costly to retrieve everything, must be selective
- Related to costly network formation from our game theory module
Agent-Based Modeling
- Can model RAG systems as agents with retrieval actions
- Environment is the knowledge base
- Rules govern when to retrieve, what to retrieve, how to synthesize
- Next lecture will explore multi-agent AI systems in depth
Exercise 2: Build Your Own RAG System¶
Choose one of the following domains and build a RAG system:
Option A: Game Theory Assistant
- Create documents summarizing key game theory concepts from our Week 8-9 lectures
- Include: Nash equilibrium, dominant strategies, mixed strategies, auctions
- Test with questions like “What is a Nash equilibrium?” or “Explain the winner’s curse”
Option B: Julia Programming Helper
- Create documents with Julia programming tips and common patterns
- Include: array operations, data frames, plotting, packages we’ve used
- Test with questions like “How do I create a histogram?” or “What’s the syntax for filtering a DataFrame?”
Option C: Your Domain
- Choose a topic you’re interested in
- Create 5-10 documents with information
- Build and test a RAG system
For your chosen option:
- Create at least 5 documents (can be shorter than our examples)
- Build a vector store
- Test with at least 3 different questions
- Compare RAG answers vs vanilla LLM answers
- Experiment with different values of top_k (number of retrieved documents)
# TODO: your code here
# Step 1: Create your documents
my_docs = [
# Your documents here
]
# Step 2: Build vector store
my_store = VectorStore()
# add_documents!(my_store, my_docs)
# Step 3: Test with questions
# test_q1 = "..."
# result = rag_query(my_store, test_q1)
# println(result.answer)
Exercise 3: Evaluating Retrieval Quality¶
An important aspect of RAG systems is retrieval quality - are we finding the right documents?
Using our network_store:
- Create 3 test queries where you know which document(s) should be retrieved
- For each query, perform retrieval with different values of k (k=1, 3, 5)
- Examine the similarity scores - what do they tell you?
- Try a query that requires information from multiple documents
- Try a query that’s completely off-topic (e.g., about cooking) - what gets retrieved?
Questions to explore:
- Does the highest similarity document always have the best answer?
- How much does similarity score drop from rank 1 to rank 3?
- What happens when the query uses different terminology than the documents?
# TODO: your analysis here
# Example structure:
# test_queries = [
# "query 1...",
# "query 2...",
# "query 3..."
# ]
# for query in test_queries
# println("Query: ", query)
# results = search(network_store, query; top_k=5)
# # Analyze results
# end
Exercise 4: Design Decisions¶
Consider the following scenarios and discuss what RAG design choices you would make:
Scenario 1: Legal Document Q&A
- Database: 10,000 legal contracts and case documents
- Queries: Lawyers asking about specific clauses and precedents
- Requirements: High accuracy, must cite sources, complex reasoning
Scenario 2: Customer Support Chatbot
- Database: 500 FAQ articles and product documentation
- Queries: Customers asking how to use products, troubleshoot issues
- Requirements: Fast responses, conversational, handles unclear questions
Scenario 3: Academic Research Assistant
- Database: 100,000 research papers
- Queries: Researchers asking about literature, methodologies, findings
- Requirements: Comprehensive answers, identifies connections between papers
For each scenario, decide:
- Chunk size (small, medium, large)?
- Number of documents to retrieve (k=?)?
- Would you use any advanced techniques (hybrid search, re-ranking, etc.)?
- How would you handle queries that require information from multiple documents?
- How would you evaluate the system’s performance?
Write your analysis below:
Your Analysis:
Scenario 1: Legal Document Q&A
- Chunk size: ...
- k: ...
- Advanced techniques: ...
- Multi-document handling: ...
- Evaluation: ...
(Continue for other scenarios)
Looking Ahead: Multi-Agent Systems¶
- In this lecture we built a single-agent RAG system
- The LLM agent retrieves information and generates answers
- But what if we had multiple AI agents working together?
- Next lecture (A1.03) we’ll explore multi-agent AI systems:
- Multiple LLM agents with specialized roles
- Agents that coordinate and communicate
- Connections to agent-based models and game theory
- Building complex workflows with agentic systems
Think about:
- How might multiple agents improve retrieval? (e.g., one agent generates queries, another evaluates relevance)
- What if different agents had access to different knowledge bases?
- How does this relate to our study of networks and strategic interaction?
Key Takeaways¶
LLMs have fundamental knowledge limitations: cutoffs, hallucination, missing domain-specific info
RAG augments LLMs with external knowledge through retrieval from vector databases
Embeddings represent text as vectors capturing semantic meaning, enabling similarity search
A RAG pipeline has four stages: chunking, embedding, retrieval, generation
Design trade-offs matter: chunk size, number of retrieved docs, embedding model all affect quality and performance
RAG enables practical AI applications with up-to-date, domain-specific, and proprietary knowledge
RAG connects to network concepts: creating dynamic information flow from LLM to knowledge sources
Citations and provenance are key benefits of RAG over pure parametric knowledge
Further Reading¶
Foundational Papers:
- Lewis et al. (2020) “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”
- Gao et al. (2023) “Retrieval-Augmented Generation for Large Language Models: A Survey”
Advanced RAG:
- Khattab et al. (2023) “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines” arXiv:2310.03714
- Shi et al. (2023) “REPLUG: Retrieval-Augmented Black-Box Language Models” arXiv:2301.12652
Vector Databases:
- Pinecone Documentation: https://docs.pinecone.io/
- Weaviate Concepts: https://weaviate.io/developers/weaviate/concepts
Embeddings:
- OpenAI Embeddings Guide: https://platform.openai.com/docs/guides/embeddings
- Neelakantan et al. (2022) “Text and Code Embeddings by Contrastive Pre-Training” arXiv:2201.10005