Retrieval-Augmented Generation: Augmenting AI with External Knowledge
Computational Analysis of Social Complexity
Fall 2025, Spencer Lyon
Prerequisites
- L.A1.01 (LLMs and API calls)
- Basic understanding of vector spaces
- Networks concepts (Weeks 3-5)
Required Julia Packages
Install the following packages if you haven’t already:
using Pkg
Pkg.add(["HTTP", "JSON3", "DotEnv", "LinearAlgebra", "DataFrames"])Outcomes
- Understand the limitations of parametric knowledge in LLMs
- Implement a complete Retrieval-Augmented Generation (RAG) system
- Work with embeddings and perform vector similarity search
- Design knowledge bases for domain-specific AI applications
- Analyze trade-offs between parametric and retrieval-based knowledge
References
- Lewis et al. (2020) “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” arXiv:2005.11401
- Gao et al. (2023) “Retrieval-Augmented Generation for Large Language Models: A Survey” arXiv:2312.10997
- OpenAI Embeddings Documentation: https://platform.openai.com/docs/guides/embeddings
- OpenAI Responses API Quickstart: https://platform.openai.com/docs/quickstart?api-mode=responses
- Anthropic Context Window Documentation: https://docs.anthropic.com/claude/docs/models-overview
The Knowledge Problem¶
- In the previous lecture we learned how to interact with LLMs via API calls
- We saw that these models can be remarkably helpful for various tasks
- However, LLMs have a fundamental limitation: they only “know” what was in their training data
- This creates several problems that we’ll explore now
Problem 1: Knowledge Cutoffs¶
- LLMs are trained on data collected up to a specific date
- For example, GPT-4 (original) had a knowledge cutoff of September 2021
- Claude 3.5 Sonnet has a knowledge cutoff of April 2024
- Any events, research, or information after the cutoff date is unknown to the model
- This means:
- Can’t answer questions about recent events
- Doesn’t know about new research papers or discoveries
- Unaware of updated statistics or data
Example: If you ask GPT-4 (original) about a paper published in 2023, it will either say it doesn’t know or, worse, hallucinate details about a paper it has never seen.
Problem 2: Hallucination¶
- LLMs are trained to generate plausible-sounding text
- When asked about something they don’t know, they often hallucinate: generate confident-sounding but factually incorrect information
- This is particularly dangerous because:
- The model sounds authoritative
- Users may not realize the information is false
- The hallucinated content can include fake citations, statistics, or facts
Example: Ask an LLM about a research paper that doesn’t exist. It might generate a plausible abstract, author names, and even “key findings” - all completely fabricated.
Problem 3: Domain-Specific Knowledge¶
- Even within their training data, LLMs have uneven knowledge
- Some domains are well-represented in training data (e.g., popular programming languages)
- Others are sparse (e.g., specialized academic subfields, proprietary company data)
- For your organization’s specific documents, data, or knowledge base, the LLM knows nothing
Example: You want an AI assistant to help with your company’s internal policies, product documentation, or research papers. The LLM has never seen these documents.
Connection to Network Information Flow¶
- Recall from Weeks 3-4 our study of information diffusion in networks
- We saw how information flows through network structures
- Key insight: who you’re connected to determines what information you can access
- LLMs face a similar problem:
- Their “connections” are frozen at training time
- They can’t access new information sources
- They can’t form new “edges” to external knowledge
- RAG solves this by creating a dynamic connection between the LLM and external knowledge sources
- Think of RAG as giving the LLM the ability to form new edges in the information network on-demand
The Solution: Retrieval-Augmented Generation¶
- Instead of relying solely on parametric knowledge (what’s encoded in model weights), we can augment the LLM with retrieval
- Basic idea:
- User asks a question
- Search a knowledge base for relevant information
- Provide that information to the LLM as context
- LLM generates answer based on retrieved information
- This approach:
- Eliminates knowledge cutoff problem (use up-to-date sources)
- Reduces hallucination (model works from provided facts)
- Enables domain-specific applications (use your own documents)
- Provides citations (you know which documents were used)
Embeddings: Representing Meaning as Vectors¶
- To implement retrieval, we need to search for relevant documents
- But how do we determine if a document is relevant to a query?
- The key innovation: embeddings - representations of text as high-dimensional vectors
- Embeddings capture semantic meaning: similar meanings → similar vectors
What are Embeddings?¶
- An embedding is a function that maps text to a d-dimensional vector
- The dimension d typically ranges from 384 to 3072
- Key property: texts with similar meanings have similar embeddings (vectors close together in vector space)
Intuition: Think of each dimension as capturing some aspect of meaning
- One dimension might capture “technical vs casual”
- Another might capture “positive vs negative sentiment”
- Another might capture “about networks vs about game theory”
- With hundreds or thousands of dimensions, we can capture nuanced semantic relationships
Example: Word Analogies¶
One famous property of embeddings is that they capture relational meaning:
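A minimal sketch of how you could check the classic king - man + woman ≈ queen analogy yourself is below. It relies on the get_embedding and cosine_similarity helpers defined later in this lecture (so run it after those cells), the word list is just an illustrative choice, and with sentence-level embedding models such as OpenAI's the analogy only holds approximately:
# Sketch: test the king - man + woman ≈ queen analogy (assumes get_embedding and
# cosine_similarity, defined later in this lecture). Results are approximate because
# these embedding models are tuned for sentences rather than single words.
words = ["king", "man", "woman", "queen", "apple"]
embs = Dict(w => get_embedding(w) for w in words)
analogy = embs["king"] .- embs["man"] .+ embs["woman"]
for w in ["queen", "apple"]
    println(w, ": ", round(cosine_similarity(analogy, embs[w]), digits=3))
end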
This works because the embedding space organizes concepts by their relationships:
- The vector from “man” to “king” represents “royalty”
- Adding that same vector to “woman” gives you something close to “queen”
Similar patterns work for other relationships:
- Paris - France + Germany ≈ Berlin
- Walking - Walked + Swam ≈ Swimming
Measuring Similarity: Cosine Similarity¶
- Once we have embeddings, we need a way to measure similarity between vectors
- The most common metric is cosine similarity
- For two vectors a and b: cos_sim(a, b) = (a · b) / (‖a‖ ‖b‖), i.e., the dot product divided by the product of the vector norms
- Returns a value between -1 and 1
- 1: vectors point in same direction (very similar)
- 0: vectors are orthogonal (unrelated)
- -1: vectors point in opposite directions (opposite meaning)
- Why cosine and not Euclidean distance?
- Cosine measures angle, not magnitude
- Makes sense for text: a longer document isn’t necessarily “farther” in meaning
- Embedding models are typically optimized for cosine similarity
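As a quick worked example with tiny made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):
using LinearAlgebra
# Two small example vectors: dot(a, b) = 2 + 2 + 4 = 8 and both norms equal 3,
# so the cosine similarity is 8/9 ≈ 0.889
a = [1.0, 2.0, 2.0]
b = [2.0, 1.0, 2.0]
println(dot(a, b) / (norm(a) * norm(b)))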
Creating Embeddings with OpenAI¶
Let’s see how to create embeddings using OpenAI’s API.
Setup: Before running the code below, make sure you have a .env file in your working directory with your OpenAI API key:
OPENAI_API_KEY=sk-your-api-key-here
The DotEnv package will automatically load these environment variables from the .env file.
using HTTP, JSON3, DotEnv
# Load environment variables from .env file
DotEnv.load!()
# Load API key from environment
OPENAI_API_KEY = ENV["OPENAI_API_KEY"]
function get_embedding(text::String; model="text-embedding-3-small")
"""
Get embedding vector for input text using OpenAI's API
Args:
text: Input text to embed
model: Embedding model to use (default: text-embedding-3-small)
Returns:
Vector of floats representing the embedding
"""
url = "https://api.openai.com/v1/embeddings"
headers = [
"Authorization" => "Bearer $(OPENAI_API_KEY)",
"Content-Type" => "application/json"
]
body = JSON3.write(Dict(
"input" => text,
"model" => model
))
response = HTTP.post(url, headers, body)
result = JSON3.read(String(response.body))
return Vector(result.data[1].embedding)
end
get_embedding (generic function with 1 method)
Let’s test this with some network-related concepts:
# Embed some network-related texts
texts = [
"Graph theory studies the mathematical structure of networks",
"The clustering coefficient measures the tendency of nodes to form triangles",
"Preferential attachment leads to power law degree distributions",
"I enjoy eating pizza with extra cheese"
]
embeddings = [get_embedding(text) for text in texts]
println("Embedding dimension: ", length(embeddings[1]))
println("First few values of first embedding: ", embeddings[1][1:5])Embedding dimension: 1536
First few values of first embedding: [-0.059553657, -0.029068848, 0.023821464, -0.023550766, -0.0076992884]
Now let’s compute cosine similarity between these texts:
using LinearAlgebra
function cosine_similarity(a::Vector{Float64}, b::Vector{Float64})
"""
Compute cosine similarity between two vectors
"""
return dot(a, b) / (norm(a) * norm(b))
end
# Compare first text (graph theory) to all others
println("Similarities to: '", texts[1], "'\n")
for (i, text) in enumerate(texts)
sim = cosine_similarity(embeddings[1], embeddings[i])
println("Text $i ($(text[1:min(40, end)])...): $(round(sim, digits=4))")
end
Similarities to: 'Graph theory studies the mathematical structure of networks'
Text 1 (Graph theory studies the mathematical st...): 1.0
Text 2 (The clustering coefficient measures the ...): 0.3744
Text 3 (Preferential attachment leads to power l...): 0.4689
Text 4 (I enjoy eating pizza with extra cheese...): 0.0422
What do we see?
- The first three texts (all about networks) are noticeably more similar to each other (roughly 0.37-0.47 in this run)
- The pizza text has very low similarity to the network texts (about 0.04)
- This demonstrates that embeddings capture semantic meaning
- Texts about similar topics cluster together in the embedding space
Exercise 1: Exploring Embeddings¶
Create embeddings for the following texts and compute their pairwise similarities:
- “Nash equilibrium is a solution concept in game theory”
- “Players choose strategies to maximize their payoffs”
- “Agent-based models simulate individual decision makers”
- “The weather today is sunny and warm”
Questions:
- Which pairs of texts have the highest similarity?
- Does the game theory text have higher similarity to the ABM text or the weather text? Why?
- What does this tell you about how embeddings capture domain knowledge?
# TODO: your code here
Building a RAG Pipeline¶
- Now that we understand embeddings, we can build a complete RAG system
- A RAG pipeline has four main stages:
- Chunking: Break documents into manageable pieces
- Embedding: Convert chunks to vectors and store them
- Retrieval: Find most relevant chunks for a query
- Generation: Use LLM to answer query with retrieved context
Let’s build each component step by step.
Stage 1: Chunking¶
- Most documents are too long to embed as a single unit
- We need to break them into chunks - smaller, semantically meaningful pieces
- Typical chunk sizes: 100-500 words or 400-2000 characters
Why chunk?
- Embeddings work best on focused, coherent pieces of text
- Enables fine-grained retrieval (find specific relevant paragraphs, not whole documents)
- Fits within LLM context limits (we can only inject so much context)
Chunking strategies:
- Fixed size: Every N characters or words
- Natural boundaries: By paragraph, section, or sentence
- Semantic: Using embeddings to identify topic shifts
For this lecture, we’ll use a simple paragraph-based chunking:
function chunk_text(text::String; max_chunk_size::Int=1000, overlap::Int=100)
"""
Split text into overlapping chunks
Args:
text: Input text to chunk
max_chunk_size: Maximum characters per chunk
overlap: Number of characters to overlap between chunks
Returns:
Vector of text chunks
"""
# Split by paragraphs first (double newline)
paragraphs = split(text, "\n\n")
chunks = String[]
current_chunk = ""
for para in paragraphs
# If adding this paragraph would exceed max size, save current chunk
if length(current_chunk) + length(para) > max_chunk_size && !isempty(current_chunk)
push!(chunks, strip(current_chunk))
# Start new chunk with overlap from previous
current_chunk = current_chunk[max(1, end-overlap):end]
end
current_chunk *= para * "\n\n"
end
# Don't forget last chunk
if !isempty(strip(current_chunk))
push!(chunks, strip(current_chunk))
end
return chunks
end
chunk_text (generic function with 1 method)
Stage 2: Vector Database¶
- Once we have chunks and embeddings, we need to store them efficiently
- A vector database stores embeddings and enables fast similarity search
- For this lecture, we’ll build a simple in-memory vector store
- Production systems typically use specialized vector databases like:
- Pinecone
- Weaviate
- Qdrant
- Chroma
- pgvector (PostgreSQL extension)
using DataFrames
mutable struct VectorStore
documents::Vector{String} # Original text chunks
embeddings::Vector{Vector{Float64}} # Corresponding embeddings
metadata::Vector{Dict{String, Any}} # Optional metadata (source, page, etc.)
end
# Constructor for empty store
VectorStore() = VectorStore(String[], Vector{Float64}[], Dict{String, Any}[])
function add_documents!(store::VectorStore, docs::Vector{String}; metadata=nothing)
"""
Add documents to the vector store by computing and storing their embeddings
"""
println("Embedding $(length(docs)) documents...")
for (i, doc) in enumerate(docs)
# Get embedding
emb = get_embedding(doc)
# Store document, embedding, and metadata
push!(store.documents, doc)
push!(store.embeddings, emb)
if metadata !== nothing && i <= length(metadata)
push!(store.metadata, metadata[i])
else
push!(store.metadata, Dict("index" => i))
end
if i % 10 == 0
println(" Processed $i/$(length(docs))...")
end
end
println("Done! Store now contains $(length(store.documents)) documents.")
end
function search(store::VectorStore, query::String; top_k::Int=5)
"""
Search for most similar documents to query
Args:
store: Vector store to search
query: Search query text
top_k: Number of results to return
Returns:
DataFrame with columns: rank, document, similarity, metadata
"""
# Embed the query
query_emb = get_embedding(query)
# Compute similarities to all documents
similarities = [cosine_similarity(query_emb, doc_emb) for doc_emb in store.embeddings]
# Get top k indices
top_indices = partialsortperm(similarities, 1:min(top_k, length(similarities)), rev=true)
# Build result dataframe
results = DataFrame(
rank = 1:length(top_indices),
document = store.documents[top_indices],
similarity = similarities[top_indices],
metadata = store.metadata[top_indices]
)
return results
end
search (generic function with 1 method)
Stage 3 & 4: Retrieval and Generation¶
- Now we can combine retrieval with LLM generation
- The key is to inject retrieved documents into the LLM’s context
- We’ll create a function that:
- Takes a user query
- Retrieves relevant chunks from vector store
- Constructs instructions with retrieved context
- Sends to LLM via OpenAI’s Responses API for final answer
Note: We’re using OpenAI’s Responses API (/v1/responses), which is the modern successor to the Chat Completions API. The Responses API uses instructions for system-level guidance and input for user queries, providing a cleaner interface for stateful interactions.
function rag_query(store::VectorStore, query::String; top_k::Int=3, model::String="gpt-5")
"""
Answer a query using RAG: retrieve relevant docs, then generate answer
Args:
store: Vector store containing knowledge base
query: User's question
top_k: Number of documents to retrieve
model: OpenAI model to use for generation (default: gpt-5)
Returns:
Generated answer and retrieved documents
"""
# Step 1: Retrieve relevant documents
results = search(store, query; top_k=top_k)
# Step 2: Construct context from retrieved documents
context = ""
for i in 1:nrow(results)
context *= "Document $i:\n$(results.document[i])\n\n"
end
# Step 3: Build instructions with context
instructions = """
You are a helpful assistant that answers questions based on provided context.
Use the context below to answer the user's question. If the context doesn't
contain enough information to answer the question, say so - do not make up information.
Context:
$(context)
"""
# Step 4: Call LLM using Responses API
response = call_gpt(
query;
instructions=instructions,
model=model
)
return (answer=response, sources=results)
end
function call_gpt(input::String; instructions::String="", model::String="gpt-5-nano", reasoning_effort::String="low")
url = "https://api.openai.com/v1/responses"
# OPENAI_API_KEY should be loaded from ENV via DotEnv
headers = [
"Authorization" => "Bearer $(ENV["OPENAI_API_KEY"])",
"Content-Type" => "application/json"
]
# Build request body with Responses API format
request_body = Dict(
"model" => model,
"input" => input,
"reasoning" => Dict("effort" => reasoning_effort)
)
# Add instructions if provided
if !isempty(instructions)
request_body["instructions"] = instructions
end
body = JSON3.write(request_body)
response = HTTP.post(url, headers, body)
result = JSON3.read(String(response.body))
# NOTE: result.output is an array that can have multiple items. We are assuming a simple case here...
return result.output[end].content[1].text
end
call_gpt (generic function with 1 method)
Application: Network Analysis Assistant¶
- Let’s build a practical RAG application: an AI assistant for graph theory and network analysis
- We’ll create a knowledge base from summaries of key network concepts
- This assistant will be able to answer questions about network theory using retrieved information
- In a real application, you would load actual research papers, textbooks, or documentation
Building the Knowledge Base¶
Let’s create a knowledge base with information about graph theory and network analysis:
# Sample documents about network analysis
# In practice, you would load these from files, scrape papers, etc.
network_docs = [
"""
Clustering Coefficient
The clustering coefficient measures the degree to which nodes in a graph tend to
cluster together. For a given node, it is defined as the proportion of connections
between the node's neighbors which are also connected to each other.
Formally, for node i with degree k_i, if there are E_i edges between its neighbors,
the local clustering coefficient is C_i = 2*E_i / (k_i * (k_i - 1)).
The global clustering coefficient averages this measure across all nodes. High
clustering coefficients are characteristic of social networks and indicate the
presence of tightly-knit communities.
""",
"""
Preferential Attachment and Scale-Free Networks
Preferential attachment is a mechanism by which networks grow over time. When a new
node joins the network, it forms connections preferentially to nodes that already
have many connections - the "rich get richer" phenomenon.
The Barabási-Albert model (1999) demonstrated that preferential attachment leads to
scale-free networks with power law degree distributions. In such networks, P(k) ~ k^(-γ)
where γ is typically between 2 and 3.
Scale-free networks are characterized by the presence of hubs - nodes with
disproportionately high degree. Examples include the Internet, citation networks,
and social media platforms.
""",
"""
Small-World Networks
Small-world networks, introduced by Watts and Strogatz (1998), exhibit two key properties:
high clustering (like regular lattices) and short path lengths (like random graphs).
The small-world phenomenon, popularly known as "six degrees of separation," refers to
the observation that most pairs of nodes in many real networks are connected by short
paths, typically of length 6 or less.
The Watts-Strogatz model generates small-world networks by starting with a ring lattice
and randomly rewiring edges with some probability p. For intermediate values of p,
the network maintains high clustering while gaining shortcuts that reduce path length.
""",
"""
Centrality Measures
Centrality measures identify the most important nodes in a network. Different
centrality measures capture different notions of importance:
1. Degree centrality: Simply counts the number of connections. High degree nodes
are "hubs" that directly connect to many others.
2. Betweenness centrality: Counts how often a node lies on shortest paths between
other nodes. High betweenness nodes are "bridges" that control information flow.
3. Closeness centrality: Measures the average distance from a node to all others.
High closeness nodes can reach the network quickly.
4. Eigenvector centrality: A node is important if it connects to other important
nodes. This is the principle behind Google's PageRank algorithm.
""",
"""
Community Detection
Community detection aims to identify groups of nodes that are more densely connected
internally than to the rest of the network. Communities represent functional modules
or social groups.
The modularity measure quantifies the quality of a partition. For partition C,
modularity Q = (1/2m) * Σ[A_ij - (k_i*k_j/2m)] * δ(c_i, c_j), where m is the total
number of edges, A_ij is the adjacency matrix, and δ(c_i, c_j) = 1 if nodes i and j
are in the same community.
Popular algorithms include the Louvain method, which greedily optimizes modularity,
and spectral methods based on the graph Laplacian. Community structure appears in
social networks (friend groups), biological networks (protein complexes), and
information networks (topic clusters).
""",
"""
Network Resilience and Robustness
Network resilience refers to a network's ability to maintain functionality when nodes
or edges are removed. This is crucial for infrastructure networks, communication systems,
and biological systems.
Scale-free networks exhibit a peculiar property: they are robust to random failures
but vulnerable to targeted attacks. Removing random nodes has little effect because
most nodes have low degree. However, removing hubs quickly fragments the network.
The resilience of random graphs is more uniform - they handle both random and targeted
removal similarly. Measuring resilience involves computing the size of the largest
connected component as nodes are progressively removed.
""",
"""
Network Formation and Strategic Link Formation
In many contexts, network connections form through strategic decisions by self-interested
agents. Game-theoretic models of network formation consider the costs and benefits of
forming links.
Jackson and Wolinsky (1996) introduced the concept of pairwise stability: a network is
pairwise stable if no agent wants to sever a link and no pair of agents wants to form
a new link.
The tension between individual incentives and social welfare is central to network
formation theory. The star network, where one central node connects to all others,
often minimizes total distance (socially optimal) but may not be pairwise stable
without appropriate transfer payments.
""",
"""
Information Diffusion and Cascades
Information, behaviors, and innovations spread through networks via social influence.
The threshold model (Granovetter, 1978) posits that individuals adopt a behavior when
a threshold fraction of their neighbors have adopted it.
A cascade occurs when an initial adoption by a few nodes triggers a chain reaction
of adoptions throughout the network. Whether cascades occur depends on the distribution
of thresholds and the network structure.
Weak ties play a crucial role in diffusion (Granovetter, 1973). While strong ties
create dense, clustered groups, weak ties serve as bridges between communities and
enable information to reach new populations. This explains why job seekers often find
opportunities through acquaintances rather than close friends.
"""
]
println("Created knowledge base with $(length(network_docs)) documents")Created knowledge base with 8 documents
Now let’s create our vector store and add these documents:
# Initialize vector store
network_store = VectorStore()
# Add documents to store (this will take a moment as we embed each document)
add_documents!(network_store, network_docs)
Embedding 8 documents...
Done! Store now contains 8 documents.
Testing Our Network Analysis Assistant¶
Let’s test our RAG system with some questions about networks:
# Question 1: About clustering
q1 = "What is the clustering coefficient and how is it calculated?"
result1 = rag_query(network_store, q1)
println("Question: ", q1)
println("\nAnswer:\n", result1.answer)
println("\n" * "="^80)
println("Top sources used:")
for i in 1:min(2, nrow(result1.sources))
println("\n[$i] (similarity: $(round(result1.sources.similarity[i], digits=3)))")
println(result1.sources.document[i][1:min(200, end)] * "...")
end
Question: What is the clustering coefficient and how is it calculated?
Answer:
The clustering coefficient quantifies how much a node’s neighbors are connected to each other (how “clustered” they are).
- Local clustering coefficient for node i: C_i = 2*E_i / (k_i * (k_i - 1))
- k_i: degree of node i (number of neighbors)
- E_i: number of edges that actually exist between those neighbors
- It measures the fraction of possible neighbor-to-neighbor links that are present.
- Global clustering coefficient: the average of the local clustering coefficients over all nodes.
High values indicate tightly knit groups, common in social networks.
================================================================================
Top sources used:
[1] (similarity: 0.765)
Clustering Coefficient
The clustering coefficient measures the degree to which nodes in a graph tend to
cluster together. For a given node, it is defined as the proportion of connections
between the ...
[2] (similarity: 0.416)
Community Detection
Community detection aims to identify groups of nodes that are more densely connected
internally than to the rest of the network. Communities represent functional modules
or social...
# Question 2: About scale-free networks
q2 = "Why are scale-free networks vulnerable to targeted attacks?"
result2 = rag_query(network_store, q2)
println("Question: ", q2)
println("\nAnswer:\n", result2.answer)Question: Why are scale-free networks vulnerable to targeted attacks?
Answer:
Because scale-free networks rely on a few high-degree hubs to hold the network together. Targeted attacks that remove these hubs quickly fragment the network and shrink the largest connected component, causing a rapid loss of connectivity, even though random failures (which mostly hit low-degree nodes) have little effect.
# Question 3: About weak ties
q3 = "What role do weak ties play in information diffusion?"
result3 = rag_query(network_store, q3)
println("Question: ", q3)
println("\nAnswer:\n", result3.answer)Question: What role do weak ties play in information diffusion?
Answer:
Weak ties act as bridges between otherwise separate, densely clustered groups. By connecting different communities, they let information travel beyond local circles, reaching new populations and enabling broader cascades of adoption. This is why people often hear about job opportunities through acquaintances rather than close friends.
Comparing RAG vs Vanilla LLM¶
Let’s demonstrate the value of RAG by comparing answers with and without retrieved context:
function vanilla_llm_query(query::String; model::String="gpt-5")
"""
Query LLM without RAG (no retrieved context)
"""
return call_gpt(
query;
instructions="You are a helpful assistant.",
model=model
)
end
# Ask a specific question that requires precise information
specific_q = "What is the formula for local clustering coefficient in a graph?"
println("Question: ", specific_q)
println("\n" * "="^80)
println("Vanilla LLM Answer:")
println(vanilla_llm_query(specific_q))
println("\n" * "="^80)
println("RAG Answer:")
println(rag_query(network_store, specific_q).answer)
Question: What is the formula for local clustering coefficient in a graph?
================================================================================
Vanilla LLM Answer:
For an undirected, unweighted graph, the local clustering coefficient of a node v is:
C(v) = 2 e_v / [k_v (k_v − 1)]
- k_v = degree of v (number of neighbors)
- e_v = number of edges between neighbors of v (equivalently, number of triangles involving v)
By convention, C(v) = 0 if k_v < 2.
Note (directed variant): For directed graphs, the maximal number of possible edges among neighbors is k_v (k_v − 1), so a common definition is C(v) = e_v / [k_v (k_v − 1)], where e_v counts directed edges among neighbors.
================================================================================
RAG Answer:
C_i = 2 E_i / [k_i (k_i − 1)], where k_i is the degree of node i and E_i is the number of edges between its neighbors.
Observations:
Both answers are likely correct (this is a well-known formula in the LLM’s training data). However:
- With RAG, we know exactly where the answer came from (we can cite sources)
- With RAG, if we had proprietary or recent information, the LLM couldn’t answer without it
- With RAG, we can update knowledge without retraining the model
Let’s test with a question the LLM definitely doesn’t know:
# Add a "research paper" about a fictional concept
fictional_doc = ["""
Recent work by Spencer Lyon (2025) introduced the concept of "resonance centrality"
for dynamic networks. Unlike traditional centrality measures which focus on static
network structure, resonance centrality captures how a node's importance oscillates
over time due to feedback loops.
The resonance centrality R_i(t) for node i at time t is computed as:
R_i(t) = β * Σ_j A_ij(t) * R_j(t-1) + (1-β) * d_i(t)
where A_ij(t) is the time-varying adjacency matrix, d_i(t) is the degree at time t,
and β ∈ [0,1] controls the memory of past centrality.
Lyon showed that in social networks, nodes with high resonance centrality are often
"trend catalysts" - they don't just spread information, but trigger cascades that
feed back to amplify their influence in subsequent time periods.
"""]
add_documents!(network_store, fictional_doc)
# Now ask about it
fictional_q = "What is resonance centrality and how is it calculated?"
println("Question: ", fictional_q)
println("\n" * "="^80)
println("Vanilla LLM Answer:")
println(vanilla_llm_query(fictional_q))
println("\n" * "="^80)
println("RAG Answer:")
println(rag_query(network_store, fictional_q).answer)
Embedding 1 documents...
Done! Store now contains 9 documents.
Question: What is resonance centrality and how is it calculated?
================================================================================
Vanilla LLM Answer:
Resonance centrality is a frequency-dependent notion of node importance for networks with dynamical processes. It ranks nodes by how strongly they respond to an oscillatory (periodic) input that propagates through the network dynamics. In other words, a node is “resonantly central” if the dynamics make it amplify signals at some frequency more than other nodes.
How it is defined (general idea)
- Choose a linear dynamical model on the network (first- or second-order), and a vector b describing which nodes are externally driven.
- Drive the system with a sinusoidal input u(t) = Re{u0 e^{iωt}}.
- Compute the steady-state frequency response (transfer function) from the input to each node. The magnitude of that response at node i is its frequency-dependent centrality gi(ω).
- Resonance centrality of node i is then taken as, e.g., the peak response maxω gi(ω) (and the corresponding ω is the node’s resonant frequency). Some variants use the squared magnitude or the area under the curve across ω.
Common formulations
1) First-order (diffusive/consensus-like) dynamics
- Dynamics: x′(t) = A x(t) + b u(t) with A stable (often A = −L for a Laplacian L of an undirected graph).
- Frequency response: H(ω) = (iωI − A)^{-1} b.
- Node-i response magnitude: gi(ω) = |ei^T H(ω)| = |[(iωI − A)^{-1} b]_i|.
- Resonance centrality (peak gain): Ri = maxω gi(ω).
Notes:
- If A = −L with L ≥ 0 (diffusion), the response typically peaks at low frequency (ω → 0). Then Ri ≈ |[L^† b]_i| (L^† is the Moore–Penrose pseudoinverse), so “resonance” reduces to a low-frequency (quasi-static) sensitivity, closely related to resistance-based measures.
2) Second-order (oscillator/inertial) dynamics
- Dynamics: M x″(t) + C x′(t) + K x(t) = b u(t), with M, C, K ≽ 0 (e.g., coupled oscillators, power grids).
- Dynamic stiffness: Z(ω) = K − ω^2 M + iω C.
- Frequency response: H(ω) = Z(ω)^{-1} b.
- Node-i response magnitude: gi(ω) = |ei^T H(ω)|.
- Resonance centrality: Ri = maxω gi(ω). True resonance peaks occur near the system’s natural frequencies (eigenvalues of (M, K) with damping from C).
Spectral form (useful for computation and insight)
- For first-order A diagonalizable: A = VΛV^{-1}, H(ω) = V (iωI − Λ)^{-1} V^{-1} b, so
gi(ω) = |∑k (vik wk)/(iω − λk)|, where wk are components of V^{-1}b.
- Peaks occur where |iω − λk| is small and the mode k is excited (wk large) and observed at node i (vik large).
Practical computation
- Choose dynamics (A, or M, C, K) and the input profile b (uniform, single node, or application-specific).
- Pick a frequency grid ω ∈ [ωmin, ωmax] covering relevant dynamics (near natural frequencies if known).
- For each ω, solve a linear system:
- First-order: y(ω) = (iωI − A)^{-1} b.
- Second-order: y(ω) = (K − ω^2 M + iω C)^{-1} b.
- Take gi(ω) = |y_i(ω)|. Define Ri = maxω gi(ω) (and optionally record the argmax for the resonant frequency).
- Variants: use squared magnitude, L2 norm over frequencies ∫ gi(ω)^2 dω, or multiple input profiles and average.
Connections to other centralities
- Setting ω = 0 in first-order systems recovers static sensitivity measures linked to resistance distance and current-flow centralities.
- Evaluating (sI − A)^{-1} at complex s relates resonance centrality to resolvent/Katz-type centralities; resonance centrality is the frequency-response version (s on the imaginary axis).
In short: resonance centrality is the node-wise peak (or frequency-shaped) gain of the network’s transfer function under oscillatory driving, computed via inverses of (iωI − A) for first-order or of dynamic stiffness matrices for second-order dynamics.
================================================================================
RAG Answer:
Resonance centrality is a dynamic-network centrality measure that captures how a node’s importance oscillates over time due to feedback loops. It emphasizes not just current connections but how past influence reverberates and amplifies across time.
Calculation:
For node i at time t:
R_i(t) = β * Σ_j A_ij(t) * R_j(t−1) + (1 − β) * d_i(t)
- A_ij(t): time-varying adjacency (connection from j to i at time t)
- R_j(t−1): resonance centrality of node j at the previous time step
- d_i(t): degree of node i at time t
- β ∈ [0,1]: memory parameter controlling how much past centrality influences the present
Interpretation:
- Higher β gives more weight to past influence propagating through the network (feedback).
- Lower β leans more on the node’s current degree.
In social networks, nodes with high resonance centrality tend to be “trend catalysts” that trigger cascades which amplify their future influence.
What happened?
- The vanilla LLM either says it doesn’t know, or hallucinates a plausible-sounding definition
- The RAG system correctly retrieves and uses our fictional paper
- This demonstrates how RAG enables working with proprietary, recent, or domain-specific knowledge
Trade-offs and Design Considerations¶
Now that we’ve built a working RAG system, let’s discuss key design decisions and trade-offs.
Parametric vs Retrieval-Based Knowledge¶
Parametric Knowledge (in model weights):
- Pros:
- Fast: no need to search external sources
- Synthesizes information from vast training corpus
- Good for general knowledge and reasoning
- Cons:
- Fixed at training time
- Can’t update without retraining
- May hallucinate when uncertain
- No citations or provenance
Retrieval-Based Knowledge (via RAG):
- Pros:
- Can use up-to-date information
- Access to proprietary/specialized knowledge
- Provides citations and sources
- Reduces hallucination
- Updates without retraining
- Cons:
- Slower (requires retrieval step)
- Quality depends on retrieval accuracy
- Requires maintaining vector database
- May miss information requiring synthesis across many sources
Key Design Decisions¶
1. Chunk Size (a small sketch comparing chunk sizes follows at the end of this section)
- Smaller chunks (100-200 words):
- More precise retrieval
- Better for question answering
- May lose context across chunk boundaries
- Larger chunks (500-1000 words):
- Preserve more context
- Better for complex reasoning
- May include irrelevant information
2. Number of Retrieved Documents (top_k)
- Few documents (k=1-3):
- Focuses on most relevant information
- Faster, uses less context window
- May miss important information
- Many documents (k=5-10):
- More comprehensive context
- Better for complex queries
- May include noise, uses more tokens
3. Embedding Model
- Smaller models (text-embedding-3-small: 512-1536 dims):
- Faster embedding and search
- Lower cost
- Slightly lower quality
- Larger models (text-embedding-3-large: 3072 dims):
- Better semantic understanding
- More accurate retrieval
- Higher cost and latency
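To get a feel for the chunk-size trade-off mentioned above, here is a small sketch that runs the chunk_text function from earlier over the network_docs knowledge base at two maximum sizes; the specific sizes (300 and 1000 characters) are illustrative choices, not recommendations:
# Sketch: compare how many chunks chunk_text produces at two illustrative sizes,
# using the network_docs knowledge base defined above.
combined_text = join(network_docs, "\n\n")
for size in (300, 1000)
    chunks = chunk_text(combined_text; max_chunk_size=size)
    avg_len = round(sum(length, chunks) / length(chunks), digits=1)
    println("max_chunk_size=$size -> $(length(chunks)) chunks, mean length ≈ $avg_len chars")
end
Smaller maximum sizes yield more, shorter chunks, which retrieve more precisely but carry less surrounding context.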
Advanced RAG Techniques¶
Our implementation is a basic RAG pipeline. Production systems often use:
1. Hybrid Search
- Combine vector similarity with keyword search (BM25)
- Leverages both semantic and lexical matching
- More robust to different query types (a minimal sketch appears after this list)
2. Re-ranking
- Use a separate model to re-rank retrieved results
- Can consider query-document interaction more deeply
- Improves precision of top results
3. Query Expansion
- Generate multiple variations of user query
- Search with all variations
- Helps with vocabulary mismatch between query and documents
4. Hierarchical Retrieval
- First retrieve at document level
- Then retrieve specific passages within those documents
- Balances context and precision
5. Iterative RAG
- Let LLM decide what to retrieve next
- Multiple retrieval rounds
- Better for complex, multi-hop reasoning
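As one concrete illustration, here is the minimal hybrid-search sketch referenced in the hybrid search item above. It combines the cosine similarity search we already built with a crude word-overlap score standing in for BM25; the hybrid_search name, the alpha weighting, and the overlap scoring are all illustrative choices, not a production design:
# Sketch: hybrid search over our VectorStore, mixing semantic (cosine) and lexical
# (word-overlap) scores. A real system would use BM25 or a dedicated lexical index.
function hybrid_search(store::VectorStore, query::String; top_k::Int=5, alpha::Float64=0.7)
    query_emb = get_embedding(query)
    query_words = Set(split(lowercase(query)))
    scores = map(1:length(store.documents)) do i
        sem = cosine_similarity(query_emb, store.embeddings[i])
        doc_words = Set(split(lowercase(store.documents[i])))
        lex = length(intersect(query_words, doc_words)) / max(length(query_words), 1)
        alpha * sem + (1 - alpha) * lex
    end
    top = partialsortperm(scores, 1:min(top_k, length(scores)), rev=true)
    return store.documents[top], scores[top]
end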
Connection to Course Themes¶
RAG systems connect to several course concepts:
Networks and Information Flow
- RAG creates edges from LLM to knowledge sources
- Quality of answer depends on network structure (which documents are connected via similarity)
- Similar to how social network structure affects information diffusion
Game Theory and Mechanism Design
- In multi-agent RAG systems, agents must decide what information to share
- Strategic information retrieval: costly to retrieve everything, must be selective
- Related to costly network formation from our game theory module
Agent-Based Modeling
- Can model RAG systems as agents with retrieval actions
- Environment is the knowledge base
- Rules govern when to retrieve, what to retrieve, how to synthesize
- Next lecture will explore multi-agent AI systems in depth
Exercise 2: Build Your Own RAG System¶
Choose one of the following domains and build a RAG system:
Option A: Game Theory Assistant
- Create documents summarizing key game theory concepts from our Week 8-9 lectures
- Include: Nash equilibrium, dominant strategies, mixed strategies, auctions
- Test with questions like “What is a Nash equilibrium?” or “Explain the winner’s curse”
Option B: Julia Programming Helper
- Create documents with Julia programming tips and common patterns
- Include: array operations, data frames, plotting, packages we’ve used
- Test with questions like “How do I create a histogram?” or “What’s the syntax for filtering a DataFrame?”
Option C: Your Domain
- Choose a topic you’re interested in
- Create 5-10 documents with information
- Build and test a RAG system
For your chosen option:
- Create at least 5 documents (can be shorter than our examples)
- Build a vector store
- Test with at least 3 different questions
- Compare RAG answers vs vanilla LLM answers
- Experiment with different values of top_k (number of retrieved documents)
# TODO: your code here
# Step 1: Create your documents
my_docs = [
# Your documents here
]
# Step 2: Build vector store
my_store = VectorStore()
# add_documents!(my_store, my_docs)
# Step 3: Test with questions
# test_q1 = "..."
# result = rag_query(my_store, test_q1)
# println(result.answer)
Exercise 3: Evaluating Retrieval Quality¶
An important aspect of RAG systems is retrieval quality - are we finding the right documents?
Using our network_store:
- Create 3 test queries where you know which document(s) should be retrieved
- For each query, perform retrieval with different values of k (k=1, 3, 5)
- Examine the similarity scores - what do they tell you?
- Try a query that requires information from multiple documents
- Try a query that’s completely off-topic (e.g., about cooking) - what gets retrieved?
Questions to explore:
- Does the highest similarity document always have the best answer?
- How much does similarity score drop from rank 1 to rank 3?
- What happens when the query uses different terminology than the documents?
# TODO: your analysis here
# Example structure:
# test_queries = [
# "query 1...",
# "query 2...",
# "query 3..."
# ]
# for query in test_queries
# println("Query: ", query)
# results = search(network_store, query; top_k=5)
# # Analyze results
# end
Exercise 4: Design Decisions¶
Consider the following scenarios and discuss what RAG design choices you would make:
Scenario 1: Legal Document Q&A
- Database: 10,000 legal contracts and case documents
- Queries: Lawyers asking about specific clauses and precedents
- Requirements: High accuracy, must cite sources, complex reasoning
Scenario 2: Customer Support Chatbot
- Database: 500 FAQ articles and product documentation
- Queries: Customers asking how to use products, troubleshoot issues
- Requirements: Fast responses, conversational, handles unclear questions
Scenario 3: Academic Research Assistant
- Database: 100,000 research papers
- Queries: Researchers asking about literature, methodologies, findings
- Requirements: Comprehensive answers, identifies connections between papers
For each scenario, decide:
- Chunk size (small, medium, large)?
- Number of documents to retrieve (k=?)?
- Would you use any advanced techniques (hybrid search, re-ranking, etc.)?
- How would you handle queries that require information from multiple documents?
- How would you evaluate the system’s performance?
Write your analysis below:
Your Analysis:
Scenario 1: Legal Document Q&A
- Chunk size: ...
- k: ...
- Advanced techniques: ...
- Multi-document handling: ...
- Evaluation: ...
(Continue for other scenarios)
Looking Ahead: Multi-Agent Systems¶
- In this lecture we built a single-agent RAG system
- The LLM agent retrieves information and generates answers
- But what if we had multiple AI agents working together?
- Next lecture (A1.03) we’ll explore multi-agent AI systems:
- Multiple LLM agents with specialized roles
- Agents that coordinate and communicate
- Connections to agent-based models and game theory
- Building complex workflows with agentic systems
Think about:
- How might multiple agents improve retrieval? (e.g., one agent generates queries, another evaluates relevance)
- What if different agents had access to different knowledge bases?
- How does this relate to our study of networks and strategic interaction?
Key Takeaways¶
LLMs have fundamental knowledge limitations: cutoffs, hallucination, missing domain-specific info
RAG augments LLMs with external knowledge through retrieval from vector databases
Embeddings represent text as vectors capturing semantic meaning, enabling similarity search
A RAG pipeline has four stages: chunking, embedding, retrieval, generation
Design trade-offs matter: chunk size, number of retrieved docs, embedding model all affect quality and performance
RAG enables practical AI applications with up-to-date, domain-specific, and proprietary knowledge
RAG connects to network concepts: creating dynamic information flow from LLM to knowledge sources
Citations and provenance are key benefits of RAG over pure parametric knowledge
Further Reading¶
Foundational Papers:
- Lewis et al. (2020) “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”
- Gao et al. (2023) “Retrieval-Augmented Generation for Large Language Models: A Survey”
Advanced RAG:
- Khattab et al. (2023) “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines” arXiv:2310.03714
- Shi et al. (2023) “REPLUG: Retrieval-Augmented Black-Box Language Models” arXiv:2301.12652
Vector Databases:
- Pinecone Documentation: https://docs.pinecone.io/
- Weaviate Concepts: https://weaviate.io/developers/weaviate/concepts
Embeddings:
- OpenAI Embeddings Guide: https://platform.openai.com/docs/guides/embeddings
- Neelakantan et al. (2022) “Text and Code Embeddings by Contrastive Pre-Training” arXiv:2201.10005