🧩

Memory & Context Management

Give your AI agents persistent memory — from conversation buffers to vector stores and hierarchical memory systems.

1

The Memory Problem

Every time you call an LLM, it starts completely fresh. It has no memory of anything that happened before. This is a fundamental architectural constraint — and the biggest challenge when building agents that need to maintain continuity.

Agent memory is usually divided into four categories:

⏱️

Short-term Memory

The current conversation — messages in the active context window. Lost when the session ends.

💾

Long-term Memory

Persisted across sessions — stored in databases, files, or vector stores. Survives restarts.

📸

Episodic Memory

Specific events and experiences — "the user reported a bug on Tuesday" or "we deployed v2.1 last week."

📚

Semantic Memory

General knowledge and facts — "the user prefers Python" or "the project uses PostgreSQL."
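These categories map naturally onto tagged records. A toy sketch (the `kind` labels and example facts are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryRecord:
    kind: str      # "short_term" | "long_term" | "episodic" | "semantic"
    content: str
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

episodic = MemoryRecord("episodic", "User reported a login bug on Tuesday")
semantic = MemoryRecord("semantic", "User prefers Python")

# Episodic records describe events; semantic records state standing facts.
standing_facts = [m.content for m in (episodic, semantic) if m.kind == "semantic"]
```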

💡

Analogy: Amnesia in the Moment

Think of an LLM without memory like a person with amnesia — brilliant in the moment, but can't remember yesterday's conversation. Every interaction, you have to re-introduce yourself and re-explain the context. Memory systems are how we cure that amnesia.

2

Memory Types

There are four core memory patterns, the same taxonomy behind LangChain's classic memory classes. Each makes a different tradeoff between completeness, token efficiency, and relevance.

📋

Buffer Memory

Store the full conversation history. Simple and complete, but grows unbounded and eventually overflows the context window.

🪟

Window Memory

Keep only the last N turns. Caps token usage but loses older context — the agent forgets early parts of the conversation.

📝

Summary Memory

Use the LLM itself to summarize older messages. Preserves key information in fewer tokens, but lossy by nature.

🔍

Vector Memory

Embed messages as vectors and retrieve only relevant ones via similarity search. Scales to massive histories.

Buffer Memory — the simplest approach. Store everything:

Python
class BufferMemory:
    def __init__(self):
        self.messages = []

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_context(self) -> list[dict]:
        return self.messages.copy()
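Window Memory — the same interface, but capped. A minimal sketch (`max_messages` is an illustrative parameter; 10 messages = 5 turns):

```python
class WindowMemory:
    def __init__(self, max_messages: int = 10):
        self.max_messages = max_messages
        self.messages = []

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Drop the oldest messages once the window is full
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self) -> list[dict]:
        return self.messages.copy()
```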

Summary Memory — compress older messages using the LLM:

Python
class SummaryMemory:
    def __init__(self, client):
        self.client = client
        self.summary = ""
        self.recent = []  # Last 3 turns

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > 6:  # 3 turns = 6 messages
            self._compress()

    def _compress(self):
        old = self.recent[:4]
        self.recent = self.recent[4:]
        # Ask LLM to update summary
        resp = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Current summary:\n{self.summary}\n\n"
                           f"New messages:\n{old}\n\n"
                           "Update the summary concisely."
            }]
        )
        self.summary = resp.content[0].text

3

Context Window Strategies

Making the most of a limited context window is critical. Here are the main strategies, from simplest to most sophisticated:

| Strategy | How It Works | Tradeoff |
| --- | --- | --- |
| Truncation | Drop the oldest messages when the window fills up | Simple but loses early context completely |
| Sliding Window | Keep only the last N tokens of conversation | Predictable cost, but no long-term recall |
| Summarization | Compress history into a rolling summary | Retains key info, but lossy — details get dropped |
| RAG | Retrieve only relevant context from a vector store | Scales well, but retrieval quality is critical |
| Hierarchical | Combine summary + recent messages + retrieved context | Best quality, but most complex to implement |

A well-designed context window allocates space for each component. Here's a typical layout:

[ System Prompt | Summary | Retrieved Context | Recent Messages | Current Input | Response Space ]
← Total Context Window →
⚠️

Don't Fill the Entire Window

Never fill the entire context window — always leave room for the model's response. A good rule: use no more than 75% for input. If you stuff the window full, the model either truncates its response or produces degraded output.
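A rough way to enforce that rule is a pre-flight budget check. This sketch uses the common ~4 characters per token heuristic; the real ratio varies by tokenizer, so treat the numbers as estimates:

```python
def fits_budget(messages: list[dict], context_window: int = 200_000,
                input_fraction: float = 0.75) -> bool:
    """Estimate tokens as chars/4 and check against the input budget."""
    est_tokens = sum(len(m["content"]) for m in messages) // 4
    return est_tokens <= int(context_window * input_fraction)

msgs = [{"role": "user", "content": "x" * 4000}]  # roughly 1,000 tokens
# With a 1,000-token window, the 75% input budget is 750 tokens
print(fits_budget(msgs, context_window=1_000))  # → False
```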

4

Vector Memory

Vector memory uses embeddings to convert text into numerical vectors, then retrieves relevant memories via similarity search. The flow is: text → embedding model → vector → store in database → query by similarity.

This approach mirrors how human memory works — you don't recall every conversation you've ever had. Instead, the current context triggers retrieval of relevant memories.

Python
import chromadb

client = chromadb.Client()
collection = client.create_collection("memories")

# Store a memory
collection.add(
    documents=["User prefers Python over JavaScript"],
    ids=["mem_001"],
    metadatas=[{"type": "preference", "date": "2025-01-15"}]
)

# Retrieve relevant memories
results = collection.query(
    query_texts=["What programming language should I recommend?"],
    n_results=3
)

# Inject into prompt
memories = "\n".join(results["documents"][0])
system = f"User context:\n{memories}\n\nBe helpful."

Why Vector Memory Matters

Vector memory is the closest thing to how humans remember — you don't recall everything, just what's relevant to the current situation. A well-tuned vector store can search across thousands of past interactions and surface exactly the context the agent needs.

5

Advanced Patterns

Production memory systems typically combine multiple memory types into a hierarchical architecture. This gives you the best of all worlds — recent context is complete, older context is summarized, and everything else is retrievable via semantic search.

Python
class HierarchicalMemory:
    """Combines buffer + summary + vector memory."""

    def __init__(self, client, vector_store):
        self.client = client
        self.vector_store = vector_store
        self.recent = []          # Last 5 turns (buffer)
        self.summary = ""         # Rolling summary
        self._count = 0           # Monotonic message counter for unique IDs

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        self._count += 1
        # Also store in vector DB for long-term retrieval.
        # A monotonic counter keeps IDs unique even after compression
        # shrinks self.recent (len(self.recent) would collide).
        self.vector_store.add(
            documents=[content],
            ids=[f"msg_{self._count}"],
            metadatas=[{"role": role, "type": "conversation"}]
        )
        if len(self.recent) > 10:
            self._compress()

    def get_context(self, current_query: str) -> str:
        # 1. Retrieve relevant past memories
        retrieved = self.vector_store.query(
            query_texts=[current_query], n_results=5
        )
        relevant = "\n".join(retrieved["documents"][0])

        # 2. Assemble hierarchical context
        return (
            f"## Conversation Summary\n{self.summary}\n\n"
            f"## Relevant Past Context\n{relevant}\n\n"
            f"## Recent Messages\n{self.recent}"
        )

    def _compress(self):
        old = self.recent[:6]
        self.recent = self.recent[6:]
        resp = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Current summary:\n{self.summary}\n\n"
                           f"New messages:\n{old}\n\n"
                           "Update the summary concisely."
            }]
        )
        self.summary = resp.content[0].text
🏷️

Entity Memory

Track facts about specific entities — users, projects, files. Update structured records over time.

🪞

Reflection

Agent periodically reflects on its memories, synthesizing higher-level insights from raw experiences.

🕳️

Forgetting

Decay old or low-importance memories. Not everything is worth remembering — controlled forgetting prevents noise.

6

Implementation Best Practices

Building memory systems is where theory meets production. These are the lessons that matter most:

📈

Start with Buffer, Scale to Vector

Simple buffer memory works for short conversations. Don't over-engineer — add vector memory only when you actually need to handle long histories or cross-session recall.

🧱

Separate Memory Concerns

Keep short-term, long-term, and working memory as distinct systems. Mixing them creates tangled code that's hard to debug and tune.

🎯

Always Set Token Budgets

Allocate fixed token budgets for each memory component — e.g., 500 tokens for summary, 1000 for retrieved context, 2000 for recent messages.

🧪

Test Memory Retrieval

Bad retrieval is worse than no memory — irrelevant context confuses the model. Test that your vector search actually returns useful results.
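Retrieval can be tested without a real vector database: stub the store with a deterministic scorer and assert that the right memory surfaces. This sketch ranks by word overlap in place of embeddings (the class and method names are illustrative):

```python
class FakeVectorStore:
    """Stand-in for a vector store: ranks documents by word overlap."""
    def __init__(self, documents: list[str]):
        self.documents = documents

    def query(self, text: str, n_results: int = 3) -> list[str]:
        words = set(text.lower().split())
        ranked = sorted(
            self.documents,
            key=lambda d: len(words & set(d.lower().split())),
            reverse=True,
        )
        return ranked[:n_results]

store = FakeVectorStore([
    "User prefers Python over JavaScript",
    "The project uses PostgreSQL",
    "We deployed v2.1 last week",
])
top = store.query("user programming language preference", n_results=1)
assert top == ["User prefers Python over JavaScript"]
```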

Check Your Understanding

Quick Quiz — 3 Questions

1. Why can't LLMs remember previous conversations by default?

2. When would you choose summary memory over buffer memory?

3. What's the main advantage of vector memory?

Topic 17 Summary

Here's what you've learned:

LLMs are stateless — memory must be explicitly managed. The four core memory types are buffer (full history), window (last N turns), summary (compressed history), and vector (embedding-based retrieval). Production systems use hierarchical memory that combines buffer, summary, and vector memory. Always set token budgets for each memory component and leave at least 25% of the context window for the model's response.

Next up → Topic 18: Agent Frameworks
You'll learn about LangChain, LangGraph, CrewAI, and other frameworks that provide memory, tool use, and orchestration out of the box.
