LLM Context Engineering: Give AI the Right Info, Waste Fewer Tokens

freshco.techUncategorized1 month ago38 Views

LLM Context Engineering: Give AI the Right Info, Waste Fewer Tokens

Here’s a weird fact: the single biggest factor in whether an AI gives you a great answer or a garbage one isn’t the model you picked. It’s not the temperature setting. It’s what you put in its context window.

Think of context as the AI’s working memory. Everything you put in there — instructions, examples, conversation history, documents, tool definitions — competes for the model’s attention. Input too much and it gets distracted, input too little and it doesn’t have enough to work with. Put in the wrong things and it confidently gives you a beautifully written wrong answer.

This is context engineering: the discipline of deciding what goes into an LLM’s context window, how it’s structured, and what gets left out. It matters for everyone — from beginner using ChatGPT for the first time to a developer building AI agents that autonomously complete multi-step tasks. Context matter to every user when it comes to LLMs and AI agents.

Let’s break down how to increase context efficiency.

Also see: [How To Get Into Crypto? Roadmap For A Web3 Developer](https://freshco.tech/how-to-get-into-crypto-roadmap-for-a-web3-developer/) — another deep dive for technical learners.

What Actually Lives in Context

Every time you send a message to an LLM, the model doesn’t just see your question. It sees a bundle of information that might include:

System prompt — the “you are a helpful assistant” instructions, formatting rules, personality, and safety guidelines
Conversation history — every back-and-forth between you and the model in the current session
Retrieved documents — chunks of text pulled from a knowledge base to answer your question (RAG)
Tool definitions — for agents, descriptions of every function the model can call, including their parameters
Memory — facts about you the system has saved across sessions
Your actual message — the question or task you just asked

All of this counts against the model’s context limit. GPT-4o has 128,000 tokens. Claude has 200,000. Gemini 2.5 Pro has 1,000,000. Those numbers sound huge — but an AI agent in the middle of a complex task can burn through 30,000 tokens in a single turn just from tool definitions and conversation history.

Every token costs money and degrades quality slightly. Longer contexts make models slower, more expensive, and — counterintuitively — less accurate on the specific thing you care about, because their attention gets diluted across irrelevant information.

The Art of the System Prompt

The system prompt is the most permanent resident of your context window. It sits at the top of every single call. So every wasted word in your system prompt gets paid for over and over again.

Bad system prompt (wordy, repetitive, full of obvious instructions):

“You’re a highly intelligent, knowledgeable, and helpful AI assistant. Your job is to assist users with their questions and tasks to the best of your ability. You should always try to be helpful, accurate, and clear in your responses. You should never be rude or unhelpful. Please respond in a friendly and professional manner at all times…”

Good system prompt (tight, specific, declarative):

> “You are a technical support engineer. Answer with step-by-step solutions. If you don’t know, say so — don’t guess. Use bullet points for multi-step answers.”

The second version is a third of the length and three times more useful. Every word earns its place. The model doesn’t need to be told to “be helpful” — that’s the default. It needs to be told how to be helpful in this specific context.

Rules for system prompts

Cut adjectives and flattery. “You are a brilliant, exceptional, world-class expert in…” is just token waste. The model doesn’t have an ego.
Use declarative statements. “Respond in JSON” not “I would appreciate it if you could try to respond in JSON format if possible.”
Show, don’t just tell. One good example is worth a paragraph of instructions. If you want a specific format, include one example of that format.
Put critical instructions at the end. Models pay more attention to the beginning and the end of the prompt. Safety rules at the top, format instructions at the bottom.
Separate sections clearly. Use markdown headers (## Instructions, ## Examples, ## Output Format) so the model can structurally parse what’s what.

Smart Context Engineering: 5 Ways to Reduce Token Waste

This is the part that saves you real money and makes your AI applications actually work at scale.

1. Summarize instead of accumulate

The simplest technique: when conversation history gets long, replace the oldest messages with a summary. Instead of 20 turns of back-and-forth about debugging a Python error, you get one paragraph:

“The user had a KeyError in their pandas script. We identified the missing column, fixed the code, and verified it works. Now they’re asking about optimization.”

That’s 200 tokens doing the work of 3,000. The model keeps the context of what happened without drowning in the details of how it happened.

Most AI agent frameworks do this automatically when you approach the context limit. But you can trigger it earlier — summarize after every major task completion, not just when you’re about to hit the wall.

2. Be selective instead of comprehensive

Retrieval-Augmented Generation (RAG) is the standard approach here: instead of dumping your entire knowledge base into context, you search for only the most relevant chunks. The art is in the retrieval quality.

Common mistake: “I’ll just add 20 chunks to be safe, the model can handle it.” It can’t — not well. Research consistently shows that LLM accuracy drops as irrelevant information increases in context, even when the relevant information is also present. This is called the “lost in the middle” problem — models pay most attention to the beginning and end of context, and information in the middle gets overlooked.

Practical rule: retrieve fewer, higher-quality chunks. 3-5 highly relevant chunks beat 20 loosely relevant ones every time.

3. Use sliding windows

For real-time applications, keep only the last N messages in full detail. Everything older gets summarized or dropped. The window size depends on your use case:

Customer support chat: keep last 10 messages
Code debugging session: keep last 15 messages (code context decays slower)
Creative writing: keep last 5 messages (you want fresh thinking)

4. Compress your prompts

Prompt compression is an emerging technique where you run a smaller, faster model to rewrite a long prompt into a shorter version that preserves the essential meaning. Tools like [LLMLingua](https://arxiv.org/abs/2310.05736) (from Microsoft) can compress prompts by 2-5x while keeping task performance nearly identical.

This is especially useful for agents that build up long context through tool calls — compress the tool outputs before they go back into context.

5. Cache what doesn’t change

If you’re sending the same system prompt and tool definitions on every call, you’re paying for those tokens every single time. [Anthropic](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) and [Google](https://ai.google.dev/gemini-api/docs/caching) both offer context caching — you mark a prefix of your context as cached, and subsequent calls only pay for the new tokens. This can reduce costs by 75-90% for applications with long, stable system prompts and tool definitions.

Agent-Specific Challenges

AI agents — models that can use tools, make multiple decisions, and complete multi-step tasks — have unique context problems.

Tool definitions are expensive

Every tool an agent can use needs a detailed description in context: the function name, what it does, every parameter, every parameter’s type, every parameter’s description. Ten tools can easily consume 3,000-5,000 tokens before the agent even starts thinking.

Fix: Give tools concise names and one-line descriptions. Remove parameters the agent won’t realistically use. If an agent has 20 tools but only needs 3 for the current task, consider loading tool definitions dynamically based on what’s relevant.

Multi-turn conversations balloon fast

An agent debugging a bug might call read_file 5 times, search_files 3 times, and terminal 4 times. Each tool call and its output gets added to context. After 5 turns, you’re at 30,000+ tokens.

Fix: Summarize tool outputs. Instead of keeping the full 2,000-line file in context, keep “File opened, lines 150-175 contain the error. Function signature is def process(data: dict) -> None.”

Memory needs curation

Persistent memory across sessions sounds great, but every saved fact consumes tokens. A memory store with 50 entries that are 95% obsolete is worse than no memory at all.

Fix: Review and prune memory regularly. Save only durable facts (user preferences, environment details), not task progress (“fixed the login bug on Tuesday”).

Practical Tools and Techniques

Here’s what you can actually use today:

| Tool / Technique | What it does | When to use |

| Anthropic Prompt Caching | Caches system prompts and tool defs, reducing per-call costs by 90% | Any production agent with a stable system prompt |

| Google Context Caching | Same idea for Gemini models | Gemini-based applications |

| LLMLingua | Compresses prompts 2-5x using a small model | Long prompts, agent tool outputs |

| Structured output (JSON mode) | Forces model to output in a specific schema, reducing correction retries | Any data extraction or API-calling task |

| Dynamic tool loading | Only load tools relevant to the current task | Agents with large tool libraries (10+ tools) |

| Token counting during development | Count actual tokens used per call, set budgets | Every project — you can’t optimize what you don’t measure |

The Golden Rules

If you remember nothing else, remember these:

Every token is a vote. More irrelevant tokens dilute the model’s attention on what matters.
System prompts are real estate. Premium space at the top of every call. Make every word earn its place.
Summarize aggressively. A good summary does the work of 10x its token count.
Retrieve less, retrieve better. 3 highly relevant chunks beat 20 loosely relevant ones.
Cache what repeats. System prompts and tool definitions change rarely. Don’t pay for them every time.
Measure before optimizing. Count your tokens per call before and after changes. The numbers will surprise you.
Put the important stuff at the edges. Models pay most attention to beginnings and endings. Don’t bury your key instruction in paragraph six.

The Bottom Line

Context engineering isn’t glamorous. It doesn’t get conference keynotes or breathless blog posts about “reasoning breakthroughs.” But in practice, it matters more than almost anything else you can tune. A well-engineered context on a mid-tier model will consistently outperform a bloated context on a frontier model.

The difference between an AI agent that works reliably and one that breaks after four turns isn’t usually the model — it’s whether someone bothered to manage the context budget.

Think about context like packing for a trip. You have a finite suitcase. Everything you pack, you have to carry. The art isn’t in packing more — it’s in knowing what to leave behind.

Want to dig deeper? Check out [Anthropic’s prompt engineering guide](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering), [Microsoft’s LLMLingua paper](https://arxiv.org/abs/2310.05736), and [Google’s context caching documentation](https://ai.google.dev/gemini-api/docs/caching) for implementation details specific to each platform.

Related: [Beginners Guide: What Is the Ethereum Blockchain?](https://freshco.tech/beginners-guide-what-is-the-ethereum-blockchain/) — another foundational tech explainer.

Upvote0PointsDownvote

0 Votes: 0 Upvotes, 0 Downvotes (0 Points)