
The Ultimate AI Glossary: 60+ Terms Every Developer Should Know in 2026

13 min read

From Transformers to RAG, Agents to Embeddings — decoded.


Whether you're diving into your first machine learning project or architecting enterprise AI systems, the landscape of AI terminology can feel overwhelming. This glossary cuts through the noise with clear, developer-friendly definitions — organized by category so you can navigate what you need.

Bookmark this. You'll be back.


🧠 Foundation: Core AI Concepts

Artificial Intelligence (AI)

The broad field of building systems that can perform tasks typically requiring human intelligence — reasoning, understanding language, recognizing patterns, and making decisions. AI is the umbrella; everything else in this glossary lives under it.

Machine Learning (ML)

A subset of AI where systems learn from data rather than following explicitly programmed rules. Instead of writing `if (temperature > 100) return "hot"`, you feed examples and let the algorithm figure out the pattern.
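To make the contrast concrete, here is a toy sketch of "learning from data" instead of hard-coding the rule — the cutoff is inferred from labeled examples (the midpoint heuristic is purely illustrative, not a real ML algorithm):

```python
# Instead of hard-coding `if temperature > 100: return "hot"`,
# learn the decision boundary from labeled examples.

def fit_threshold(examples):
    """Learn a temperature cutoff from (temperature, label) pairs."""
    cold = [t for t, label in examples if label == "not hot"]
    hot = [t for t, label in examples if label == "hot"]
    # Place the boundary midway between the two classes.
    return (max(cold) + min(hot)) / 2

def predict(threshold, temperature):
    return "hot" if temperature > threshold else "not hot"

data = [(70, "not hot"), (85, "not hot"), (101, "hot"), (120, "hot")]
threshold = fit_threshold(data)   # 93.0 — inferred from data, not hard-coded
print(predict(threshold, 110))    # hot
```

Change the examples and the learned boundary moves with them — that is the essential difference from a fixed rule.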

Deep Learning (DL)

A subset of machine learning that uses neural networks with many layers (hence "deep"). Deep learning powers most modern breakthroughs — image recognition, speech synthesis, large language models.

Neural Network

A computational model loosely inspired by the human brain. It consists of interconnected nodes (neurons) organized in layers that transform input data into output predictions. Each connection has a weight that gets tuned during training.
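A tiny forward pass makes the structure visible — weighted sums flowing through layers, with a nonlinearity at each neuron. The weights here are hand-picked for illustration; in a real network they are learned during training:

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of inputs plus bias, then a nonlinearity."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return math.tanh(z)

# A tiny 2-layer network: 2 inputs -> 2 hidden neurons -> 1 output.
def forward(x):
    h1 = neuron(x, [0.5, -0.2], 0.1)        # hidden layer
    h2 = neuron(x, [-0.3, 0.8], 0.0)
    return neuron([h1, h2], [1.0, 1.0], -0.5)  # output layer

print(round(forward([1.0, 2.0]), 3))
```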

Algorithm

A set of rules or instructions that a model follows to make decisions or learn from data. In ML, the algorithm defines how the model learns — gradient descent, backpropagation, etc.

Model

The trained artifact that results from running an ML algorithm on data. When people say "deploy the model," they mean the weights and architecture that now encode learned knowledge.


📊 Data & Training

Training Data

The dataset used to teach a model. Quality and quantity both matter enormously. Biased training data → biased model. Insufficient training data → an underfit model.

Test Data / Validation Data

Held-out datasets used to evaluate model performance after training. Validation data guides hyperparameter tuning during training; test data gives the final performance estimate.

Overfitting

When a model learns the training data too well — including its noise and quirks — and fails to generalize to new data. Classic symptom: 99% training accuracy, 60% test accuracy.

Underfitting

The opposite problem: the model is too simple to capture the underlying patterns. Both low training accuracy and low test accuracy.

Supervised Learning

Training with labeled examples — input/output pairs. The model learns to map inputs to correct outputs. Most classification and regression tasks are supervised.

Unsupervised Learning

Training on unlabeled data to discover hidden structure. Clustering (grouping similar items) and dimensionality reduction are common unsupervised tasks.

Reinforcement Learning (RL)

A paradigm where an agent learns by taking actions in an environment and receiving rewards or penalties. Used in game-playing AIs (AlphaGo) and increasingly in fine-tuning LLMs.

Fine-Tuning

Taking a pre-trained model and continuing to train it on a smaller, task-specific dataset. Much cheaper than training from scratch and usually yields excellent results for specialized domains.

RLHF (Reinforcement Learning from Human Feedback)

A fine-tuning technique where human raters score model outputs, and those scores train a reward model that guides further RL training. Core to how models like Claude and ChatGPT are aligned.

Batch Size

The number of training examples processed together before updating model weights. Larger batches = more stable gradients but more memory. Smaller batches = noisier gradients but potentially better generalization.

Epoch

One complete pass through the entire training dataset. Training typically runs for multiple epochs.

Learning Rate

A hyperparameter that controls how much to adjust model weights per update. Too high = unstable training. Too low = painfully slow convergence.
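The three hyperparameters above — batch size, epoch, and learning rate — all appear together in any training loop. A minimal sketch, fitting a single weight with mini-batch gradient descent (values chosen for the toy problem, not as recommendations):

```python
import random

# Fit y = w * x to data with mini-batch gradient descent.
random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 21)]    # true w = 3.0

w = 0.0
learning_rate = 0.001
batch_size = 4
epochs = 50

for epoch in range(epochs):                    # one epoch = full pass over the data
    random.shuffle(data)
    for i in range(0, len(data), batch_size):  # process one mini-batch at a time
        batch = data[i:i + batch_size]
        # Gradient of mean squared error w.r.t. w, averaged over the batch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad              # update step scaled by learning rate

print(round(w, 2))  # converges near the true weight, 3.0
```

Crank the learning rate up by 10× and this loop diverges; cut it by 10× and it crawls — exactly the tradeoff described above.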


🤖 Large Language Models (LLMs)

Large Language Model (LLM)

A neural network trained on massive text corpora to understand and generate human language. "Large" refers to billions of parameters. GPT-4, Claude, Gemini, and Llama are LLMs.

Transformer

The neural network architecture that powers virtually all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need", it replaced RNNs with a mechanism called self-attention that processes all tokens simultaneously.

Attention Mechanism

The core innovation of Transformers. Allows the model to weigh the importance of different parts of the input when generating each output token. "Attending" to the right context is what makes LLMs coherent.

Token

The basic unit of text that LLMs process. Not quite words — tokens are chunks of characters (e.g., "transformer" might be one token; "unbelievable" might be two). Most LLMs use ~4 characters per token on average.
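The ~4-characters-per-token rule of thumb is handy for quick cost and context-budget estimates. A sketch (for exact counts you'd use the model's own tokenizer, e.g. tiktoken for OpenAI models):

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token-count estimate via the ~4-chars-per-token rule of thumb.
    A ballpark only — exact counts require the model's tokenizer."""
    return max(1, round(len(text) / chars_per_token))

prompt = "Explain the attention mechanism in transformers."
print(estimate_tokens(prompt))  # ~12 tokens
```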

Context Window

The maximum number of tokens an LLM can process in a single interaction — both input and output combined. GPT-4 Turbo has 128K tokens; Claude has up to 200K. Larger context = better for long documents.

Prompt

The input text you send to an LLM to get a response. Prompt design significantly affects output quality — hence the discipline of prompt engineering.

Prompt Engineering

The art and science of crafting prompts to elicit better responses from LLMs. Techniques include chain-of-thought prompting, few-shot examples, role assignment, and structured output requests.

Few-Shot Prompting

Including a few examples of the task in the prompt (e.g., 2–5 input/output pairs) to help the model understand what you want without any fine-tuning.
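A hypothetical few-shot prompt for sentiment classification — two labeled examples, then the real input, with the format teaching the model what to produce:

```python
# Two in-context examples, then the actual input. The model is expected
# to continue the pattern and complete the final "Sentiment:" line.
prompt = """Classify the sentiment as positive or negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: positive

Review: It crashed twice in the first hour.
Sentiment: negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""
```

Remove the two examples and you have a zero-shot prompt for the same task.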

Zero-Shot Prompting

Asking the model to perform a task with no examples — just a description. Works surprisingly well with modern LLMs due to their broad pre-training.

Chain-of-Thought (CoT)

A prompting technique where you ask the model to reason step-by-step before giving its final answer. Dramatically improves performance on multi-step reasoning and math problems.

Temperature

A parameter (typically 0.0 to 2.0, depending on the provider) that controls output randomness. Temperature 0 = deterministic, always picks the most likely token. Temperature 1+ = more creative and varied. For code generation, use low temperatures; for brainstorming, use higher.

Top-P (Nucleus Sampling)

An alternative to temperature for controlling randomness. Instead of adjusting probabilities, it restricts sampling to the smallest set of tokens whose cumulative probability exceeds P. top_p=0.9 is a common default.
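Both knobs are easy to see in a from-scratch sampler. A sketch of how an inference engine might apply temperature scaling and then nucleus filtering (simplified — real implementations work on full vocabularies with optimized kernels):

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0):
    """Sample a token index with temperature scaling and nucleus (top-p) filtering."""
    # Temperature scaling: divide logits, then softmax into probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, most likely first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    weights = [probs[i] for i in kept]
    return random.choices(kept, weights=weights)[0]

logits = [2.0, 1.0, 0.2, -1.0]
print(sample(logits, temperature=0.7, top_p=0.9))
```

As temperature approaches 0, the softmax collapses onto the highest logit and sampling becomes greedy — which is why temperature 0 is effectively deterministic.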

Hallucination

When an LLM confidently generates factually incorrect information. A fundamental challenge in LLMs — they optimize for plausible text, not necessarily true text.

Grounding

Connecting model outputs to verifiable, external sources of truth to reduce hallucinations. Retrieval-Augmented Generation (RAG) is the primary grounding technique.


🔍 RAG & Retrieval

RAG (Retrieval-Augmented Generation)

An architecture that combines an LLM with a retrieval system. Instead of relying solely on training knowledge, the model retrieves relevant documents at inference time and uses them as context. Dramatically reduces hallucinations for knowledge-intensive tasks.

Vector Database

A database optimized for storing and querying embeddings (high-dimensional vectors). Used heavily in RAG systems to find semantically similar documents. Examples: Pinecone, Weaviate, Chroma, pgvector.

Embedding

A numerical vector representation of text (or images, audio, etc.) that captures semantic meaning. Similar concepts cluster together in embedding space. "cat" and "kitten" will have similar embeddings; "cat" and "blockchain" will not.
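"Similar concepts cluster together" is usually measured with cosine similarity. A sketch using toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the values below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: near 1.0 = same direction (similar meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — hand-made to mimic how related words point the same way.
cat        = [0.90, 0.80, 0.10]
kitten     = [0.85, 0.75, 0.15]
blockchain = [0.10, 0.05, 0.95]

print(round(cosine_similarity(cat, kitten), 3))      # close to 1.0
print(round(cosine_similarity(cat, blockchain), 3))  # much lower
```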

Semantic Search

Search based on meaning rather than keyword matching. Uses embeddings to find documents that are conceptually relevant, even if they don't share the exact same words.

Chunking

The process of splitting large documents into smaller pieces before embedding and storing them in a vector database. Chunk size is a critical tuning parameter in RAG — too large loses precision, too small loses context.
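A minimal character-based chunker with overlap, to show the mechanics (production systems often chunk by tokens, sentences, or document structure instead; the sizes here are arbitrary):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks. Overlap keeps context that would
    otherwise be severed at a chunk boundary."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 1200
chunks = chunk_text(doc, chunk_size=500, overlap=50)
print(len(chunks))     # 3 chunks
print(len(chunks[0]))  # 500 characters each (last one shorter)
```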

Reranking

A second-pass step in RAG that takes the top-k retrieved chunks and re-scores them using a more powerful (but slower) cross-encoder model, before passing the best results to the LLM.


🛠️ Agents & Tools

AI Agent

An LLM-powered system that can reason, plan, and take actions autonomously — not just generate text. Agents decide what tools to call, observe results, and iterate until a goal is achieved.

Tool Use / Function Calling

The ability for an LLM to call external functions or APIs as part of generating a response. The model outputs a structured "call this function with these arguments" rather than raw text.
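The general shape of a function-calling exchange looks like this. Field names vary by provider (OpenAI, Anthropic, etc.), and the schema and model output below are hypothetical stand-ins, but the pattern is the same: the app declares tools, the model emits a structured call, and the app — not the model — executes it:

```python
import json

# A JSON-Schema-style tool declaration the app sends alongside the prompt.
tool_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# What a model's function-call output typically looks like (structured, not free text):
model_output = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'

call = json.loads(model_output)
result = None
if call["name"] == "get_weather":
    # The application executes the function and feeds the result back to the model.
    result = f"Weather in {call['arguments']['city']}: 18°C, clear"
print(result)
```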

Agentic Loop

The iterative cycle an AI agent follows: observe → think → act → observe → repeat, until the task is complete or a stopping condition is met.
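The cycle above can be sketched as a loop. `call_llm` and `run_tool` here are hypothetical stubs standing in for a real model API and real tools — the point is the control flow and the two stopping conditions:

```python
def call_llm(history):
    """Stub: a real implementation would send the history to a model API."""
    if not any(msg.startswith("Observation:") for msg in history):
        return "Action: search[population of Tokyo]"
    return "Final Answer: about 14 million"

def run_tool(action):
    """Stub tool executor."""
    return "Observation: Tokyo's population is roughly 14 million."

def agent_loop(task, max_steps=5):
    history = [f"Task: {task}"]
    for _ in range(max_steps):                   # stopping condition: step budget
        step = call_llm(history)                 # think
        history.append(step)
        if step.startswith("Final Answer:"):     # stopping condition: task complete
            return step
        history.append(run_tool(step))          # act, then observe the result
    return "Gave up after max_steps"

print(agent_loop("What is the population of Tokyo?"))
```

The `max_steps` budget matters in practice: without it, a confused agent can loop indefinitely, burning tokens.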

Multi-Agent System

An architecture where multiple specialized AI agents collaborate — one might browse the web, another writes code, another reviews it. Frameworks like LangGraph and AutoGen implement this.

ReAct (Reason + Act)

A prompting framework for agents that interleaves reasoning ("Thought: ...") with actions ("Action: search[...]") and observations. Makes agent behavior more transparent and debuggable.

MCP (Model Context Protocol)

An open protocol developed by Anthropic that standardizes how AI models connect to external tools, data sources, and services. Think of it as USB-C for AI integrations — a universal interface for connecting models to the world.

Orchestration

The layer that manages the flow of an AI system — routing between agents, managing state, handling retries, and coordinating tool calls. LangChain, LlamaIndex, and LangGraph are popular orchestration frameworks.


⚙️ Model Architecture & Inference

Parameters

The learned numerical weights inside a model. "A 70B model" has 70 billion parameters. More parameters generally means more capability, but also more compute and memory.

Inference

Running a trained model to generate predictions or responses. Distinct from training. When you call an LLM API, you're doing inference.

Quantization

Reducing the numerical precision of model weights (e.g., from 32-bit floats to 4-bit integers) to decrease memory usage and speed up inference. Essential for running large models on consumer hardware.
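The memory arithmetic is straightforward and worth internalizing. A back-of-envelope calculator for weight storage alone (real deployments also need room for the KV cache and activations):

```python
def model_memory_gb(params_billion, bits_per_weight):
    """Approximate weight-storage size in decimal GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for bits in (32, 16, 8, 4):
    print(f"70B model at {bits}-bit: ~{model_memory_gb(70, bits):.0f} GB")
# 32-bit: ~280 GB, 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```

Going from 16-bit to 4-bit cuts a 70B model from ~140 GB to ~35 GB — the difference between a multi-GPU server and a single high-end workstation card (with some quality loss, which varies by quantization method).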

Latency vs. Throughput

Two key inference metrics. Latency is how long a single request takes (user-facing). Throughput is how many requests per second the system handles. There's often a tradeoff.

TTFT (Time to First Token)

The latency between sending a request and receiving the first token of the response. Critical for user experience in streaming applications.

Structured Output

Constraining an LLM to generate responses in a specific format (JSON, XML, etc.) rather than free text. Used when downstream code needs to parse the response programmatically.
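A hypothetical example of the pattern: instruct the model to emit JSON matching a shape you describe, then parse and validate before anything downstream touches it (the model response is stubbed here):

```python
import json

prompt = (
    "Extract the product and price from this sentence and respond with "
    'JSON only, like {"product": "...", "price_usd": 0.0}.\n'
    "Sentence: The new keyboard costs $129.99."
)

# What a well-behaved model might return (stubbed for the example):
raw = '{"product": "keyboard", "price_usd": 129.99}'

data = json.loads(raw)                      # raises if the model emitted invalid JSON
assert set(data) == {"product", "price_usd"}  # validate the expected keys
print(data["price_usd"])                    # 129.99
```

Many providers now offer enforced JSON or schema-constrained modes, which are more reliable than prompt instructions alone — but validating before use is good practice either way.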

System Prompt

Instructions sent to an LLM that set the context, persona, or rules for the conversation — separate from the user's message. Most API-based LLMs support a dedicated system prompt field.


🎨 Generative AI (Images, Audio, Video)

Generative AI

AI systems that can create new content — text, images, audio, video, code — rather than just classifying or analyzing existing content.

Diffusion Model

The architecture behind most modern image generation models (Stable Diffusion, DALL-E, Midjourney). Works by learning to reverse a noise-addition process — starting from random noise and gradually denoising into a coherent image.

Text-to-Image

Generating images from natural language descriptions. The prompt "a photorealistic astronaut riding a horse on Mars, golden hour lighting" produces an image.

Multimodal Model

A model that can process and generate multiple types of data — text, images, audio, video. GPT-4o and Claude 3.5 are multimodal — they can see images and respond in text.

Latent Space

The compressed, abstract representation of data learned by a model. Diffusion models generate images by navigating latent space. Embeddings are points in latent space.


🔐 Safety, Alignment & Ethics

Alignment

The challenge of ensuring AI systems behave in accordance with human values and intentions. Misaligned AI does what it was literally trained to do, not necessarily what we actually want.

Constitutional AI

An Anthropic technique where a model critiques and revises its own outputs based on a set of principles (a "constitution"), reducing reliance on human feedback for every edge case.

Guardrails

Constraints applied to model inputs or outputs to prevent unsafe, harmful, or off-topic responses. Can be implemented at the prompt level, via fine-tuning, or with a separate classifier.

Jailbreak

An attempt to bypass an LLM's safety guardrails through clever prompting — often by roleplay scenarios, hypothetical framings, or encoded instructions.

Bias

Systematic errors in model outputs reflecting unfair prejudices from training data. Algorithmic bias can perpetuate or amplify societal inequalities if left unchecked.


📐 Evaluation & Metrics

Benchmark

A standardized dataset and evaluation protocol used to compare model capabilities. Common LLM benchmarks include MMLU, HumanEval (coding), and MATH.

Perplexity

A measure of how well a language model predicts a sample of text. Lower perplexity = better. Mostly used internally during training; less useful for task-specific evaluation.

BLEU / ROUGE

Automated metrics for evaluating text generation quality by comparing to reference outputs. BLEU is common for translation; ROUGE for summarization. Both have limitations — high scores don't always mean high quality.

Evals (Evaluations)

The practice of systematically testing AI model outputs against desired behavior. Moving from vibes-based to eval-driven development is the mark of a mature AI engineering team.


🚀 Deployment & Infrastructure

API (Application Programming Interface)

The interface through which you call an LLM programmatically. Send a prompt → receive a response. OpenAI, Anthropic, Google, and others expose their models via REST APIs.

Self-Hosted / On-Premises

Running an LLM on your own infrastructure rather than via a cloud API. Required for air-gapped environments, data privacy requirements, or cost optimization at scale.

GPU (Graphics Processing Unit)

The hardware backbone of AI. GPUs excel at the massively parallel matrix multiplications that neural networks require. NVIDIA's H100 and A100 are the current gold standard for AI training and inference.

Model Serving

The infrastructure that takes a trained model and makes it available as a service — handling request routing, batching, scaling, and versioning. Tools include NVIDIA Triton, vLLM, and Ray Serve.

Streaming

Returning LLM output token by token as it's generated rather than waiting for the full response. Makes UX feel much more responsive.
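On the client side, streaming is just iterating over tokens as they arrive and rendering each one immediately. A sketch with a stub generator standing in for a real streaming API:

```python
import time

def fake_stream(response):
    """Stub generator standing in for a streaming API that yields tokens as generated."""
    for token in response.split():
        time.sleep(0.01)   # simulate per-token generation latency
        yield token + " "

# Render each token on arrival instead of waiting for the full response.
received = []
for token in fake_stream("Streaming makes the UI feel responsive"):
    print(token, end="", flush=True)
    received.append(token)
print()
```

The user sees text after one token's worth of latency (the TTFT defined above) instead of waiting for the whole generation to finish.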


🧩 Quick Reference Cheat Sheet

| Term | One-Line Definition |
| --- | --- |
| LLM | Large neural net trained on text to understand and generate language |
| RAG | Retrieval + generation to ground LLMs in real documents |
| Embedding | Numerical vector representing meaning |
| Token | Basic text unit an LLM processes |
| Fine-tuning | Adapting a pre-trained model for a specific task |
| Agent | LLM + tools + reasoning loop = autonomous task execution |
| Hallucination | Model confidently saying something false |
| Temperature | Controls how random/creative output is |
| Context Window | Max tokens the model can "see" at once |
| Quantization | Compressing model weights to run on less memory |

Wrapping Up

AI terminology evolves fast — new terms emerge with every major paper and product launch. The best way to stay current is to read primary sources (ArXiv, research blogs from Anthropic, Google DeepMind, Meta AI), build things, and stay curious.

Got a term that should be in here? Drop a comment below.

Artificial Intelligence

Part 3 of 3

Exploring the frontiers of Artificial Intelligence. From deep learning breakthroughs to practical implementation guides, this series breaks down the complex world of AI into actionable insights for developers and enthusiasts alike.
