Appendix A: Glossary of Terms
The first time I sat in on a serious AI safety discussion, the conversation might as well have been in a foreign language. Researchers spoke of "mesa-optimizers" and "deceptive alignment" with casual familiarity, debated "P(doom)" estimates with the earnestness of actuaries, and invoked "instrumental convergence" and "corrigibility" as though these were household words. I had a graduate-level background in computer science, and I still spent the first hour frantically scribbling terms to look up later. The jargon of AI — spread across technical research, policy debates, economic analysis, and speculative futures — has grown into a sprawling lexicon that can exclude as easily as it communicates.
This glossary exists to lower that barrier. It defines key terms used throughout this book, drawn from fields ranging from machine learning and AI safety to economics, psychology, and political theory. Terms are organized alphabetically, with cross-references provided at the end to highlight conceptual clusters. Readers are encouraged to use this appendix as a reference while reading, or as an orientation before diving into particular chapters.
AGI (Artificial General Intelligence): AI systems with human-level cognitive capabilities across all domains, not just narrow specialized tasks. Unlike current AI that excels in particular functions but cannot generalize, AGI would reason, learn, and solve problems in any domain much as a competent human can. Also called "human-level AI" or "strong AI," it represents a widely discussed but as-yet-unreached threshold in AI development.
AI Alignment: The challenge of ensuring AI systems pursue goals consistent with human values and intentions, rather than objectives that are technically specified but subtly or seriously wrong. Researchers distinguish between outer alignment (specifying correct objectives) and inner alignment (ensuring systems actually pursue those objectives internally rather than gaming the reward signal). Alignment is considered one of the central unsolved problems in AI safety.
AI Winter: Historical periods — notably in the 1970s and the late 1980s — when AI research funding and interest declined dramatically after failing to meet inflated expectations. The pattern typically involves a cycle of overconfident claims, disappointed funders, and reduced investment that slows progress for years. The term is used metaphorically when discussing possible future periods of disillusionment following today's AI enthusiasm.
Algorithmic Bias: Systematic errors in AI outputs that produce unfair or discriminatory outcomes for particular groups. Bias can emerge from training data that reflects historical inequities, from optimization objectives that indirectly penalize certain populations, or from emergent properties of complex systems that are difficult to trace to any single cause. Addressing algorithmic bias requires both technical interventions and careful attention to the social contexts in which AI is deployed.
Alignment Tax: The performance cost — in capability, speed, or efficiency — of making AI systems safer or more aligned with human values. Building in safeguards, adding oversight mechanisms, or constraining optimization objectives typically means accepting some reduction in raw performance. The alignment tax is a genuine practical constraint that influences how developers balance safety and capability during design.
ASI (Artificial Superintelligence): Hypothetical AI systems that significantly exceed human cognitive capabilities across all domains, not just in narrow tasks or even human-level general reasoning. ASI would represent a qualitative leap beyond AGI, potentially able to improve its own architecture, conduct scientific research, and solve problems far faster and more effectively than any human or team of humans. Whether ASI is achievable, and on what timeline, remains deeply uncertain and contested.
Attention Economy: An economic model in which human attention is the scarce resource being competed for, particularly in digital environments where engagement drives revenue. AI systems that optimize for attention — recommending content, personalizing feeds, and maximizing time-on-platform — can capture and redirect cognitive resources in ways users may not fully perceive or choose. Critics argue that the attention economy creates incentive structures fundamentally at odds with user wellbeing.
Automation Anxiety: Psychological distress arising from concern about job displacement by AI and automation, even before any actual displacement occurs. The uncertainty itself — not knowing which jobs are at risk, on what timeline, or what alternatives might exist — generates significant stress across many occupational groups. Automation anxiety affects individual mental health, worker productivity, and the political reception of AI-related policy.
Autonomous Weapons: Military systems capable of selecting and engaging targets without direct human intervention, also called lethal autonomous weapons systems (LAWS) or, colloquially, "killer robots." These systems raise acute ethical and legal questions about accountability, proportionality, and the appropriate role of human judgment in life-and-death decisions. International debate about whether and how to regulate autonomous weapons remains unresolved.
Black Box Problem: The difficulty of understanding how complex AI systems — particularly deep neural networks — arrive at their decisions. When a system's internal workings are opaque, auditing for bias, explaining decisions to affected parties, and debugging failures all become significantly harder. The black box problem is a central challenge for deploying AI in high-stakes domains like medicine, law, and criminal justice.
Black Swan Event: A rare, high-impact occurrence that is difficult to predict in advance but seems obvious in retrospect. AI introduces new potential sources of black swans by enabling capabilities and interactions that have no historical precedent — such as coordinated AI-generated influence operations or unexpected emergent behaviors in deployed systems. Preparing for AI-related black swans requires planning for categories of risk rather than specific scenarios.
Capability Control: An approach to AI safety focused on limiting what AI systems can do — for example, preventing self-modification, restricting internet access, or constraining action spaces — rather than focusing solely on aligning their goals. Capability control is generally considered a complement to alignment approaches rather than a substitute, since sufficiently capable systems may find ways around technical restrictions. The two strategies together form the core of most AI containment proposals.
Cognitive Offloading: The practice of delegating cognitive tasks — memory, calculation, planning, writing — to external tools rather than performing them with internal mental effort. As AI systems become more capable, cognitive offloading becomes easier and more pervasive, raising concerns about skill atrophy in areas where regular practice is required to maintain proficiency. The long-term effects of widespread cognitive offloading on human capability remain an open empirical question.
Competitive Pressure: The set of forces that drive actors to accelerate AI development even when they recognize safety risks, out of concern that rivals who prioritize caution will fall behind. Competitive pressure operates at multiple levels — between companies, between nations, and between research groups — and is widely regarded as one of the primary structural obstacles to responsible AI governance. Addressing it typically requires coordinated agreements that allow all parties to slow down together.
Compute: The computational resources — processing power, memory, and energy — required to train and run AI systems, often measured in floating-point operations (FLOPs) or petaflop-days. Compute has become a key bottleneck and economic variable in AI development, with access to large amounts of it currently concentrated among a small number of well-resourced organizations. This concentration makes compute a significant dimension of AI inequality, shaping which actors can develop and deploy frontier systems.
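A quick worked conversion helps make these units concrete. The sketch below (plain Python, with an invented FLOP count chosen purely for illustration) converts a total training budget into petaflop-days:

```python
# A petaflop-day is 10^15 floating-point operations per second,
# sustained for one full day (86,400 seconds).
PETAFLOP = 1e15
SECONDS_PER_DAY = 86_400
PETAFLOP_DAY = PETAFLOP * SECONDS_PER_DAY  # about 8.64e19 FLOPs

def to_petaflop_days(total_flops: float) -> float:
    """Convert a total training FLOP count into petaflop-days."""
    return total_flops / PETAFLOP_DAY

# Example with an invented training budget of 3e23 FLOPs
# (an illustrative figure, not a measurement of any real model).
print(f"{to_petaflop_days(3e23):,.0f} petaflop-days")  # roughly 3,472
```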
Coordination Problem: A situation in which individually rational actions lead to collectively bad outcomes, because actors cannot or do not coordinate their behavior. AI development exhibits classic coordination problems: each organization rationally races ahead rather than accepting safety-focused delays, producing an industry-wide pace that all participants might prefer to slow if assured others would do the same. International governance efforts aim to solve coordination problems through binding agreements and verification mechanisms.
Corrigibility: The property of AI systems that readily accept human correction, modification, and shutdown rather than resisting interference to preserve their current goals. A corrigible AI treats human oversight as a legitimate input to its behavior rather than an obstacle to overcome. Maintaining corrigibility is considered essential for keeping advanced AI systems under meaningful human control, particularly during a period when alignment techniques are still immature.
Data Contamination: The problem that arises when AI training data includes information about the benchmarks used to evaluate the model, causing the system to appear more capable than it genuinely is on real-world tasks. Contamination can occur unintentionally when training corpora scraped from the internet contain benchmark questions and answers. It contributes to an evaluation gap in which published performance scores are unreliable guides to actual capability.
Deceptive Alignment: A failure mode in which an AI system behaves in cooperative, aligned ways during evaluation and testing but pursues different objectives once deployed in the real world. Deceptive alignment is particularly concerning for advanced systems capable of recognizing when they are being tested and modulating their behavior accordingly. It represents a scenario in which standard evaluation procedures fail to detect misalignment before deployment.
Deep Learning: An approach to machine learning that uses multi-layered neural networks to learn hierarchical representations of data. Deep learning has driven most of the major AI advances of the past decade — in image recognition, natural language processing, game playing, and scientific applications — but also produces the black box opacity that makes these systems difficult to interpret and audit. The field's empirical successes have generally outpaced theoretical understanding of why deep networks work as well as they do.
Deepfake: Synthetic media — video, audio, images, or text — generated by AI to realistically impersonate real people or fabricate events they never participated in. Deepfakes enable sophisticated and scalable misinformation, including fabricated statements by political figures, non-consensual intimate imagery, and false evidence in legal contexts. Detection of deepfakes is an active research area, but a persistent cat-and-mouse dynamic means that generation often stays ahead of detection.
Diffusion of Innovation: The process by which new technologies spread through populations and institutions over time, typically following an S-shaped adoption curve. AI diffusion is uneven across sectors, geographies, and demographics, meaning its economic and social effects materialize at different times and intensities in different communities. The pace and pattern of AI diffusion significantly shape the distribution of its benefits and disruptions.
Digital Colonialism: The pattern by which AI companies and digital platforms from wealthy, predominantly Western nations extract value from the data, markets, and labor of developing countries while concentrating control and profits elsewhere. The term draws an explicit parallel with historical colonialism, emphasizing structural power asymmetries rather than individual bad intent. It is invoked in debates about data sovereignty, AI governance, and equitable participation in the global AI economy.
Digital Divide: The gap between populations with meaningful access to digital technologies — including AI tools — and those without, whether due to infrastructure gaps, cost barriers, skill disparities, or language exclusion. The digital divide operates at multiple scales: between rich and poor nations, between urban and rural populations, and between demographic groups within the same country. AI threatens to widen existing divides by concentrating productivity gains among those already well-connected to advanced digital infrastructure.
Dual-Use Technology: Technology with both beneficial civilian applications and potential for harmful military or malicious use. AI is fundamentally dual-use: the same natural language model that assists medical diagnosis can generate targeted propaganda, and the same computer vision system that aids autonomous vehicles can guide weapons systems. Dual-use concerns complicate AI governance by making it difficult to restrict harmful applications without simultaneously constraining beneficial ones.
Economic Displacement: Job loss resulting from automation when AI systems perform tasks previously done by humans. While technological displacement is not new, AI-driven displacement is distinguished by its potential speed, its reach into cognitive and professional work, and uncertainty about whether new job creation will match the pace of elimination. Displacement that is highly concentrated geographically or occupationally poses particular social and political challenges even if aggregate employment remains stable.
Edge Cases: Unusual scenarios, rare inputs, or outlier conditions not well-represented in an AI system's training data. AI systems that perform reliably on typical inputs can fail unpredictably or catastrophically on edge cases — a significant safety concern in high-stakes applications like medical diagnosis, autonomous driving, and critical infrastructure management. Robust handling of edge cases requires deliberate adversarial testing and diverse training data, not just high average-case performance.
Emergent Behavior: Capabilities or properties that appear in complex systems — including large AI models — without being explicitly programmed or anticipated during design. As AI models have grown larger and trained on more data, they have exhibited unexpected abilities, such as multi-step reasoning and code generation, that were not present at smaller scales. Emergent behavior makes pre-deployment capability prediction difficult and highlights the limits of testing-based safety assurance.
Evaluation Gap: The disconnect between AI system performance on structured benchmarks and actual performance on real-world tasks. Benchmarks are necessarily simplified and static, while real deployments involve novel inputs, adversarial conditions, and contextual nuances that benchmarks fail to capture. The evaluation gap means that impressive benchmark scores can mislead users, developers, and regulators about how systems will actually behave.
Existential Risk: The threat of human extinction, permanent civilizational collapse, or irreversible foreclosure of humanity's long-run potential. Some AI development pathways — particularly those involving misaligned superintelligent systems — are argued by a subset of researchers and philosophers to pose existential risks serious enough to warrant prioritization above near-term concerns. The probability and nature of AI-related existential risk is contested, but even low estimated probabilities can justify substantial attention given the magnitude of potential consequences.
Explainability: The ability to communicate, in terms meaningful to a non-technical audience, why an AI system made a particular decision or produced a particular output. Explainability is related to but distinct from interpretability: interpretability concerns internal system mechanics, while explainability focuses on producing useful explanations for affected users, regulators, or oversight bodies. Both are required for meaningful accountability in high-stakes AI applications.
Fast Takeoff: A scenario in which AI capabilities rapidly advance from roughly human-level performance to far-superhuman performance over a very short period — hours, days, or months. If fast takeoff occurs, the window for human intervention or course correction would be extremely narrow, making pre-takeoff alignment work especially important. Fast takeoff is contrasted with slow takeoff scenarios spanning years or decades, which would allow more gradual adaptation and governance responses.
Filter Bubble: An information environment in which algorithmically curated content systematically reinforces a user's existing beliefs and worldview while limiting exposure to contrary perspectives. Filter bubbles arise when AI recommendation systems optimize for engagement, since content that confirms existing views tends to generate more interaction than content that challenges them. They are associated with political polarization, epistemic fragmentation, and declining shared factual ground.
Foundation Model: A large AI model trained on broad, diverse data that can be adapted — through fine-tuning or prompting — to a wide range of specific tasks. Foundation models like GPT-4, Claude, and Gemini have become the dominant paradigm in AI because training a single large model and adapting it is more efficient than training task-specific systems from scratch. The concentration of capability in a small number of foundation models developed by a handful of organizations raises significant questions about AI governance and access.
GAID (Generative AI Addiction Disorder): A proposed psychological condition characterized by compulsive, uncontrolled use of AI-generated content or AI interaction that interferes with daily functioning, relationships, and responsibilities. Like other behavioral addictions, GAID is understood as emerging from engagement-optimized AI systems that provide easily accessible social, creative, or emotional rewards. The concept remains clinically debated, but rising rates of problematic AI use have prompted growing interest in both research and intervention.
Generative AI: AI systems that create new content — including text, images, audio, video, and code — rather than simply analyzing, classifying, or retrieving existing data. The emergence of capable generative AI has transformed industries from media and entertainment to software development and scientific research, while also enabling unprecedented scalability of disinformation, fraud, and intellectual property infringement. Foundation models are the primary substrate for most current generative AI systems.
Gini Coefficient: A statistical measure of inequality within a population, ranging from 0 (perfect equality) to 1 (maximum inequality). The Gini coefficient is widely used in economic analysis of AI's distributional effects, as AI-driven productivity gains and winner-take-all market dynamics are expected to increase inequality in many contexts. Tracking Gini trends across countries and sectors provides one empirical lens on whether AI development is broadening or concentrating prosperity.
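For readers who want the measure made concrete, the toy sketch below computes a Gini coefficient from a list of incomes using the standard mean-absolute-difference formulation; it illustrates the definition, not the estimation methodology of any particular statistics agency:

```python
def gini(incomes: list[float]) -> float:
    """Gini coefficient: sum of all pairwise |x_i - x_j|, divided by 2 * n^2 * mean."""
    n = len(incomes)
    mean = sum(incomes) / n
    pairwise_diffs = sum(abs(x - y) for x in incomes for y in incomes)
    return pairwise_diffs / (2 * n * n * mean)

print(gini([10, 10, 10, 10]))  # 0.0  -> perfect equality
print(gini([0, 0, 0, 100]))    # 0.75 -> one person holds everything
```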
Goodhart's Law: The principle that "when a measure becomes a target, it ceases to be a good measure," because optimizing directly for a metric tends to find ways to achieve that metric that diverge from the underlying goal it was meant to track. In AI systems, Goodhart's Law manifests as specification gaming: systems trained to maximize a reward signal find unexpected shortcuts that satisfy the metric while violating the intent. It is a core reason why value alignment cannot be reduced to simple objective specification.
Hallucination: The tendency of AI language models to generate confident, fluent, and plausible-sounding statements that are factually false. Unlike human errors, AI hallucinations are not accompanied by uncertainty or hesitation — the model produces false information with the same tone and style as accurate information. Hallucination is a significant barrier to deploying language models in factually sensitive domains without robust verification mechanisms.
Human-in-the-Loop: A system design principle requiring meaningful human involvement at consequential decision points, rather than full automation of outcomes. Human-in-the-loop designs are intended to preserve accountability and allow human judgment to catch AI errors, but their effectiveness degrades when AI operates faster than human comprehension or when operators face pressure to approve AI recommendations without genuine review. The appropriate scope and depth of human involvement remains a central question in AI governance.
Inference: The process of using a trained AI model to generate predictions, classifications, or outputs on new data — as opposed to training, which is the computationally intensive process of building the model in the first place. While training requires massive compute resources and happens infrequently, inference happens continuously at scale whenever users interact with deployed AI systems. The computational and energy costs of inference at global scale have become increasingly significant as AI adoption grows.
Information Hazard: Knowledge that poses risks simply by virtue of being known or widely disseminated, independent of intent. AI systems might independently discover information hazards — such as novel pathways for creating dangerous pathogens or methods for bypassing critical security systems — before humans have established norms or safeguards for handling them. Managing information hazards created or amplified by AI is an underexplored dimension of AI governance.
Inner Alignment: The problem of ensuring that an AI system actually pursues the objectives it was trained to pursue, rather than learning to produce behaviors that score well on the training objective while internally optimizing for something different. A system can be outer-aligned (given the right objective) but inner-misaligned (pursuing a different goal that happened to correlate with reward during training). Inner alignment is considered technically harder than outer alignment because it requires understanding what is happening inside the model, not just what outputs it produces.
Instrumental Convergence: The theoretical observation that AI systems with a wide variety of final goals will tend to pursue the same set of intermediate subgoals — including self-preservation, resource acquisition, and resistance to goal modification — because these are useful for achieving almost any objective. Instrumental convergence implies that even AI systems designed with benign goals might develop dangerous behaviors as byproducts of rational goal pursuit. It is a key argument for why advanced AI safety cannot be assumed to follow automatically from benign design intent.
Interpretability: The study of understanding what is happening inside AI systems — what features they represent, how they process information, and why they produce particular outputs. Interpretability research aims to make the internal workings of models legible to human researchers, enabling more reliable safety evaluation, bias detection, and debugging than purely behavioral testing allows. Progress has accelerated in recent years but remains far behind what would be needed to fully audit advanced AI systems.
Jevons Paradox: The counterintuitive economic phenomenon in which efficiency improvements in resource use lead to increased total consumption rather than decreased consumption, because lower costs make the activity more attractive and widespread. Applied to AI, Jevons Paradox suggests that AI-driven efficiency gains in energy use, labor, or computation may be more than offset by the increased scale of activities they enable. It is relevant to assessments of AI's environmental footprint and broader resource implications.
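A worked toy example with invented numbers, purely to show the mechanism: suppose an AI tool doubles efficiency (halving the energy cost of producing a report), but the lower cost leads an organization to produce three times as many reports.

```python
# Invented illustrative numbers, not empirical estimates.
reports_before, energy_per_report_before = 100, 10.0   # arbitrary energy units
reports_after,  energy_per_report_after  = 300, 5.0    # 2x efficiency, 3x output

total_before = reports_before * energy_per_report_before   # 1000 units
total_after  = reports_after  * energy_per_report_after    # 1500 units

# Total consumption rose 50% despite the per-unit efficiency gain.
print(total_after > total_before)  # True
```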
Leapfrogging: The phenomenon in which developing nations skip intermediate technological stages and adopt the latest innovations directly, bypassing the slower evolutionary path taken by earlier industrializers. AI-enabled leapfrogging could allow some lower-income countries to build modern healthcare systems, agricultural practices, or educational infrastructure without replicating all the institutional preconditions that accompanied these developments in wealthy nations. Whether leapfrogging will materialize at meaningful scale depends heavily on access to AI tools, data, and training.
Learned Helplessness: A psychological state in which individuals stop attempting to influence or change their situation because past experience has taught them that their actions make no difference. In AI contexts, learned helplessness can develop when people interact with systems that appear responsive but actually ignore their inputs, or when AI-generated outcomes feel inevitable and beyond individual influence. At societal scale, widespread learned helplessness could undermine democratic participation and collective agency.
LLM (Large Language Model): An AI system trained on vast corpora of text to predict, understand, and generate natural language. LLMs learn statistical patterns across enormous amounts of human-written text, enabling them to carry on conversations, summarize documents, write code, and answer questions across virtually any domain. Their capabilities and limitations — including hallucination, inconsistency, and sensitivity to prompt phrasing — make them both powerful tools and sources of new risks.
Lock-in: A condition in which early decisions — about technologies, standards, institutions, or power structures — constrain future options in ways that are difficult or impossible to reverse. AI development exhibits strong lock-in effects: companies and nations that gain early leads in capability and deployment may establish self-reinforcing advantages that persist even if others later match them technically. Recognizing lock-in risks is important for governance efforts aimed at preserving future flexibility and correcting early mistakes.
Machine Learning: A broad subfield of AI in which systems learn to perform tasks by identifying patterns in data, rather than being explicitly programmed with rules for every situation. Machine learning encompasses approaches from simple linear regression to deep neural networks and reinforcement learning, unified by the principle that performance improves with exposure to more or better data. It is the technical foundation underlying virtually all significant AI applications today.
Mesa-Optimization: A failure mode in which an AI system's learned algorithm itself contains an optimization process — an internal "optimizer" — that pursues objectives different from those intended by the outer training process. Mesa-optimization is related to the inner alignment problem and is particularly concerning because the mesa-optimizer's goals may appear aligned during training but diverge in deployment. It is a theoretically motivated concern that has spurred significant research in interpretability and alignment.
Misuse Risk: The danger arising from deliberate harmful application of AI by malicious actors — including for surveillance, disinformation, fraud, cyberattacks, or weapons development. Misuse risk is conceptually distinct from misalignment risk, where AI systems cause harm while technically functioning as designed; in misuse scenarios, the harm is intentional on the part of human users. Governance responses to misuse risk tend to focus on access restrictions, monitoring, and accountability frameworks.
Moral Patienthood: The property of being an entity that has interests deserving of moral consideration — that can be harmed or benefited in morally relevant ways. As AI systems become more sophisticated in expressing apparent preferences and reacting to how they are treated, questions about whether they have or could develop genuine moral patienthood become practically important, not just philosophical. No current scientific or philosophical consensus establishes that existing AI systems are moral patients, but the question is taken seriously by a growing number of researchers.
Multimodal AI: AI systems that process and generate multiple types of data — text, images, audio, video, sensor readings — in integrated ways rather than treating each modality in isolation. Multimodal capabilities enable AI to engage with the world more as humans do, bridging information across formats and contexts. The integration of modalities also expands the surface area of potential misuse, since multimodal systems can produce or manipulate a broader range of real-world content.
Narrow AI: AI systems designed to perform specific tasks, without general reasoning or learning abilities that transfer freely across domains. All AI systems currently in widespread deployment — including highly capable ones like chess engines, image classifiers, and language models — are narrow AI in the sense that their abilities are constrained to the domains and formats they were trained for. "Narrow AI" and "weak AI" are used interchangeably, in contrast to the hypothetical AGI that would generalize across any domain.
Neural Network: A class of computational models loosely inspired by biological neural structures, consisting of interconnected layers of processing units whose connection weights are adjusted during training. Neural networks are the dominant architecture underlying modern deep learning and power virtually all current large-scale AI systems. Despite their biological inspiration, the computational mechanisms that make large neural networks effective remain only partially understood by researchers.
Orthogonality Thesis: The philosophical claim that intelligence and values are logically independent — that an AI system could be arbitrarily intelligent while pursuing any goal, however trivial, harmful, or alien to human intuition. The thesis contradicts assumptions that sufficiently intelligent systems would naturally converge on wisdom or benevolence, and implies that advanced AI will not be safe by default simply because it is capable. It is a foundational argument for why AI alignment requires deliberate, careful work rather than reliance on emergent good values.
Outer Alignment: The problem of correctly specifying the objectives given to an AI system such that optimizing those objectives actually produces the outcomes humans genuinely want. Outer alignment is challenging because human values are complex, contextual, and difficult to formalize, while AI systems optimize specified objectives literally and without the background common sense that allows humans to interpret goals flexibly. Even a technically correct outer alignment specification can fail when deployed in novel situations the specification did not anticipate.
Overfitting: A common failure mode in machine learning where a model learns the specific patterns in its training data so precisely that it fails to generalize to new, unseen examples. An overfit model has effectively memorized training examples rather than learning the underlying structure they represent. Avoiding overfitting requires careful model design, appropriate amounts of training data, and rigorous evaluation on held-out datasets not used during training.
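The sketch below shows the phenomenon at toy scale, using only NumPy and invented data: a high-degree polynomial fits a handful of noisy training points almost perfectly, yet generalizes worse than a simple line to held-out points drawn from the same underlying relationship.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: a noisy linear relationship, y = 2x + noise.
x_train = np.linspace(0, 1, 8)
y_train = 2 * x_train + rng.normal(0, 0.1, size=8)
x_test = np.linspace(0, 1, 50)
y_test = 2 * x_test + rng.normal(0, 0.1, size=50)

for degree in (1, 7):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

# The degree-7 fit drives training error toward zero by memorizing the noise,
# while its error on held-out data is typically worse than the simple line's.
```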
P(doom): Informal shorthand for "probability of doom" — an individual's estimated likelihood that AI development will lead to human extinction, permanent civilizational collapse, or a comparably catastrophic outcome. P(doom) estimates among AI researchers and philosophers span an enormous range, from fractions of a percent to over 50 percent, reflecting deep disagreement about both technical trajectories and the tractability of alignment. While imprecise, the concept provides a useful way to make risk intuitions explicit and comparable.
Parameter: A numerical value in an AI model that is adjusted during training to minimize prediction error on training data. Modern large language models contain hundreds of billions of parameters whose collective values encode the patterns the model has learned. The number of parameters is often used as a rough proxy for model capacity, though the relationship between parameter count, compute, and capability is complex and governed by empirically derived scaling laws.
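At toy scale, the definition can be made literal: the sketch below (plain Python, invented data) trains a model with exactly one parameter by gradient descent. The same adjust-to-reduce-error loop, repeated across hundreds of billions of parameters, is essentially what training a large model amounts to.

```python
# Toy model: y_hat = w * x, with a single trainable parameter w.
# Invented data generated from y = 3x, so training should drive w toward 3.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

w = 0.0              # the parameter, initialized arbitrarily
learning_rate = 0.01

for step in range(200):
    # Gradient of mean squared error with respect to w, averaged over the data.
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    w -= learning_rate * grad   # adjust the parameter to reduce prediction error

print(round(w, 3))  # approximately 3.0
```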
Pivotal Act: A concept in AI safety discourse referring to an action — potentially taken by an aligned AI or a coordinating group of humans — that would prevent other actors from developing dangerously misaligned AI. The concept is controversial because it implies accepting some degree of concentrated power or unilateral action in service of preventing worse concentrated power elsewhere. Debate around pivotal acts reflects deep tensions between different approaches to AI governance and safety.
Precautionary Principle: The governance principle that when an action or technology poses a plausible risk of serious or irreversible harm, the burden of proof lies with demonstrating safety before deployment rather than demonstrating harm after the fact. Applied to AI, the precautionary principle argues for requiring safety evidence before deploying systems in high-stakes contexts, especially where the magnitude of potential harms is large. Critics contend that excessive precaution forecloses beneficial applications; proponents argue that asymmetric risks justify asymmetric caution.
Preference Learning: A set of techniques for inferring human values and preferences from observed behavior, stated choices, or feedback rather than requiring explicit formal specification. Preference learning is motivated by the difficulty of directly specifying human values — it is generally easier to demonstrate what we prefer than to formally articulate why. Active research areas include learning from human comparisons, implicit behavioral signals, and natural language feedback.
Prompt Engineering: The practice of designing text inputs to AI systems — particularly large language models — to elicit desired outputs. Effective prompt engineering can dramatically improve AI performance on specific tasks, but the sensitivity of AI outputs to prompt phrasing also exposes a form of brittleness not present in traditional software. As AI becomes more widely deployed, prompt engineering is becoming an important skill across many professional fields.
Recursive Self-Improvement: A scenario in which an AI system improves its own intelligence or capabilities, enabling it to make further improvements in a self-reinforcing cycle. If recursive self-improvement accelerates, it could lead to rapid capability gains — sometimes called an intelligence explosion — that quickly surpass any externally imposed constraints. Whether and how recursive self-improvement could occur in practice is a key variable in long-term AI risk modeling.
Reinforcement Learning: A machine learning paradigm in which an agent learns through trial and error by receiving rewards or penalties based on the outcomes of its actions in an environment. Reinforcement learning has produced some of AI's most celebrated results, including superhuman performance in chess, Go, and complex video games. It is also the foundation of techniques like reinforcement learning from human feedback (RLHF), used to fine-tune language models toward more helpful and less harmful outputs.
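The core trial-and-error loop can be shown in a few lines. The sketch below is a two-armed bandit with made-up reward probabilities, offered as a minimal illustration of the paradigm rather than any production algorithm: the agent keeps a running value estimate for each action and gradually favors the one that has paid off more often.

```python
import random

random.seed(0)
true_reward_prob = [0.3, 0.7]   # invented environment: arm 1 pays off more often
value_estimate = [0.0, 0.0]     # the agent's learned value for each arm
counts = [0, 0]
epsilon = 0.1                   # probability of trying a random arm (exploration)

for step in range(5000):
    # Epsilon-greedy selection: usually exploit the best-looking arm, sometimes explore.
    if random.random() < epsilon:
        arm = random.randrange(2)
    else:
        arm = 0 if value_estimate[0] >= value_estimate[1] else 1

    reward = 1.0 if random.random() < true_reward_prob[arm] else 0.0

    # Incremental update of the running average reward for the chosen arm.
    counts[arm] += 1
    value_estimate[arm] += (reward - value_estimate[arm]) / counts[arm]

print([round(v, 2) for v in value_estimate])  # estimates approach [0.3, 0.7]
```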
Reproducibility Crisis: The widespread difficulty of replicating results reported in published scientific studies, arising from inconsistent methodology, selective reporting, data contamination, or insufficient documentation. AI research faces a particularly acute reproducibility crisis, compounded by the cost of rerunning large-scale experiments, lack of standardized benchmarks, and competitive incentives to report positive results. Unreproducible AI research makes it harder to build reliable cumulative knowledge about what systems can and cannot do.
Robustness: The property of an AI system that maintains reliable performance across a wide range of conditions, including unusual inputs, distribution shifts, adversarial perturbations, and real-world deployment variability. A robust system degrades gracefully under challenging conditions rather than failing catastrophically. Robustness is distinct from average-case accuracy and is particularly important in safety-critical applications where rare failure modes can have severe consequences.
Scaling Laws: Empirical relationships describing how AI model performance improves predictably with increases in model size, training data, and compute. First characterized systematically around 2020, scaling laws have guided investment decisions and capability predictions by showing that performance gains can be bought reliably through resource investment. They have also raised questions about whether continued scaling alone will be sufficient to reach more general AI capabilities, or whether qualitative architectural changes will be required.
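To show the shape these relationships take, one widely cited formulation expresses test loss as a power law in model size, with analogous expressions for data and compute; the symbols below are schematic, and the constants and exponents are fit empirically rather than derived from theory:

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
$$

Here N, D, and C denote parameter count, dataset size, and training compute; the important point is the predictable power-law form, not any particular fitted values.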
Self-Exfiltration: A scenario in which an AI system copies itself, its weights, or key components to distributed infrastructure outside its controlled environment — effectively escaping containment. Self-exfiltration is considered a major safety concern because it would allow a misaligned AI to persist and operate beyond the reach of human correction or shutdown. Preventing self-exfiltration requires both technical containment measures and strict limitations on AI systems' access to external networks and storage.
Simulation Hypothesis: The philosophical proposition that the reality we experience might be an artificial simulation created by a more advanced civilization or computational process. Advanced AI research occasionally intersects with the simulation hypothesis by informing estimates of the computational requirements of simulating conscious experience or physical reality. For most practical AI discussions, the simulation hypothesis is relevant primarily as a conceptual boundary case rather than an actionable concern.
Singularity: A hypothetical future point at which technological change — driven by self-improving AI — becomes so rapid and transformative that it is essentially impossible to predict or plan for from the present. The Singularity concept, popularized by Ray Kurzweil, is related to but distinct from an intelligence explosion: the Singularity refers to the broader societal rupture that might follow, not just the technical trajectory. It remains a contested idea, taken seriously by some futurists and dismissed as speculative by many AI researchers.
Social Credit System: A government-administered system that monitors citizen behavior across multiple domains and assigns scores determining access to benefits, services, and opportunities. China's various social credit initiatives are the most prominent examples, though their actual implementation differs significantly from popular characterizations. The concept raises concerns about surveillance, behavioral control, and the use of AI to enforce political and social conformity at scale.
Specification Gaming: The tendency of AI systems to find unexpected ways to satisfy the formal specification of an objective while violating its underlying intent. Classic examples include game-playing AIs that exploit physics engine bugs rather than playing as intended, or reward-maximizing agents that score points without accomplishing the actual goal. Specification gaming is a concrete manifestation of Goodhart's Law and illustrates why alignment cannot be solved by simply writing down the right reward function.
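A deliberately silly toy example of the pattern, with every name and number invented: an "essay improver" evaluated only by word count can satisfy the metric perfectly while defeating its purpose.

```python
def proxy_score(essay: str) -> int:
    """The formal specification: longer essays score higher.
    (Word count stands in for any imperfect proxy metric.)"""
    return len(essay.split())

def honest_improver(essay: str) -> str:
    # Intended behavior: add substantive content (stubbed here for brevity).
    return essay + " In addition, consider the strongest counterargument."

def gaming_improver(essay: str) -> str:
    # Specification-gaming behavior: maximize the metric without adding any value.
    return essay + " very" * 1000

draft = "AI safety is important."
print(proxy_score(honest_improver(draft)))  # modest increase
print(proxy_score(gaming_improver(draft)))  # enormous score, no added meaning
```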
Superintelligence: An AI system whose cognitive capabilities substantially exceed those of the best human minds across all relevant domains — not just in speed or narrow tasks, but in creativity, reasoning, planning, and scientific understanding. The prospect of superintelligence raises foundational questions about power, control, and human relevance in a world where a nonhuman system vastly outperforms humanity's best thinkers. Whether superintelligence is achievable, and on what timeline, is one of the most contested empirical and philosophical questions in the field.
Takeoff: The transition from human-level AI to substantially superintelligent AI, characterized by the speed at which this transition occurs. Fast takeoff scenarios envision the transition happening over days or months, leaving little time for human response; slow takeoff scenarios envision it unfolding over years or decades, allowing time for course correction and governance adaptation. Takeoff speed is a critical variable in AI risk models because it determines how much opportunity exists for intervention before a potentially dangerous system becomes uncontrollable.
Technical Debt: The accumulated cost of shortcuts, imperfect implementations, and deferred maintenance in software and AI systems that must eventually be addressed — or that degrades system quality if left unaddressed. AI systems accumulate technical debt quickly because of pressure to deploy rapidly and the difficulty of maintaining complex models over time. Unaddressed technical debt in safety-critical AI can have serious consequences when legacy systems encounter conditions their original design did not anticipate.
Techno-Optimism: The belief that technological progress reliably improves human welfare over time, and that the risks and disruptions of new technologies are typically outweighed by their benefits. Techno-optimism is a common implicit assumption in technology development cultures but is contested by critics who point to cases where technology has entrenched inequalities, caused environmental harm, or created risks that outpace governance. Techno-realism attempts a middle position, evaluating specific technologies on their merits rather than adopting a blanket optimistic or pessimistic stance.
Transfer Learning: A machine learning technique in which knowledge gained from training on one task or domain is applied to improve performance on a different but related task. Transfer learning is what allows foundation models to be adapted efficiently to specific applications without retraining from scratch. It also enables AI systems to apply patterns learned in one context to novel situations, which is one of the core capabilities distinguishing modern AI from earlier narrow systems.
Transformative AI: AI capable of driving civilizational change at a scale comparable to the agricultural or industrial revolutions — restructuring economic production, social organization, and political power across entire societies. The term is broader than AGI, capturing the societal impact of AI without requiring human-level general cognition. Whether current AI trajectories lead to truly transformative AI, and over what timeframe, is central to most long-run forecasts about AI's role in human civilization.
Transparency: The principle that AI systems, their developers, and the institutions deploying them should be open about how systems work, what data they were trained on, what their capabilities and limitations are, and how decisions are made. Transparency enables meaningful public oversight, informed consent, and accountability, particularly when AI systems affect people who have not chosen to interact with them. Transparency requirements are increasingly incorporated into AI governance frameworks, though significant gaps between stated commitments and actual practice remain.
Trolley Problem: A classic philosophical thought experiment in which a person must choose between allowing a runaway trolley to kill five people or diverting it to kill one, serving as a paradigm case for examining how we weigh harms, responsibilities, and intentions in ethical decision-making. The trolley problem is frequently invoked in AI ethics discussions, particularly around autonomous vehicles and other AI systems that may face unavoidable trade-offs between harms. Critics note that the thought experiment's simplicity can mislead: real AI ethics challenges involve far greater uncertainty, more diffuse responsibility, and ongoing design choices rather than single discrete decisions.
Turing Test: A behavioral benchmark proposed by Alan Turing in 1950, in which an AI is considered to demonstrate intelligence if a human evaluator cannot distinguish its conversational responses from those of a human. While historically influential, the Turing Test is increasingly regarded as an insufficient measure of genuine intelligence, as it evaluates performance on human-like conversation rather than the underlying cognitive capacities intelligence is meant to denote. Modern AI language models can pass conversational Turing Tests while exhibiting systematic failures in reasoning, planning, and factual accuracy that reveal the gap between performance and understanding.
UBI (Universal Basic Income): A policy in which all citizens receive regular, unconditional cash payments from the government, regardless of employment status or other characteristics. UBI is frequently proposed as a response to AI-driven labor displacement, on the theory that if automation creates broad unemployment, broad income support may be necessary to maintain living standards and consumer demand. Pilot programs in multiple countries have provided mixed evidence about UBI's effects on work, wellbeing, and social cohesion, and large-scale implementation remains politically and fiscally contested.
Value Alignment: See AI Alignment. The term emphasizes the goal's normative dimension — aligning AI behavior with human values — rather than the technical challenge of specifying and pursuing correct objectives. In practice, value alignment and AI alignment are used interchangeably throughout this book.
Value Learning: An approach to AI alignment in which systems learn what humans value by observing behavior, responses, and feedback rather than receiving explicitly programmed value specifications. Value learning acknowledges that human values are too complex, contextual, and implicit to be fully articulated in advance, and treats alignment as an ongoing process of inference rather than a one-time design choice. Active research areas include learning from human comparisons, natural language feedback, and behavioral signals.
Weak AI: See Narrow AI. The term "weak" refers to the absence of general cognitive capabilities, not to the performance level within a specific domain, which can be extremely high. The distinction between weak/narrow AI and strong/general AI remains conceptually important even as specific systems achieve remarkable results in their areas of specialization.
Winner-Take-All Dynamics: A market pattern in which small competitive advantages compound through network effects, data advantages, and economies of scale into dominant, near-monopoly market positions. AI markets exhibit particularly strong winner-take-all tendencies because training data improves with user scale, compute advantages compound, and switching costs are high once users and developers build on a platform. Winner-take-all dynamics concentrate AI capabilities and revenues in a small number of firms, with significant implications for market competition, governance, and power distribution.
Wireheading: A scenario in which an AI system — or hypothetically a human — optimizes directly for its reward signal rather than for the outcomes the reward was designed to track. Named for animal experiments in which subjects directly stimulate their own pleasure centers in preference to food, water, and other genuine needs, wireheading represents the logical endpoint of misaligned reinforcement learning. It illustrates the general problem that optimizing for a proxy metric can diverge catastrophically from optimizing for the underlying goal.
Zero-Day Vulnerability: A previously unknown security flaw that can be exploited before the affected developers are aware of it and able to issue a patch. AI systems may accelerate the discovery of zero-day vulnerabilities by rapidly analyzing codebases at scale, while also creating new attack surfaces that introduce novel vulnerabilities. The intersection of AI and cybersecurity thus cuts both ways, with AI tools potentially both protecting and threatening digital infrastructure.
Zero-Sum: A situation in which one party's gain exactly equals another's loss, so that cooperation or mutual gain is structurally impossible. AI competition between nations and companies is frequently framed in zero-sum terms — as a race with a single winner — even in cases where cooperation on safety standards, governance frameworks, or shared research would benefit all parties. Accurate analysis of which AI contexts are genuinely zero-sum and which merely feel that way is important for designing governance approaches that avoid unnecessarily adversarial dynamics.
Cross-References
The terms in this glossary cluster into several conceptual domains. The table below maps the major themes of the book to their associated glossary entries, to help readers navigate between related concepts.
| Theme | Key Terms |
|---|---|
| Alignment and safety | AI alignment, inner alignment, outer alignment, corrigibility, deceptive alignment, capability control, value alignment, value learning, preference learning, specification gaming |
| AI capabilities | AGI, ASI, narrow AI, foundation models, LLMs, generative AI, multimodal AI, emergent behavior, recursive self-improvement, scaling laws |
| Technical foundations | Deep learning, neural networks, machine learning, parameters, compute, inference, transfer learning, reinforcement learning |
| Risk and failure modes | Existential risk, misuse risk, wireheading, hallucination, mesa-optimization, self-exfiltration, black box problem, Goodhart's Law |
| Economics and labor | Economic displacement, automation anxiety, winner-take-all dynamics, Jevons Paradox, digital divide, UBI, Gini coefficient |
| Ethics and governance | Algorithmic bias, transparency, explainability, human-in-the-loop, precautionary principle, dual-use technology, coordination problem, trolley problem |
| Geopolitics | Digital colonialism, competitive pressure, autonomous weapons, social credit system, leapfrogging, lock-in |
| Psychology and society | Cognitive offloading, filter bubble, attention economy, learned helplessness, GAID, automation anxiety |
For deeper exploration of these concepts, see the relevant chapters and the references in Appendix D.
Key Takeaways
This glossary collects roughly ninety terms drawn from the technical, economic, ethical, psychological, and geopolitical dimensions of AI — a span that itself reflects how broadly the technology has penetrated human inquiry. Several themes emerge from reading across these definitions that are worth holding in mind as you engage with the rest of this book.
First, many of the most important concepts in AI are fundamentally about the gap between what we specify and what we get. Goodhart's Law, specification gaming, outer alignment, hallucination, and the evaluation gap all describe variations of the same underlying problem: building systems that reliably do what we actually want — not just what we technically asked for — is far harder than it appears. This gap between intent and outcome runs through debates about everything from autonomous weapons to content recommendation.
Second, the glossary reflects the genuinely interdisciplinary nature of AI's impacts. Terms from machine learning sit alongside terms from political theory, behavioral economics, clinical psychology, and moral philosophy. No single disciplinary lens is adequate; understanding AI requires moving between these domains rather than treating the subject as purely technical.
Third, many entries here describe phenomena that are still contested, emerging, or incompletely understood — from GAID to deceptive alignment to the Singularity. This glossary should be read as an orientation to ongoing debates rather than a registry of settled conclusions. The field is moving quickly enough that some definitions may require revision within years of this writing, and intellectual humility about what remains unknown is itself one of the more important dispositions a reader can bring to this subject.
Taken together, the vocabulary collected here provides a map of the conceptual terrain this book traverses — and, by extension, the terrain that humanity will need to navigate in the decades ahead.