Embodied AGI: Navigating challenges through AEAI
AEAI seeks an AI that truly interacts with the physical world. Photo: TUCHONG
Over the past 70 years, researchers have dramatically expanded the horizons of artificial intelligence (AI). Viewing generative AI as a potential pathway toward artificial general intelligence (AGI), recent academic and industrial efforts have sought to refine it into foundational infrastructure for AI development. By emphasizing agentic action and advocating for “Affordance-Enactive AI” (AEAI)—centered on “local world models,” “ecological niche generality,” and “autonomous agency”—we may chart a new course that moves beyond the prevailing fixation on monolithic, universal intelligence.
From affordance theory to AEAI
The dawn of embodied AI is upon us. Long before the emergence of concepts such as the “Embodied Turing Test” or “Embodied AGI,” scholars were already exploring embodied cognition, forming the school of generative cognition. This tradition examines consciousness and cognition through the lens of “ecological niches.” At its core is James J. Gibson’s notion of “affordances”—the actionable possibilities an environment provides to an agent. Gibson’s pioneering work revolutionized our understanding of perception, arguing that animals directly perceive opportunities for action through visual stimuli. For example, a chair “affords” sitting, while a door “affords” passage. Crucially, these affordances are not static; they depend on an observer’s physical posture, skills, and intentions. In AI, “affordances” and “optical flow” provide a robust framework for modeling the intricate dance between intelligent agents and their surroundings, and this framework has proven invaluable in fields like human-computer interaction, autonomous driving, multimodal perception, and the burgeoning domain of embodied intelligence.
Gibson originally coined “affordances” to underscore the decisive role of visual perception in enabling animal action, highlighting the tight coupling of perception and action. To pay tribute to his enduring influence on AI, Stanford AI Lab developed “iGibson,” a virtual environment designed to train agents in interactive tasks. Building on this foundation, Fei-Fei Li has identified embodied AI, visual reasoning, and scene understanding as critical “North Star” areas for future research. The goal is to equip machines with the ability to interpret 3D relationships from 2D scenes, decode social dynamics, and execute complex human tasks.
One promising solution is to embed large-scale models into intelligent agents. Multimodal generative AI, capable of processing diverse inputs, makes it feasible to forge a “language-visual-action” triangle—a seamless integration of comprehension and execution. A breakthrough in this direction came with “VoxPoser,” an agent that marries affordance-based algorithms with generative AI technology. Hailed as a milestone, VoxPoser demonstrates how large models can be integrated into physical entities in real-world scenarios. Without additional training, it infers an object’s affordance constraints from natural language instructions, translating the “language-visual-action” triad and complex directives into concrete action plans. The emergence of spatial intelligence further signals the industry’s deepening engagement with embodied agents.
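The general recipe behind such systems can be illustrated with a deliberately simplified sketch. The code below is not VoxPoser’s actual implementation; the grid layout, cost values, and the hand-built objects dictionary are hypothetical stand-ins for what a language model and a perception module would supply. It only shows how a cost map encoding affordance constraints can be turned into a concrete sequence of waypoints.

```python
# Hypothetical "language-visual-action" sketch: affordance constraints become a
# cost map over a coarse spatial grid, and a simple planner turns that map into
# waypoints. None of these helpers belong to any real system.
import numpy as np

def affordance_cost_map(instruction: str, objects: dict, shape=(20, 20)) -> np.ndarray:
    """Toy 2D cost map: cells near obstacles are penalized, cells near the target attract.
    In a real pipeline the instruction would be parsed by a language model;
    here the map is hand-built for illustration."""
    cost = np.ones(shape)
    tx, ty = objects["target"]                                    # assumed perception output
    for ox, oy in objects.get("obstacles", []):
        cost[max(ox - 1, 0):ox + 2, max(oy - 1, 0):oy + 2] += 10.0  # keep-away constraint
    xs, ys = np.indices(shape)
    cost += 0.1 * np.hypot(xs - tx, ys - ty)                      # pull toward the target
    return cost

def plan(cost: np.ndarray, start: tuple, steps: int = 60) -> list:
    """Greedy descent over the cost map, standing in for a real motion planner."""
    path, pos = [start], start
    for _ in range(steps):
        x, y = pos
        neighbors = [(x + dx, y + dy)
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if 0 <= x + dx < cost.shape[0] and 0 <= y + dy < cost.shape[1]]
        pos = min(neighbors, key=lambda p: cost[p])
        path.append(pos)
    return path

objects = {"target": (15, 15), "obstacles": [(8, 8)]}             # toy stand-in for perception
waypoints = plan(affordance_cost_map("place the cup on the shelf", objects), start=(2, 2))
```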
However, while the AI community and media often fixate on the vision of “embodied AGI,” we believe that humanity’s future needs AI agents that can freely interact with humans, understand and be understood by them, coexist harmoniously, and act appropriately in real-world settings. If the conditions under which large language models can be embedded with affordance algorithms to enable human-machine interaction are not clearly defined—as cautioned by the “embodied-general paradox” and its inherent risks—theoretical and practical obstacles will persist. Without overcoming these, embodied AGI may remain an unattainable utopia. So, what should our true aspirations be, and where might a theoretical consensus lie?
AEAI: Application across ecological niches
Based on the above investigation of the foundations of AI technology, we believe that future AI research should draw on the theoretical resources of generative cognition, shifting the focus from multimodal generated virtual content to generative actions that can interact with the outside world. In other words, the emphasis should be on embodied intelligent agents that adapt to their environment and master affordances.
To this end, we propose a new concept: Affordance-Enactive AI. The development of this AI centers on actions within diverse, affordance-distributed scenarios, or “ecological niches.” In the ecological psychology advocated by Gibson, an ecological niche describes the position a biological population occupies within an ecosystem. The concept also extends to the spatiotemporal conditions under which organisms can interact with their environment (including other organisms) in response to specific environmental factors, together with the set of actionable survival resources available to them. In AEAI, agents operate within these niches, constructing local world models and actively exploring universal action patterns suited to their surroundings. This capacity, dubbed “niche generality,” implies that intelligent agents must be able to extract real-time environmental information from their local world models and discern the action possibilities the environment affords. Perception here extends beyond physical sensory input to include the recognition of data patterns, social interactions, and even cultural contexts.
A local world model is an internal representation constructed by the intelligent agent, capturing the key characteristics and dynamics of its ecological field. This concept closely aligns with the predictive processing theory of the mind, which holds that the brain functions as a prediction machine. Through sensory input, it continually refines environmental forecasts to minimize discrepancies between perception and internal expectations. Within the AEAI framework, this theory clarifies how local world models are built—intelligent agents rely on internal simulations to anticipate dynamic changes in their ecological fields. This predictive capability allows them not only to adapt to their current environment but also to forecast future states, incorporating potential consequences into decision-making and facilitating more efficient problem-solving and task execution. As a predictive adjustment mechanism, the local world model enables intelligent agents to meet the universal demands of their ecological fields.
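A minimal numerical sketch of this predictive adjustment, assuming a toy linear generative model and a fixed step size (both illustrative choices, not a claim about any particular architecture): the agent’s internal state is nudged repeatedly until the observation it predicts matches the observation it receives.

```python
# Minimal sketch of prediction-error minimization in a local world model:
# the agent adjusts an internal belief so that its predicted observation
# matches incoming sensory input. Model and step size are illustrative.
import numpy as np

W = np.array([[1.0, 0.5], [0.0, 2.0]])            # toy generative model: observation = W @ belief

def update_belief(belief, observation, lr=0.05, steps=200):
    """Nudge the internal state until its predicted observation matches the sensed one."""
    for _ in range(steps):
        error = observation - W @ belief          # prediction error
        belief = belief + lr * (W.T @ error)      # gradient step that shrinks the error
    return belief

observation = np.array([1.0, 3.0])                # a new sensory reading
belief = update_belief(np.zeros(2), observation)  # converges toward an explanation of the reading
```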
Self-optimization & free energy principle
Autonomous agency emphasizes an intelligent agent’s capacity for self-driven action, encompassing goal-setting, self-regulation, and self-optimization. This is closely tied to the free energy principle, which suggests that all living systems strive to minimize the free energy associated with their survival states—essentially reducing the prediction error between internal models and perceptual data. In AEAI, modeling based on this principle enables agents to refine their interactions with the environment by proactively adjusting their actions. Rather than merely reacting to stimuli, autonomous agents actively seek optimal action strategies to enhance both their survival and functional effectiveness. Active inference, an application of the free energy principle, further underscores this process, as agents minimize prediction errors through action, thereby improving environmental adaptability.
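The action side of this idea can be sketched just as simply. In the toy loop below, assuming a one-dimensional state, a hand-picked set of candidate moves, and a preferred observation (all illustrative assumptions), the agent scores each action by the prediction error it expects to remain afterwards and takes the one that minimizes it.

```python
# Toy active-inference loop: the agent prefers a target sensory state and
# chooses, at each step, the action expected to leave the smallest gap
# between predicted and preferred observations. Dynamics are illustrative.
import numpy as np

preferred_obs = 5.0                        # the state the agent "expects" to find itself in
actions = np.array([-1.0, 0.0, 1.0])       # candidate moves

def predicted_obs(state: float, action: float) -> float:
    """Toy forward model: acting shifts the observed quantity by the action."""
    return state + action

def select_action(state: float) -> float:
    # expected free energy collapses here to squared expected prediction error
    scores = [(predicted_obs(state, a) - preferred_obs) ** 2 for a in actions]
    return actions[int(np.argmin(scores))]

state = 0.0
for _ in range(10):
    a = select_action(state)               # act so as to reduce expected surprise
    state = state + a                      # the environment responds to the action
# after a few steps the state settles near the preferred observation (5.0)
```

Acting is itself a way of reducing surprise here, which is the core intuition behind active inference.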
AEAI facilitates autonomous agency and effective active inference by encouraging agents to actively explore and leverage affordances within ecological fields. The requirement for universal application in ecological contexts means that intelligent agents must function effectively across diverse environments. Through active inference, they not only recognize the affordances provided by their surroundings but also experiment with and implement the most suitable action strategies across different domains. This adaptability equips them to handle a broad spectrum of tasks and environmental conditions.
Predictive processing, free energy, and active inference thus intertwine within the AEAI framework, forming a cohesive theoretical foundation for agents to comprehend and navigate their ecological niches. This integrated approach not only deepens our understanding of agent behavior and cognition but also paves the way for designing AI systems that autonomously adapt to complex, ever-changing environments. By infusing AI with a precise, niche-adaptive “embodied” dimension, AEAI offers a practical vision that contrasts with the abstract pursuit of universal intelligence.
Ecological niches & local world models
Here, “body” transcends the traditional robotic form. It includes the open knowledge systems, software interfaces, and broader actor networks that agents depend on to act. In this sense, AEAI resembles a large model thriving on small, context-specific data. Unlike the illusory chase for an AGI “Holy Grail,” AEAI anchors itself in the tangible actions of AI agents, prioritizing embodied strategies that take root in reality. Affordances not only outline the possibilities and methods for flexible interaction with the external world but also serve as a guiding compass for agents within ecological niches.
Before agents engage the physical world, video games offer a valuable testing ground for ecological world models. These virtual environments can train agents to refine their skills in controlled yet dynamic settings. Looking ahead, embodied AI could evolve into predictive machines with survival instincts, leveraging active inference to make smarter decisions and take more appropriate actions in real-world contexts. This practical approach shifts the focus from theoretical grandeur to actionable progress, aligning AI development with the complexities of lived experience.
Future research, therefore, should not aim for a single, super-intelligent entity capable of all tasks. Instead, it should cultivate a suite of agents tailored to specific fields, making precise decisions and actions based on affordances and causal world models. This would realize a generative AI with authentic environmental understanding. If further explored, we can envision establishing a network of AEAI agents, each with distinct capabilities, collaborating across ecological fields to form a multi-tiered, intelligent system. Their collective intelligence might approximate “general intelligence” in scope, yet remain grounded in practical, context-specific applications.
This vision aligns with our consistent call to demystify the obsession with generality. AEAI seeks an AI that truly grasps the physical world, equipped with real-world models that reflect its nuances. Yet, in pursuing this, are we not crafting another “holy grail” of sorts?
Xue Shaohua is an associate professor from the School of Education at Beijing Institute of Technology. Liu Xiaoli is a professor from the School of Philosophy at Renmin University of China.
Edited by ZHAO YUAN