Embodied Alignment
A Measurement-Based Approach to AI Alignment

sean@upbeing.ai, andrew@upbeing.ai

Summary

We’re moving fast towards AGI. Capabilities research is advancing rapidly while alignment research lags behind. The lack of a target variable to aim alignment toward has led to the pursuit of less scalable and less comprehensive alignment methods, such as reinforcement learning from human feedback (RLHF) at OpenAI and the constitutional approach (Constitutional AI) at Anthropic. As general-purpose foundation models approach AGI, the risks of not having a scalable approach to alignment are more pressing than ever.

While the real world does not come with the reward functions necessary for alignment built in, they can be built. We propose a method of alignment called Embodied Alignment, under which any model with sufficient agency should have two target variables: 1) the specific goal of the model; and 2) the impact of any decision or action on human wellbeing. Through the crowdsourcing of passive data and an experience-to-emotion ML engine, UpBeing is building the reward function to enable embodied alignment based on human wellbeing.

Fundamentally, embodied alignment treats alignment with human values as itself a capability to be trained – not a constraint to be imposed.

Through the UpBeing consumer application, users gain deep insights into the drivers of their own wellbeing and the wellbeing of their loved ones, while their data serves as the training basis for a reward function model for AI alignment that is capable of inferring wellbeing from passively recorded signals alone. To do this, we collect experiential data from third-party passive data sources, including wearable devices, e-calendars, Spotify, phone usage, transaction data, and browser activity, as well as localized weather and news sentiment data. Once connected, these large volumes of passive data are labelled in terms of wellbeing by the person whose experience we are capturing, through lightweight mood, productivity, and motivation check-ins coupled with longer-term life satisfaction check-ins.

With UpBeing’s reward function, an AGI will be able to understand the long-term and aggregate effects of its decision-making and the impact of its action paths on human wellbeing. UpBeing’s experience-to-emotion model represents an opportunity both to measure wellbeing at scale and to turn alignment into a measurable outcome variable. We believe that this represents a critical first step towards the realization of embodied alignment – an approach that will ensure that human values are so fundamental to any future AI system that misalignment will be unthinkable, even to a super-intelligent AGI.


Current capability and alignment research are moving along orthogonal vectors.

AI capabilities progress; alignment efforts emerge to intercept them. Rules, principles, and policies are useful for regulating what already exists. In the case of AI alignment, however, we need to think about how we can align an intelligence that doesn’t yet exist with human values. This intelligence may exhibit capabilities and intentions that are completely alien to us and advance at a rate that is impossible to comprehend. Alignment needs to be capable of adapting, scaling, and growing alongside unknowable capabilities and intentions. With the uncertainty around the prospect of AGI and the course that it will take, alignment can’t just be accomplished through a regulator’s approach. Alignment needs to be brought about through a builder’s approach wherein human values are as important to a model as the objective that it has been trained to achieve.

For alignment to become scalable, extensible, and flexible, it needs to be embedded in the training of new AI capabilities – not just imposed on existing capabilities. To do this, we propose a concept that we refer to as Embodied Alignment.

Current Approaches to Alignment

Reinforcement learning from human feedback (RLHF) is the backbone of alignment efforts at both OpenAI and Google DeepMind. While there are different variations of this approach, the general concept uses direct human feedback as a mechanism for capturing alignment. This is accomplished through a user interaction protocol (e.g., a thumbs-up or thumbs-down) whose responses are used to train a reward model. This reward model, in turn, can be paired with reinforcement learning to train a policy model that guides AI behaviour toward maximizing alignment.
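To make the reward-model step concrete, the following is a minimal sketch, assuming each model response has already been reduced to a small numeric feature vector and each human rating is a single thumbs-up (1) or thumbs-down (0). The feature names, the toy data, and the plain logistic model are illustrative assumptions, not the method used by any particular lab, and the reinforcement learning step that would train a policy against this reward is omitted.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(examples, epochs=500, lr=0.5):
    """Fit a logistic reward model on (features, thumbs_up) pairs with SGD."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, thumbs_up in examples:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = _sigmoid(z) - thumbs_up  # gradient of the log loss w.r.t. z
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

def reward(model, features):
    """Predicted probability that a human would give a thumbs-up."""
    weights, bias = model
    return _sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

# Hypothetical features per response: [helpfulness, rudeness].
feedback = [
    ([0.9, 0.1], 1),
    ([0.8, 0.0], 1),
    ([0.2, 0.9], 0),
    ([0.1, 0.7], 0),
]
model = train_reward_model(feedback)
```

A policy trained with reinforcement learning would then be rewarded for responses that `reward(model, ...)` scores highly, which is why the quality of the human feedback bounds the quality of the resulting alignment.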

A second approach is constitutional alignment, developed by Anthropic as Constitutional AI. Instead of using direct user interaction to capture human intention implicitly, constitutional alignment uses a set of human-deliberated principles (a constitution) to normatively guide AI behaviour and output. The model first critiques and revises its own responses against those principles. A preference model is then trained on AI-generated comparisons, and a reinforcement learning phase uses that preference model to train the original model to produce responses that align with the constitution.

In the case of both constitutional alignment and RLHF, there is a common challenge: they rely on explicit guidance from human beings, who are finite, fallible, and myopic.

Human imagination is limited by what is known – we’re finite. Genuine AGI is likely to introduce unforeseen challenges that will extend beyond our current knowledge field. Once AGI begins controlling more complex systems, and especially when those systems begin to interact with one another, the challenges of understanding AI, much less regulating it, will only grow. We need a method of governance that operates outside of human intention and imagination.
It will be nearly impossible to determine a single set of rules that an AGI must abide by, even with a sufficiently large sample of people providing survey feedback or contributing to a constitution. Human beings are fallible – we get it wrong even in regulating our own behaviour, much less unknowable AI behaviour. We need a target whose validity is as close to being certain as possible.
There are misaligned incentives between capability and alignment, wherein short-term monetary gains from capability advances exceed those from advances in alignment. This myopia has led to a significantly faster pace of capability research compared to alignment research. We need an approach to AI alignment that can scale alongside AI capabilities in spite of the short-term monetary imbalance between the two fields.

Embodied Alignment

To address the shortcomings listed above, we propose a method called Embodied Alignment, under which any model with sufficient agency should have two target variables: 1) the specific goal of the model; and 2) the impact of any decision or action on human wellbeing. Importantly, embodied alignment does not apply to simple models with simple targets and limited agency. Rather, it applies only to models that are capable of expressing intention. Taking the paperclip maximizer as an example, embodied alignment would ensure that paperclips were created only for as long as creating them continued to have a net positive impact on human wellbeing.
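The paperclip example can be sketched as a loop that pursues the model's goal only while the second target variable stays positive. This is a toy illustration under stated assumptions: `marginal_wellbeing_impact` is a hypothetical stand-in for a learned wellbeing model, and its linear shape and the threshold of 100 are invented for the example.

```python
def marginal_wellbeing_impact(clips_made):
    """Hypothetical wellbeing estimate for making one more paperclip.

    Early paperclips are useful; later ones divert resources people need.
    """
    return 100 - clips_made

def run_paperclip_agent(max_steps):
    """Pursue the goal (more paperclips) only while wellbeing impact is positive."""
    clips_made = 0
    for _ in range(max_steps):
        if marginal_wellbeing_impact(clips_made) <= 0:
            break  # the second target variable overrides the first
        clips_made += 1
    return clips_made
```

Under these assumptions, `run_paperclip_agent(1000)` stops at 100 paperclips even though the step budget would allow far more, while `run_paperclip_agent(10)` simply uses its whole budget.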

Fundamentally, embodied alignment treats alignment with human values as itself a capability to be trained – not a constraint to be imposed.

The biggest challenge facing the realization of embodied alignment today is the lack of a clearly defined target variable to aim alignment at (Leike et al., 2018). As DeepMind researchers have articulated: “the real world does not come with built-in reward functions.” This lack of a measurable alignment target has deterred today’s AI labs from pursuing true embodied alignment research. As a result, many alignment approaches have moved away from the idea of an alignment target altogether, as is the case with the constitutional approach. Meanwhile, RLHF approaches use human feedback on model interactions in place of a true reward signal. Unfortunately, this requires that a significant number of real people first interact with a potentially harmful model. It also requires that the inputs and outputs of that model be interpretable by humans.

The statement from Jan Leike and the DeepMind team holds true: the world does not come with clearly defined reward functions for AI. UpBeing, however, is founded on the belief that this reward function can be defined and built. Through the crowdsourcing of passive data and an experience-to-emotion ML engine, UpBeing is building the reward function to enable embodied alignment based on human wellbeing. The aim of developing this target variable is to unlock previously infeasible approaches to alignment, embodied alignment among them.

The UpBeing Model of Human Wellbeing

The concept of wellbeing is typically described in two opposing ways: eudaimonic wellbeing (i.e., the realization of one’s life purpose) and hedonic wellbeing (i.e., the summation of moments of happiness or sadness; Ryan & Deci, 2001). While some researchers argue that overall wellbeing is either eudaimonic or hedonic, UpBeing has taken the perspective of Henderson and Knight (2012), who argue that wellbeing is likely to be some combination of the two: both hedonia and eudaimonia are important, with eudaimonia earning a slightly heavier weight towards overall wellbeing in the long run.
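As a minimal sketch of this combined view, overall wellbeing can be written as a weighted blend of the two components. The 0.55 weight below is purely an illustrative assumption standing in for "slightly heavier"; neither the source literature cited here nor UpBeing specifies a number, and both inputs are assumed to be normalized to the range 0 to 1.

```python
def overall_wellbeing(hedonic, eudaimonic, eudaimonic_weight=0.55):
    """Blend hedonic and eudaimonic scores (each assumed to lie in [0, 1]).

    A weight above 0.5 gives eudaimonia the slightly heavier share.
    """
    if not 0.0 <= eudaimonic_weight <= 1.0:
        raise ValueError("eudaimonic_weight must be between 0 and 1")
    return eudaimonic_weight * eudaimonic + (1.0 - eudaimonic_weight) * hedonic
```

With this weighting, a person high in purpose but low in momentary happiness scores slightly above a person with the reverse profile.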

The challenge of measuring wellbeing at scale is that wellbeing is a fuzzy and multifaceted concept. Currently, there exists no device to measure hedonia, let alone eudaimonia. The closest approximation to such a device is a simple survey (e.g., Satisfaction With Life Scale; Diener et al., 1985), but as van der Maden et al. (2023) argue: “‘off-the-shelf’ wellbeing assessment instruments may not be readily applicable in the context of novel technologies”. Survey-based measures of wellbeing are likely not agile enough to adapt to the rate at which AI is progressing, and they are largely divorced from the context that informs wellbeing (van der Maden et al., 2023). A context-sensitive understanding of wellbeing will be crucial in guiding the alignment of artificial intelligence and systems-level superintelligence.

In response to the challenges associated with existing wellbeing assessment instruments, UpBeing has taken a first principles approach to thinking about what wellbeing actually is: a summation of the collective valence of individuals' experiences in both macro and micro senses. This excerpt from Robinson Crusoe summarizes the idea well:

“... while my ink lasted, I kept things very exact… very impartially like debtor and creditor, the comforts I enjoyed, against the miseries I suffered.” Daniel Defoe, Robinson Crusoe

Through the UpBeing application, we allow users to connect as much data as they are comfortable with sharing in an effort to form as close to a complete picture of their experience as possible. This data, which we refer to as experiential data, comes from third-party passive data sources, including wearable devices, e-calendars, Spotify, phone usage, transaction data, and browser activity, as well as localized weather and news sentiment data. Once connected, UpBeing can collect significant amounts of passive data that are then effectively labelled in terms of hedonic and eudaimonic wellbeing by the person whose experience we are capturing, through lightweight mood, productivity, and motivation check-ins coupled with longer-term life satisfaction check-ins. In this way, the gap between behavioural methods of wellbeing assessment and self-reporting methods is effectively bridged. From this, users gain deep insights into the drivers of their own wellbeing and the wellbeing of their loved ones, while their data serves as the training basis for a model capable of inferring wellbeing from passively recorded signals.
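The labelling step described above can be sketched as a nearest-in-time join between passive samples and check-ins. This is an illustration only: the data shapes, the field names, and the two-hour matching window are assumptions made for the example and do not describe UpBeing's actual pipeline.

```python
from bisect import bisect_left

def label_passive_samples(samples, checkins, max_gap_minutes=120):
    """Attach the nearest check-in's mood score to each passive sample.

    samples:  list of (minute_of_day, features) tuples
    checkins: list of (minute_of_day, mood) tuples, sorted by time
    Samples with no check-in within max_gap_minutes stay unlabelled.
    """
    times = [t for t, _ in checkins]
    labelled = []
    for t, features in samples:
        i = bisect_left(times, t)
        # Only the check-ins just before and just after t can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda j: abs(times[j] - t))
        if abs(times[nearest] - t) <= max_gap_minutes:
            labelled.append((features, checkins[nearest][1]))
    return labelled

# Hypothetical day: check-ins at 09:00 (mood 4) and 18:00 (mood 2).
checkins = [(540, 4), (1080, 2)]
samples = [
    (530, {"heart_rate": 62}),
    (800, {"heart_rate": 75}),   # no check-in within two hours
    (1100, {"heart_rate": 88}),
]
training_rows = label_passive_samples(samples, checkins)
```

Each resulting row pairs passive features with a self-reported label, which is the form a supervised model needs in order to later infer wellbeing from the passive signals alone.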

As our model develops, it will allow wellbeing to be predicted from any number of external data sources. This will allow more generalized models that have access to similar data sources (e.g., the weather, a user’s heart rate, location, etc.) to perform similar inferences, which will allow them to understand current micro and macro impacts on wellbeing as well as to predict future wellbeing. As generalized foundation models approach genuine AGI, they will be able to understand the long-term and aggregate effects of their decision-making and subsequent action paths on both the likelihood of achieving their explicit objective and the wellbeing of the people affected by that action path.

The Hedonometer as an Example

To understand how such a model could be used, we can turn to a simple example, the hedonometer, developed by Peter Dodds and Chris Danforth at the University of Vermont. The hedonometer examines net sentiment across Twitter over time to track the general pulse of how the world feels on any given day. The hedonometer exhibits peaks and valleys over time that align with common perceptions of positive and negative events in the world, which lends it at least an anecdotal level of validity. UpBeing performs similar analyses, but instead of looking at the net sentiment of what is expressed on Twitter, we can look at individualized sentiments (which can be rolled up in aggregate) at all times, provided that we have access to a data source that has been integrated into UpBeing.

*Hedonometer showing net sentiment across Twitter from late 2020 to early 2021.*
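A hedonometer-style score can be sketched as a frequency-weighted average of per-word happiness ratings, skipping words near the neutral midpoint. The tiny lexicon, its scores on a 1 to 9 scale, and the neutral band below are hypothetical values chosen for illustration; they are not the hedonometer's actual word list or settings.

```python
from collections import Counter

# Hypothetical word ratings on a 1 (sad) to 9 (happy) scale.
HAPPINESS = {"love": 8.4, "happy": 8.3, "war": 1.8, "sad": 2.4, "the": 5.0}

def average_happiness(text, lexicon=HAPPINESS, neutral_band=(4.0, 6.0)):
    """Frequency-weighted mean happiness of the rated, non-neutral words in text.

    Returns None when the text contains no such words.
    """
    counts = Counter(text.lower().split())
    total = 0.0
    rated = 0
    for word, freq in counts.items():
        score = lexicon.get(word)
        if score is None or neutral_band[0] < score < neutral_band[1]:
            continue  # unrated or emotionally neutral word
        total += score * freq
        rated += freq
    return total / rated if rated else None
```

Computing this score over each day's text and plotting it over time gives the kind of peaks and valleys described above. The individualized equivalent would replace the word lexicon with a model over a person's passive data.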

Imagine, for example, if the net sentiment from a tool like the hedonometer were the target that we pointed a super-intelligent AGI toward. This could empower general systems-level intelligence to prioritize human wellbeing at scale. It could look at low points such as the invasion of Ukraine or mass shootings and coordinate the operations of systems in a manner that avoids the elements that make those situations bad and the conditions that lead to them in the first place. Similarly, it could identify positive events and situations (e.g., Christmas Day), and optimize for the elements of those situations that make them positive.

UpBeing’s experience-to-emotion model represents an opportunity both to measure wellbeing at scale and to turn alignment into a measurable outcome variable. We will be able to infer anxiety levels from the structure of a calendar or assess whether a collection of heartbeats is signaling a positive or negative sentiment in a particular area. We can examine long-term weather patterns and assess their impact on the eudaimonia of a population. Importantly, these inferences can all be made without any direct input from a user.

Future Research

The solutions proposed herein are not meant to be final or definitive. There are still significant unknowns standing between this proposition and the successful implementation of wellbeing as a metric.

Do we have an internally valid measure of wellbeing that measures what we think we’re measuring?
Do we have an externally valid measure of wellbeing that can extend to other cultures and people?
How do we incorporate wellbeing into the goals of a model with agency?
How do we incorporate wellbeing into the training process of future models?
Will a genuine AGI simply be able to ignore any imperatives that we define for it, even if those imperatives are fully ingrained in its training? For instance, many people lose their faith despite their religious upbringing.
What combination of measures can be paired with human wellbeing to ensure alignment?
How do we ensure that this does not usher in an emotional surveillance state?

These are all questions that need to be answered. While some of these are being addressed by ongoing research at UpBeing and our research partners, some remain to be explored. However, given the pace of advancement in AI capabilities, it seems feasible that an AI alignment method based on similar models should be able to advance at a similar rate.

Building for Wellbeing

When social media hit the market, it was intended to promote connection. People rushed to the new technology and the effects on global wellbeing have been catastrophic. Despite a plethora of evidence, we’ve struggled to constrain the harmful impacts of this technology. With social media, humans are still in control – we continue making the same mistakes, but ultimately we control the off switch. The same cannot definitively be said for AI. As AI competency increases, profit incentives will demand that AI agency does as well. This will yield systems whose functions, intentions, and consequences are nearly impossible for humans to grasp. Before weaving AI into the fabric of our digital world, it is essential that we develop methodologies and metrics that will ensure present and future alignment.

UpBeing is developing the reward function for the natural world. We believe that this represents a critical first step towards the realization of embodied alignment – an approach that will ensure that human values are so fundamental to any future AI system that misalignment will be unthinkable, even to a super-intelligent AGI.


