Full-Stack Alignment Paper / Idea from Edelman, Lowe et al. at the Meaning Alignment Institute

Paper

Archived copy here:
full-stack-alignment-paper.pdf (974.7 KB)

Hot-take

I have a lot of time for Edelman et al. and really liked the values-and-alignment paper with its wisdom graph elicitation. That said, I’m a bit dubious about the mileage we’re going to get on AI alignment from this work (though e.g. the WGE stuff is really interesting irrespective of AI - it is interesting for human group alignment, etc.). I half-wonder whether they are using all of the excitement/concern around AI to fund/resource a bunch of really interesting other work.


Thanks for sharing this! All this maps rather well to my upcoming research topic session on Sept. 19.

The model I favor is very much human in the loop. I’d prefer to see embodied humans as the ultimate arbiters of values (or if we prefer to cast some version of divinity in that ultimate arbiter role, then divine will as specifically communicated to humans). That said, though, getting AI better and better aligned with humanistic values seems the challenge of the moment. The technical ideas in this article are well worth considering in that light.

Broadly in favour of the (sociotechnical) paradigm MAI are using. Two issues I had with their two papers:

Moral Graphs: seems to be simply eliciting the existing distributional landscape of values as the source of an alignment target (i.e. democratic legitimacy). But what if people have the wrong values and, worse, cannot recognise this (i.e. something like a stealthy, mass-psychological vice)?

Full stack: primarily pertains to vertical alignment, from individuals up to recommender systems, companies, and government regulators. It seems to be missing horizontal alignment (full-level?) between micro-agents and between macro-agents (i.e. across stacks).

______________________________________________________

I’m also critical of the ‘Pluralistic Alignment’ wave that MAI, Plurality Institute, Foresight, Forethought and many other research orgs in the space are a part of. I’ve written some thoughts up on this here: The Dark Side of Meta-Utopia

“This can be done without imposing a particular vision of individual or collective flourishing” (Full-Stack)

“the better futures perspective doesn’t mean endorsing some narrow conception of an ideal future… viatopia is a state of society… where many different moral points of view can flourish” (Better Futures)

“democracy and collaborative technology can power infinite diversity in infinite combinations” (Plurality)


Drilling into the notes on these papers, the work of Charles Taylor on values jumped out at me. I’m currently rereading Taylor’s Sources of the Self. In many ways, current questions around AI operator intent replicate issues that surfaced with the advent of utilitarianism in the high Enlightenment 2.5 centuries ago. Namely, why would maximizing personal utility maximize for larger social goods as well? It’s not at all self-evident that this should be the case. Keen as the early utilitarians were to distance themselves from religion, metaphysics, or any sort of ontological values hierarchy, pure utility - based on a personal pleasure/pain calculus - did not convincingly provide an account of why such mechanisms would promote a more generalized good. So even the early utilitarians smuggled in what Taylor calls other moral sources, sources which in many cases echoed prior ancient or Christian thinking on the nature of the good. Current efforts to create moral maximization functions for AI do not strike me as having untangled this specific knot or as having resolved the underlying difficulties any better.

Because the Taylor article from 1977 and the book from 1995 cited here were not readily available to me online, I pasted an AI summary of Taylor’s thinking below, to provide context for anyone here who wants more background on the social theory behind “strong evaluation”.

Quoted Summary:

For Charles Taylor, human agency is characterized by being able to make qualitative distinctions about life, the ability to evaluate one’s own feelings and desires, and the capacity to define one’s identity and life within a shared moral framework that arises from dialogue with others. It involves a self-interpretation of actions, emotions, and preferences, and is deeply intertwined with one’s sense of morality, spirituality, and the meaning of life, distinguishing humans from other agents.

Here’s a breakdown of Taylor’s view on human agency:

Strong Evaluation:

Taylor argues that human agents are distinguished by their ability to make “strong evaluations,” meaning they can distinguish between different kinds of goods and make qualitative judgments about what is valuable or good in life. This involves understanding not just what one wants, but also what is truly worthwhile.

The Role of Identity:

Human agency is intimately tied to the concept of the self and identity. Our identity is shaped by our commitments and our “moral framework,” which provides the horizon within which we understand ourselves and our place in the world.

Dialogue and the “We”:

Taylor emphasizes that identity and self-understanding are not formed in isolation. They are developed through “dialogue”—both overt and internal—with others. This includes social interactions that provide the language and concepts to define ourselves.

Self-Interpretation:

Humans interpret their own emotions, feelings, preferences, and actions. This self-interpretation is a key aspect of being a responsible agent, as it shapes how one understands one’s own life and responses.

A Moral Horizon:

For Taylor, human agency operates within a moral framework that provides “strong qualitative distinctions”. This framework allows for questions about good and bad, meaningful and trivial pursuits, which are essential for defining one’s identity and taking a stand in the world.

Indeed, what if? As an opening stab at this: how might one recognize that people have the wrong values? Are we not also people? How do we know our own values are not also wrong?

Referring back to the work of Charles Taylor (my summary here is extremely compressed), some influential strategies in the history of the West include:

  • right reason is aligned with the natural order (e.g. Plato and Aristotle). Values that align with that order are the better values.
  • human reason is disordered and requires divine guidance toward higher values (Augustine). The pathway to true values requires an encounter with the divine through an inner journey.
  • order in nature is something we construct through the activity of our disengaged reason (Descartes, Locke). Understanding of the divine and transcendent moral order can also be obtained through right reason, although nature itself is to be engaged objectively, not through notions of providence or a great chain of being.
  • human reason itself is the measure of all things (high Enlightenment)

There are of course many other schools of thought that could be cited. These are just a few outstanding signposts for discussion purposes. One key insight I gained from Taylor was the critical role Augustine played in the transition from ancient philosophy to Descartes and further rationalism. Namely, Augustine introduces the dimension of human subjectivity or inwardness. Although Augustine and religious thinkers after him use that as a jumping off point for faith, sin, redemption, etc., later more secular thinkers (Descartes is transitional in this regard) focus more on the centrality of human subjectivity itself.

OK, so what’s all that got to do with AI? Garbage in, garbage out, essentially. It’s hard to imagine AI becoming a literal deus ex machina that is any better than the humans who train it. There are of course many different views on how we might become better humans (a small sample of such differences appears in the short list above). Generally speaking, it strikes me that taking on questions of world order, politics, and human values alignment for survival and/or flourishing is its own project requiring its own attention, prior to any consideration of AI. Once we are clear on where we might like to be going, then perhaps AI can help take us there.

As an initial stab at this sort of thing, I just now ran the session below in Copilot. I’d be very curious what others think about this result.

======

On a percentage basis, with 0 being incompatible and 100 being completely compatible, how compatible are the values of freedom and equality?

Copilot said:

The compatibility of freedom and equality depends heavily on how each is defined and prioritized within a society. On a percentage basis, here’s a thoughtful estimate:


Estimated Compatibility: ~70%


Why Not 100%?

  • Freedom often emphasizes individual autonomy, choice, and minimal interference.

  • Equality focuses on fairness, equal access, and reducing disparities.

These values can conflict when:

  • Freedom of economic action leads to wealth inequality.

  • Freedom of speech allows views that challenge equality.

  • Equality policies (e.g., redistribution) may limit certain freedoms (e.g., taxation affecting property rights).


Why Not 0%?

  • Both values are foundational to democratic societies.

  • They reinforce each other in many contexts:

    • Equal rights enhance personal freedom.

    • Freedom from discrimination promotes equality.

  • Many modern constitutions and legal systems strive to balance the two.


Key Factors Affecting Compatibility

Factor                 | Increases Compatibility     | Decreases Compatibility
Definition of Equality | Equality of opportunity     | Equality of outcome
Definition of Freedom  | Freedom with responsibility | Absolute freedom
Political System       | Liberal democracy           | Authoritarianism or laissez-faire extremes
Economic Model         | Mixed economy               | Pure capitalism or socialism

Indeed. There are two parts (at least) of wisdom:

  • Valueception: perceiving the goods (or, more accurately, the grammar of the good, i.e. how it manifests in the form of more specific values)
  • Discernment: working out what will move towards or produce those good(s)


From Wisdom: a practical capacity to choose a good path

And collective wisdom has a third part:

  • “Aggregating” individual valueception and discernment into collective value and action.

The WGE is really about the last part.

If people have poor valueception then aggregation won’t help.
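
To make that last point concrete, here is a minimal toy sketch in Python (my own illustration, not anything from the MAI papers, and the numbers are made up): averaging many individual value judgments washes out idiosyncratic noise, but a mis-perception that everyone shares passes straight through the aggregation step.

```python
# Toy model: each person's perception of some good's true worth is
# (true value) + (bias shared by everyone) + (their own random noise).
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0    # hypothetical "true" worth of the good
shared_bias = -0.8  # systematic mis-perception common to all participants
noise_sd = 0.5      # idiosyncratic error per person

for n_people in (10, 1_000, 100_000):
    perceptions = true_value + shared_bias + rng.normal(0.0, noise_sd, n_people)
    aggregate = perceptions.mean()  # the simplest possible aggregation rule
    print(f"n={n_people:>7}: aggregate = {aggregate:+.3f} (true value {true_value:+.1f})")

# As n grows the aggregate converges, but to true_value + shared_bias,
# not to true_value: more or better aggregation cannot repair poor valueception.
```

In this toy model, at least, swapping in a fancier rule (a median, say) changes how the idiosyncratic noise is handled, not the shared bias.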


This reminded me of how Sam Harris distinguishes between the three projects of a ‘science of morality’: the descriptive, the normative and the memetic.

I think we should differentiate three projects that seem to me to be easily conflated, but which are distinct and independently worthy endeavors.

[DESCRIPTIVE] The first project is to understand what people do in the name of “morality.” We can look at the world, witnessing all of the diverse behaviors, rules, cultural artifacts, and morally salient emotions like empathy and disgust, and we can study how these things play out in human communities, both in our time and throughout history. We can examine all these phenomena in as nonjudgmental a way as possible and seek to understand them. We can understand them in evolutionary terms, and we can understand them in psychological and neurobiological terms, as they arise in the present. And we can call the resulting data and the entire effort a “science of morality”. This would be a purely descriptive science of the sort that I hear Jonathan Haidt advocating. For most scientists, this project seems to exhaust all the legitimate points of contact between science and morality — that is, between science and judgments of good and evil and right and wrong. But I think there are two other projects that we could concern ourselves with, which are arguably more important.

[NORMATIVE] The second project would be to actually get clearer about what we mean, and should mean, by the term “morality,” understanding how it relates to human well-being altogether, and to actually use this new discipline to think more intelligently about how to maximize human well-being. Of course, philosophers may think that this begs some of the important questions, and I’ll get back to that. But I think this is a distinct project, and it’s not purely descriptive. It’s a normative project. The question is, how can we think about moral truth in the context of science?

[MEMETIC] The third project is a project of persuasion: How can we persuade all of the people who are committed to silly and harmful things in the name of “morality” to change their commitments, to have different goals in life, and to lead better lives? I think that this third project is actually the most important project facing humanity at this point in time. It subsumes everything else we could care about — from arresting climate change, to stopping nuclear proliferation, to curing cancer, to saving the whales. Any effort that requires that we collectively get our priorities straight and marshal massive commitments of time and resources would fall within the scope of this project. To build a viable global civilization we must begin to converge on the same economic, political, and environmental goals.

[VALUECEPTION] Obviously the project of moral persuasion is very difficult — but it strikes me as especially difficult if you can’t figure out in what sense anyone could ever be right and wrong about questions of morality or about questions of human values. Understanding right and wrong in universal terms is Project Two, and that’s what I’m focused on.
