AI Alignment

AI alignment is a philosophy problem we never solved, not a technical problem.

And we’re now forced not only to solve it, but to write it in executable form, operationalize it, and commercialize it.

On a tight deadline.

Unsurprisingly .. we’re not doing well.

Moral humility

Two years ago, I wrote about moral humility [Link]. Among other points, I argued that while nobody knows for sure what is truly right and wrong, it’s almost impossible to act with moral humility, because doing so comes across as a lack of integrity.

It’s “easy” to display a lack of integrity, because it’s often so self-serving: you say “family is the most important thing for me” while prioritizing other areas of life that are more fun or engaging.

But there’s another way to display a lack of integrity, the one that comes from moral humility: if you’re not truly sure what’s right and wrong (and nobody really is), it may be right to do (or not do) something despite your current beliefs, given your moral uncertainty about the topic.

This is illustrated when someone asks you “who’s in the right in this war, country X or Y?” and you say “well, I don’t know.” While in theory some judgments can seem easy, the answer to a lot of real-life situations is “well, it’s complicated.”

William MacAskill, Krister Bykvist, and Toby Ord wrote about the lack of attention we have given to this topic in Moral Uncertainty:

Given the prevalence and importance of moral uncertainty, one would expect ethicists to have devoted considerable research effort to the topic of how one ought to make decisions in the face of moral uncertainty. But this topic has been neglected. In modern times, only one book and fewer than twenty published articles deal with the topic at length. [..]

They then go about building a utilitarian framework for making decisions, as utilitarians will, but I digress.

The point is that we, humans, are really behind when it comes to understanding ethics, and in particular ethics involving tough topics and uncertainty, which is basically everywhere AI is currently applied.

To illustrate, let’s talk about Mill’s harm principle.

But what is harm?

John Stuart Mill is considered the father of utilitarianism (if you count Bentham as the grandfather), but he’s famous for another important work: “On Liberty”, and its notable “harm principle.”

That principle is, that the sole end for which mankind are warranted, individually or collectively, in interfering with the liberty of action of any of their number, is self-protection. That the only purpose for which power can be rightfully exercised over any member of a civilized community, against his will, is to prevent harm to others. His own good, either physical or moral, is not a sufficient warrant. He cannot rightfully be compelled to do or forbear because it will be better for him to do so, because it will make him happier, because, in the opinions of others, to do so would be wise, or even right.

Mill, John Stuart. On Liberty and Other Essays (Oxford World’s Classics), pp. 13–14. Kindle edition.

On its face, the harm principle, published in 1859, is clear and hard to argue against: “leave people alone unless they’re harming you or others.”

But what, exactly, counts as harm and justifies our actions and interventions over others? Well, we’ve spent the nearly 167 years since debating that “detail” .. and gotten nowhere.

There are many challenges, especially when applied to AI. Here are a few that come to mind:

In a world where you can’t perfectly predict the future, how do you “prevent harm?” Should we use precogs to stop crimes before they happen? Well, that’s effectively what AI did when it was used to score defendants at parole hearings, as the 2020-era AI alignment books discussed below illustrate.

When does “interference” count as “harm?” If someone clicks “like” on a post, does that mean showing that post wasn’t harmful? Is it harmful to show more posts like it, even if that creates echo chambers? Well, that’s what AI did when it chose people’s Facebook feeds (with political implications and experiments on emotional impact), and that’s what it does now on X, LinkedIn, Instagram, and TikTok.

When is “harm to others” a justifiable cause, and who are the “others?” When Anthropic drew a line against the use of its AI for autonomous weapons and for domestic mass surveillance by the US Government, is domestic surveillance wrong because it is domestic, or because the surveilled are American? Would Anthropic be as ethically justified in selling that technology to Iran to surveil Americans, then?

These questions sound tough because they are tough. And they sound relevant because they are relevant. Particularly given the power and accessibility of foundation models today, and in the future.

The parole algorithm often quoted in those 2020 alignment books didn’t use much data: crime rates, parole decisions, recidivism rates. The social media algorithms use orders of magnitude more data: every video you watch and for how long, your friends, your likes. Foundation models, in turn, train on roughly all of the internet’s data, with orders of magnitude more compute (GPUs, TPUs, etc.), using the state of the art in AI: deep learning with transformers.

The scale of impact went from local decisions to daily feeds to nation states in the blink of an eye. In fact, the books I consider foundational on the topic, Life 3.0 (2017), Human Compatible (2019), The Alignment Problem (2020), and The Coming Wave (2023), all predate current foundation models being as accessible (or anywhere near as powerful) as they are today. They’re all talking about “what ifs” while we’re all here, living that reality. Today.

So let’s expand. If I can find so much to analyze and discuss in the implications of today’s human+AI ethics when compared against a single Mill paragraph from 1859, imagine how much there is to understand when weighing human+AI ethics against literally all of human values.

After all, when we say “human values” to “align” to.. what are we talking about anyway?

Nobody knows.

The impossibility of “human values”

If defining “harm” was complicated with our current philosophical frameworks, defining “human values” is impossible.

When we say “AI alignment to human values”, what do we mean? I mean, think about it. What do we mean, exactly?

And I ask exactly because we’re literally programming stochastic machines that make decisions based on heuristics we can’t define.

Should machines be consequentialist, performing any act so long as the consequences are good? Utilitarian, acting for the greatest good of the greatest number? And who defines what “good” is, then?

Or should machines be deontological, where they have rules they should never break? Should they follow Kant’s universal maxim, where they shouldn’t do what couldn’t be turned into a universal law? (never mind all its paradoxes and Parfit’s attempt to correct them, of course.)

Or should they be virtue ethicists, where they embrace virtue and reject vice? Should they have Wisdom? Courage? Justice? .. whose Justice?

Should they lie? When? Should they pull the trolley lever and kill 1 person to save 5? Should they put us in the experience machine? Pro-democracy? Pro-republic? Should they agree with Popper’s Open Society, Nozick’s Anarchy, State and Utopia, or Rawls’s Political Liberalism (the less well-known Theory of Justice sequel)?

And critically: how do they choose the answers to all these questions?

Literally. Nobody. Knows.

But the train is moving. The rails are laid out, and we’re moving forward, throwing more and more coal into the engine, partying like it’s 1870.

But who put these train rails here in the first place?

And where’s our next stop?

God’s hand

If you thought Adam Smith would get away unscathed from this post, you thought wrong — because no post about practical ethics is complete without a jab at his invisible hand. From Wikipedia:

The invisible hand is a metaphor inspired by the Scottish economist and moral philosopher Adam Smith that describes the incentives which free markets sometimes create for self-interested people to accidentally act in the public interest, even when this is not something they intended.

It’s more complicated than this (did you know The Wealth of Nations is commonly abridged for size?), and few have read his Theory of Moral Sentiments, his attempt at ethics (with mixed results).

But Smith’s premise is somewhat simple: self-interest accidentally leads to public interest.

How? Well, I’ll let the late David Graeber complete this for me:

Smith was trying to make a similar, Newtonian argument. God—or Divine Providence, as he put it—had arranged matters in such a way that our pursuit of self-interest would nonetheless, given an unfettered market, be guided “as if by an invisible hand” to promote the general welfare. Smith’s famous invisible hand was, as he says in his Theory of Moral Sentiments, the agent of Divine Providence. It was literally the hand of God.

Graeber, David. Debt: The First 5,000 Years, Updated and Expanded, p. 44. Kindle edition.

Never mind whether economic systems should or should not work in particular ways, or whether they should change when, instead of machines that produce shoes, we’re creating machines that produce software: literally God (Divine Providence) will sort it out to make sure it works out for the best.

I exaggerate, but only slightly. If an external observer from another planet hovered over Earth and asked “how are these humans governing the use of AI?” .. their conclusion may not be that different from “keep doing what you’re doing and hope for the best.”

Which is, of course, what all those authors on AI Alignment over the past decade have cautioned us about all along. It's what OpenAI was founded to address in 2015, and what Anthropic spun out of OpenAI in 2021 to pursue when they disagreed on alignment posture.

In fact, when OpenAI’s VP of Research Dario Amodei leaves to found Anthropic over alignment disagreements, when Microsoft disbands its AI ethics team in 2023, and when OpenAI dissolves its alignment team after Ilya Sutskever leaves in 2024 and again in February 2026, the conclusion to me is the same: it’s the invisible hand that’s removing these teams.

Curiously, we are doing at a global level what many of the AI practitioners do at a local level when using AI to write code:

Dangerously skip permissions

--dangerously-skip-permissions is a well-known flag you append to Claude Code when you want it to run without asking you questions.
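For instance, a run might look something like this (a minimal sketch; the prompt is illustrative, and exact flag combinations vary by Claude Code version):

claude --dangerously-skip-permissions -p "refactor the payment module and run the tests"

With the flag set, the agent edits files and executes shell commands end to end, never pausing to ask whether each action is what you intended.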

Why in the world would you want Claude to run without asking you whether what it’s about to do is aligned with what you want?

Well, because it’s fast! That’s why.

The “dangerously” in the flag name is funny because it highlights both the importance of calling out risks.. and the futility of doing so.

Yes, I want Claude to do what I want it to do, but I also want it done quickly and cheaply, because if I’m not competitive, others who are will outcompete me.

As Naval Ravikant said many years ago: “Technology democratizes consumption and monopolizes production: whoever does something best gets to do it for everyone.”

The individual developer skipping permissions has the same incentive as the startup shipping without security review, which is the same incentive as OpenAI rushing a Pentagon deal hours after Anthropic got banned for pushing back on the request to remove their guardrails.

The current incentive structure is evolutionary, and the elimination criterion is market competition.

Because the power of the market is decentralized, and because incentive structures work the way they do today, as they have for the past 120 years, we have no viable alternative besides addressing market needs and ignoring externalities.

Next steps for AI Alignment

The current situation isn’t encouraging.

It’s not that we didn’t hear the warnings.. we just didn’t heed them. Heeding them is beyond our current capabilities as a civilization.

Unfortunately, humanity just doesn’t have the institutional maturity to solve a problem of this magnitude. Not even close.

In SPQR, Mary Beard tells the story of the rise and fall of ancient Rome, and a common theme is the challenge of running a multi-continental empire on top of the technology of 0–200 C.E. It was virtually impossible.

This is a powerful story of political crisis and bloody disintegration, even told in its most skeletal form. Some of the underlying problems are obvious. The relatively small-scale political institutions of Rome, little changed since the fourth century BCE, were hardly up to governing the peninsula of Italy. They were even less capable of controlling and policing a vast empire. As we shall see, Rome relied more and more on the efforts and talents of individuals whose power, profits and rivalries threatened the very principles on which the Republic was based. And there was no backstop – not even a basic police force – to prevent political conflict from spilling over into murderous political violence in a huge metropolis of a million people by the mid first century BCE, where hunger, exploitation and gross disparities of wealth were additional catalysts to protests, riots and crime.

Beard, Mary. SPQR: A History of Ancient Rome, p. 219. Kindle edition.

It didn’t end well for them, and given the relative magnitude of the challenges and of the institutional capabilities, their limitations and ours aren’t that distinct.

Our institutions lag too far behind to manage our own technological progress. Computing and AI have advanced orders of magnitude faster than philosophy, politics, economics, and law.

Like running an empire on letters carried by horseback messengers, we use philosophy from 2,500 years ago, laws from 500 years ago, and economics from 200 years ago to align AI innovation from last week.

So I don’t have an answer for AI Alignment, but I do have a question:

How do we accelerate not AI, but the institutional capabilities that allow civilization to control it?

Without the answer to that question, I think there’s no answer to AI Alignment.

And I see no answer to that question in sight.