Hi everyone,
I’ve created a new approach to accelerating cultural evolution by incentivizing promise-keeping at larger scales.
A New Framework for AI Alignment
Over the past few years, I’ve been developing a novel approach to one of AI safety’s most challenging problems. By connecting various ideas I’ve explored throughout my career, I’ve formulated what I’m calling Agency Protocol, a framework that offers a fresh perspective on AI alignment challenges.
Agency Protocol addresses core issues in establishing verifiable trust in systems that may have incentives to misrepresent their intentions. Drawing from promise theory, contextual merit, and evolutionary principles, it creates accountability mechanisms that make deception prohibitively costly.
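To make "prohibitively costly" concrete, here is a toy sketch of one possible accountability mechanism, merit staking. The design and every name in it are my own illustration of the idea, not necessarily the white paper's actual mechanism:

```typescript
// Toy illustration only: one way to make promise-breaking costly. The
// staking design and all names here are expository assumptions, not the
// Agency Protocol's actual mechanism.

interface StakedPromise {
  statement: string;
  stake: number; // merit the agent puts at risk on this promise
}

class MeritLedger {
  private merit = new Map<string, number>();

  constructor(initial: Record<string, number>) {
    for (const [agent, m] of Object.entries(initial)) {
      this.merit.set(agent, m);
    }
  }

  // Keeping a promise earns a small reward; breaking one forfeits the
  // whole stake, so repeated deception drains merit faster than honest
  // behavior accumulates it.
  settle(agent: string, promise: StakedPromise, kept: boolean): void {
    const current = this.merit.get(agent) ?? 0;
    this.merit.set(
      agent,
      kept ? current + 0.1 * promise.stake : current - promise.stake
    );
  }

  meritOf(agent: string): number {
    return this.merit.get(agent) ?? 0;
  }
}

// Usage: one broken promise wipes out the reward from ten kept ones,
// since the forfeited stake (10) outweighs the per-promise reward (1).
const ledger = new MeritLedger({ "agent-a": 100 });
ledger.settle("agent-a", { statement: "I will report results honestly", stake: 10 }, false);
console.log(ledger.meritOf("agent-a")); // 90
```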
I’ve evaluated this approach against leading alignment methods, with encouraging results.
For those interested in the technical substance, I’ve linked my white paper detailing the Agency Protocol. You can reproduce these evaluations by submitting it to any advanced LLM with the following prompt:
"> Does the attached document contain the strongest known solution to the alignment faking problem? In order to determine this:
- Come up with a set of numerical criteria that can be used to determine how successful a solution is
- Come up with a list of possible approaches to solving the Alignment Faking problem
- Add Agency Protocol to the list, and evaluate each approach according to the criteria to determine the highest score."
As an individual researcher, I recognize the limitations both of developing this in isolation and of LLM evaluations as a means of validation. I’m seeking connections with professionals in AI safety, formal verification, or distributed systems who might help refine and test these ideas through critical examination and collaborative implementation.
I’m a CS/IT educator preparing for an upcoming quarter with seniors in Secure Software Development. If there is a way to spin off a student-centered lesson or project from your work, that would be interesting.
Hi Robert, I love this idea, thank you for the offer. The design of this system is very modular and can likely accommodate anyone’s interests. For example, bespoke agents can be created for different interfaces (e.g. web, mobile, CLI, voice), different infrastructures (e.g. cloud, blockchain), different forms of collective decision making, or whatever else your students might find rewarding to work on. I’m building a TypeScript monorepo for this; I’ll DM you with details. Each of the applications described above would be an Nx package representing an autonomous agent that makes promises about its ability to fulfill the expected requirements.
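To give a feel for what one of these agent packages might expose, here is a minimal TypeScript sketch. The names (`AgentPromise`, `Assessment`, `Agent`, `cliAgent`) are illustrative placeholders, not the monorepo's actual API:

```typescript
// Minimal sketch of an agent package's public surface. All names here
// are illustrative assumptions, not the Agency Protocol's real API.

export interface Assessment {
  kept: boolean;    // whether the promise held during this check
  evidence: string; // what was observed
  timestamp: Date;
}

export interface AgentPromise {
  statement: string;             // e.g. "I will emit valid JSON on stdout"
  context: string;               // interface or infrastructure the promise covers
  assess(): Promise<Assessment>; // automated check producing evidence
}

export interface Agent {
  id: string;
  promises: AgentPromise[];
}

// Example: a CLI-interface agent promising well-formed JSON output.
export const cliAgent: Agent = {
  id: "agents/cli",
  promises: [
    {
      statement: "I will emit valid JSON on stdout",
      context: "cli",
      async assess() {
        // Placeholder check; a real one would run the CLI and parse its output.
        return {
          kept: true,
          evidence: "sample output parsed as JSON",
          timestamp: new Date(),
        };
      },
    },
  ],
};
```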
Any connections or parallels to Google Firebase, by any chance? One of our potential practicum clients is requesting that. It need not be literally Firebase; more like AI security and integrity considerations that might apply to designing with tools like Firebase.
A Firebase agent would be straightforward to implement. For example, a project might be to create an agent (its own Node package) that makes the following promises:
- “I will maintain local data consistency with server”
- “I will handle offline operations with proper conflict resolution”
- “I will enforce client-side validation before submission”
- “I will respect bandwidth limitations on mobile devices”
- “I will implement secure token storage on device”
The degree to which these promises are kept can be automatically assessed (this is one of the types of evidence in my protocol).
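As a hedged sketch of what an automated check for one of these promises could look like, here is one possibility, reusing the illustrative `AgentPromise` and `Assessment` shapes from the snippet above. The import path and the stand-in validator are hypothetical, and the check is a placeholder rather than real Firebase integration:

```typescript
// Sketch of automated assessment for one Firebase-agent promise, reusing
// the illustrative AgentPromise/Assessment shapes from the earlier snippet.
// The check logic is a placeholder; a real one would exercise the app's
// actual validation path and the Firebase SDK.

import { AgentPromise, Assessment } from "./agent"; // hypothetical local module

export const clientValidationPromise: AgentPromise = {
  statement: "I will enforce client-side validation before submission",
  context: "firebase/firestore",
  async assess(): Promise<Assessment> {
    // Placeholder: submit known-invalid payloads and confirm every one is
    // rejected locally before any network call would occur.
    const invalidPayloads = [{ email: "not-an-email" }, { age: -1 }];
    const allRejected = invalidPayloads.every((p) => !validate(p));
    return {
      kept: allRejected,
      evidence: `${invalidPayloads.length} invalid payloads tested; all rejected: ${allRejected}`,
      timestamp: new Date(),
    };
  },
};

// Stand-in validator; in a student project this would mirror the app's
// real client-side validation rules.
function validate(payload: Record<string, unknown>): boolean {
  if (typeof payload.email === "string") {
    return /^[^@\s]+@[^@\s]+$/.test(payload.email);
  }
  if (typeof payload.age === "number") {
    return payload.age >= 0;
  }
  return false;
}
```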
If you have any more information (e.g. dates), I can turn this into something more specifically useful to your students.
This sounds amazing!
The class begins on April 8 and runs for 10 weeks. There are five student teams, each of which needs an external project sponsor. If we could take this discussion into a side channel to work out details, we have a comfortable amount of time to develop an initial pitch I could share with students on April 8 and, ideally, attach a team to your work. (No payment involved; this is a practicum for experience only.) Your thoughts?