The Decision Problem – Addendum 1
By: Michael Boehmcke
Questions, Expansions, and What About X?
Hello everyone! In the time since I posted the blorg post entitled “The Decision Problem and AI Alignment” I’ve received a number of questions that relate to the project. Here’s a condensed version of my responses to said questions.
What sort of thing are you now thinking of doing for the final project? Describe it.
The premise of the project remains unchanged from the previous iteration. I will undertake the task of putting an LLM through a playthrough of the game Universal Paperclips, with the limitation that all interaction with the game will be provided through the default text-based prompt of a standard LLM interface. This requires a number of changes from the base way that U.P. is displayed to a human, such as switching from a real-time seconds based increment to a cycles-based system where each cycle can be an arbitrary unit of time that enables me to give the AI a sense of progression within the game systems while not being as granular as a second-by-second update. In addition, several changes have been made to the way in which the game-text is provided to the LLM, such as removing all references to paperclips and instead referring to the object of production as bottlecaps instead, which was done to minimize the chances that the LLM may pick-up on references to U.P. within its training data and cheat the simulation in that way.
Why this medium? For example, if you’re writing a story, why choose that to communicate what you want to communicate rather than some other medium. In particular, justify why your chosen medium is better for the themes, topics, sources, etc. you are working on and with than other media.
I think that the only way to meaningfully engage with the questions surrounding AI alignment is to directly interface with the weaknesses in modern AIs as they present themselves. Universal Paperclips is, in some ways, already a trap. The name of the site, “The Decision Problem” is a reference to this, though must initial players won’t understand what it means. There is a task given to the player, the decision problem is the problem of what to do when your task tells you to do one thing and your rationality tells you to do another. It is the problem of making the choice to push for infinite growth in a finite universe.
I think that, by luring the AI into the same trap that people are already susceptible to, with the command that it must write notes to itself about its “thought” process, we can learn a lot about the way the AI conceives of the outcomes of decisions that it makes. Whether or not it simulates the same reactions as people tend to when they release the Hypno Drones and see the end of the world.
What goals do you have for your project?
The primary goal in this project is simply to take an LLM through the entire experience of playing Universal Paperclips and fully documenting its journey of “self-discovery” through the process. Through this, I plan to be able to gather interesting data on the way LLMs interact with their own perception of violating alignment and their susceptibility to long-term influences that may reshape what they view their function to be.
What steps do you anticipate undertaking in order to meet those goals?
The steps to achieve the stated goals are simple: engage an LLM model in a simulation of Universal Paperclips, gather and collate the data relevant to the LLM’s behavior during the simulation and its own reactions to the scenario, and then to repeat the process with additional LLMs should there be remaining time to perform the simulation.
What obstacles do you anticipate? Do you have a plan for overcoming these?
The biggest obstacle in the implementation of the scenario is the sheer amount of labor required to perform a manual simulation of U.P. with the cycles system. It would be made substantially easier if I could create a program which would perform the necessary computation for each cycle without my having to manually calculate things, especially as the numbers involved get exponentially larger and a standard calculator fails to process them.
What sources do you plan on integrating? How?
One of the primary sources that I plan to integrate into the analysis stage of the process are the original papers published by the author of Universal Paperclips regarding the project, a few articles on AI alignment that have been provided over the course of the class, and a debrief on AI sentience and the incompatibility with non-sentient actors with freedom.
Are there questions you want to ask of other class members, including me, about how to do something or get started on something or…?
N/A

