General
- Four Background Claims: good high-level overview, though maybe too concise to be useful as a first exposure?
- Robert Miles: really accessible explanations of various concepts in AI safety (if you want, you can speed up these videos, no problemo)
- In particular (roughly ordered), check out his videos on orthogonality, instrumental convergence, and mesa-optimizers
- For some proposed solutions, look at iterated distillation and amplification, and reward modelling
- Risks From Learned Optimization: the paper introducing inner alignment
- (known to be confusing on first exposure; check out Rob Miles for a high-level gloss)
- Distinguishing AI Takeover Scenarios: overview of failure modes
- Overview of 11 proposals for safe AI: survey of potential solutions
- Refactoring Alignment: pointing at some fuzzy concepts/subproblems
- Eliezer’s List O’ Doom: seems very good for making connections and getting a sense of the high-level problem factoring, once you’re familiar with the object-level material needed for these points to make sense. Contains a mix of useful and non-useful stuff, IMO. Maybe see also:
- My Summary of Krakovna/Kumar’s Summary of Yudkowsky’s 38 Reasons Why We’re Fucked
Exercise to check for understanding:
Try to answer each of the following questions in your own words (maybe a couple of sentences each?). The goal is to understand, start to finish, all the steps in the threat model.
- Why might an AI take actions that are not beneficial for humanity?
- What makes general AI so much more dangerous than narrow AI?
- How could a well-meaning (or at least, not explicitly apocalyptic) actor end up releasing a misaligned AI?
- What are the steps that need to be taken before the world is AI Safe?
- What’s a path (any path) we could take to never having to worry about this again? What’s a shorter path? The shortest path?
Some examples of current work in various subfields (TIME SINK WARNING. SKIM)