Key Takeaways
- One-shot thesis (A1)
- No unilateral avoidance (A2, A3, A16)
- No safe pivotal acts (A4, A13, A16, A20)
- Inner alignment hard (A5, A7, A8, A9, A17, A25)
- Interpretability hard (A5, A9, A15, A16, A17, A18, A28)
- New alignment failure modes emerge with increasing intelligence (A6, A26, A32, A33)
- Outer alignment hard (A10, A14, A19, A27)
- Generalizing capabilities is easier than generalizing alignment (A11, A12, A21)
- Superhuman AGI is scary (A22, A23, A24, A25)
- Multipolar schemes are dangerous (A30, A31)
- Civilizational inadequacy (A35, A36, A37, A38)
- Misc: inner misalignment of language models specifically (A29)
Point-By-Point Summaries
- A1: we only get one chance to align an agent with the ability to kill us; if we fail, we will be killed and will not be able to try again (the One-Shot Thesis).
- A2: someone will build AGI eventually; we (people concerned about alignment) can't unilaterally avoid the problem by not building it ourselves.
- A3: someone will build AGI strong enough to be dangerous; we can't unilaterally stick to safe systems.
- A4: no pivotal act (one that prevents dangerous agents from coming online until we've solved alignment) can be performed by an agent that is not itself powerful enough to kill us.