The alternative text for this image may have been generated using AI. Previous works on finetuning safety largely target misuse-related finetuning attacks that make models comply with harmful requests ...
Ideally, artificial intelligence agents aim to help humans, but what does that mean when humans want conflicting things? My colleagues and I have come up with a way to measure the alignment of the ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results