Anyone know how to get access to these “evil” models?
I’d like to see similar testing done comparing models where the “misaligned” data is present during training, as opposed to fine-tuning. That would be a much harder thing to pull off, though.
It isn’t exactly what you’re looking for, but you may find this interesting. It gives a bit of insight into the relationship between pretraining and fine-tuning: https://arxiv.org/pdf/2503.10965