Robust LLM safeguarding via refusal feature adversarial training
Jan 1, 2024 · Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda
Publication: arXiv preprint arXiv:2409.20089