Robust LLM safeguarding via refusal feature adversarial training
Jan 1, 2024 · Lei Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda
Publication: arXiv preprint arXiv:2409.20089