Publications

(2025). The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction.
(2025). Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations. arXiv preprint arXiv:2503.14477.
(2024). Robust LLM Safeguarding via Refusal Feature Adversarial Training. arXiv preprint arXiv:2409.20089.
(2024). Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations. arXiv preprint arXiv:2403.18167.
(2024). Mechanisms of Non-Factual Hallucinations in Language Models. arXiv e-prints.
(2024). Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces. arXiv preprint arXiv:2406.11614.
(2024). Geometric Signatures of Compositionality in Language Models. NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths Forward.
(2024). Geometric Signatures of Compositionality Across a Language Model's Lifetime. arXiv preprint arXiv:2410.01444.
(2024). Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning. arXiv preprint arXiv:2407.03779.