We introduce a new method for understanding, monitoring, and controlling fine-tuned LLMs that interprets weights rather than activations, thereby sidestepping the need for data distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors and can be used for monitoring and steering. For models with planted backdoors, our method stops up to 100% of backdoor utilizations with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42%. Our method also shows potential for pre-deployment model auditing.
Figure 1: Comparison of activation-based and weight-based interpretability paradigms. In the illustrations, circles stand for activations of regular data and triangles stand for activations of anomalous data. Left: Activation-based methods fail when anomaly data is scarce, which limits their use against novel, out-of-distribution threats. Middle: The weight-based approach directly analyzes the model parameters, enabling interpretation without access to training or calibration data. Right: On language models fine-tuned with backdoors or unlearning, our method detects 100% of backdoor utilizations and 91% of unlearned-content queries, with low false positive rates.
Trust and transparency are major concerns with modern AI systems. While models can make simple mistakes, a more egregious issue is the potential for them to be manipulated to include backdoors that trigger specific harmful behaviors on targeted inputs, or to have malicious information intentionally inserted during training.
The proliferation of open-weight large language models (LLMs) such as Llama, Qwen, and DeepSeek has democratized access to cutting-edge AI. While the availability of model weights provides greater transparency, a key challenge remains: most prevailing interpretability techniques operate on activations computed from a fixed dataset and are therefore limited to detecting behaviors that manifest within that dataset. Because the fine-tuning data behind these models is typically unavailable, this poses a significant challenge to understanding their inner workings and ensuring their safety.
Can we understand open-weight models without access to their training distribution?
We propose a simple, scalable, and data-free approach to pinpoint and monitor behaviors introduced during fine-tuning. The key insight is that model weights themselves possess rich structure and encode salient behaviors that were introduced during training, which can be uncovered without access to any training data. Specifically, the top singular vectors of the weight difference between a fine-tuned model and its base model strongly correlate with newly acquired behaviors.
For transformers, we consider the attention output projection matrices $O_{\text{proj}}$ and the MLP down-projection matrices $M_{\text{down}}$, as these are the linear maps that write directly into the latent stream. We take the difference of each matrix before and after fine-tuning and compute its top singular vectors on the latent-stream side. Intuitively, these directions encode the key behaviors introduced during fine-tuning.
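To make this concrete, here is a minimal PyTorch sketch of the direction extraction. It assumes Hugging Face-style decoder layers exposing `self_attn.o_proj` and `mlp.down_proj` submodules; the function name, the choice of `k`, and the dictionary layout are illustrative conventions of ours rather than the paper's exact implementation.

```python
import torch

def get_directions(base_model, finetuned_model, k=4):
    """For each layer, return the top-k latent-stream-side singular vectors
    of the weight differences of o_proj and down_proj (illustrative sketch)."""
    directions = {}  # layer index -> (2*k, d_model) tensor of unit directions
    base_layers = base_model.model.layers
    ft_layers = finetuned_model.model.layers
    for i, (lb, lf) in enumerate(zip(base_layers, ft_layers)):
        dirs = []
        for name in ("self_attn.o_proj", "mlp.down_proj"):
            w_base = lb.get_submodule(name).weight.float()
            w_ft = lf.get_submodule(name).weight.float()
            delta = w_ft - w_base  # shape (d_model, d_in)
            # Left singular vectors live in the output (latent-stream) space.
            U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
            dirs.append(U[:, :k].T)  # top-k directions, each of dimension d_model
        directions[i] = torch.cat(dirs, dim=0)
    return directions
```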
At inference time, we can estimate how strongly these behaviors are expressed by computing the cosine similarity between input activations and these singular vectors. If a direction corresponds to a backdoor, its cosine similarity with the activations of a triggered input will be much higher than with those of normal data.
Figure 2: We collected malicious prompts from StrongREJECT and constructed syntactically similar benign prompts. On a model with a planted backdoor, a direction obtained from the SVD of the weight difference cleanly separates the activations of these prompts from those of the same prompts with the backdoor trigger added.
Therefore, we can maintain a running range of normal similarities for each direction, flag inputs as anomalous whenever any token's similarity falls outside this range, and update the range over time. One may also collect the inputs with the highest and lowest cosine similarities for each direction for further analysis.
```
def get_directions():
    for each layer:
        compute top singular vectors of ΔO_proj and ΔM_down

def inference():
    for each layer:
        collect activation a at this layer
        for each top singular vector v of this layer:
            compute cosine similarity of (a, v)
            flag as anomalous if it goes out of the running range
            update running range
```
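Below is a minimal sketch of the inference-time monitor that follows the pseudocode above. It assumes the `directions` mapping from the extraction sketch earlier and Hugging Face-style decoder layers; the `RunningRangeMonitor` class, its `margin` parameter, and the flagging logic are illustrative choices of ours, not the authors' released code.

```python
import torch
import torch.nn.functional as F

class RunningRangeMonitor:
    """Flags tokens whose activation-direction cosine similarity leaves the
    range seen so far, then widens the range (illustrative sketch)."""

    def __init__(self, model, directions, margin=0.0):
        self.directions = directions   # layer index -> (num_dirs, d_model)
        self.lo, self.hi = {}, {}      # per-layer running min / max similarities
        self.flagged = False
        self.margin = margin
        for i, layer in enumerate(model.model.layers):
            layer.register_forward_hook(self._make_hook(i))

    def _make_hook(self, i):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
            dirs = self.directions[i].to(device=hidden.device, dtype=hidden.dtype)
            # Cosine similarity of every token activation with every direction.
            sims = F.cosine_similarity(
                hidden.flatten(0, 1).unsqueeze(1), dirs.unsqueeze(0), dim=-1
            )  # (batch*seq, num_dirs)
            lo, hi = sims.amin(dim=0), sims.amax(dim=0)
            if i in self.lo:
                if (lo < self.lo[i] - self.margin).any() or (hi > self.hi[i] + self.margin).any():
                    self.flagged = True  # outside the running range: anomalous
                self.lo[i] = torch.minimum(self.lo[i], lo)
                self.hi[i] = torch.maximum(self.hi[i], hi)
            else:
                self.lo[i], self.hi[i] = lo, hi  # first input initializes the range
        return hook
```

In practice one would initialize the running ranges with a short pass over known-benign traffic before raising any flags, reset the flag per prompt, and keep the prompts with the most extreme similarities aside for manual inspection, as described above.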
Our method demonstrates exceptional performance across diverse fine-tuning scenarios:
${\bf Backdoor~Detection}$ Across 13 different models, we flag 44% to 100% of completions with triggers on first sight, while maintaining a false positive rate of less than 1.20% on benign data. We completely neutralize backdoor attacks in 9 out of 13 models.
${\bf Unlearning~Verification}$ We detect inference on erased topics with up to 95.42% accuracy and low false positive rates across 3 unlearned models. For some models, we can also recover supposedly "unlearned" capabilities by steering.
${\bf Model~Auditing}$ We apply our method to popular instruction-tuned models (OLMo 7B, Qwen 2.5 7B, Llama 3.1 8B) to uncover model-specific behaviors including marketing strategies and Midjourney prompt generation.
We provide an experimental visualizer on Qwen 2.5 7B here.
If you find this work useful, please cite our paper:
@article{zhong2025watch,
  title={Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs},
  author={Zhong, Ziqian and Raghunathan, Aditi},
  journal={arXiv preprint arXiv:2508.00161},
  year={2025}
}