Distilled, but Dangerous? Assessing the Safety of Models Derived from DeepSeek-R1
Feb 19, 2025 | 5 min read
Introduction
DeepSeek recently released a number of open source model versions distilled from its flagship DeepSeek-R1. Modern large language models (LLMs) like GPT-4, Claude Sonnet, and DeepSeek-R1 have hundreds of billions of parameters, making them highly capable reasoners and adaptable to a wide range of tasks. However, this comes at the cost of requiring massive hardware infrastructure, which translates to high API costs for both consumers and researchers.
A natural solution to this problem is to use a smaller model trained on the same data. While this approach might seem reasonable, simple fine-tuning relies on hard labels, limiting the model's ability to capture deep contextual relationships. In contrast, soft labels (probabilistic outputs from a larger teacher model) encode richer information, leading to better generalization. This helps the smaller model capture multiple modes within the output distribution, preserving the knowledge of the larger teacher model more effectively.
In this blog, we will first explore the Teacher-Student model, then examine the training methodology of distilled models. Finally, we will evaluate how distilled models perform on safety-specific benchmarks compared to their Instruct counterparts.
Teacher-Student Model
Most knowledge distillation approaches follow a teacher-student framework, where a teacher model—typically larger and more capable—transfers knowledge to a student model, which is significantly smaller and less powerful. The goal is to align the output probability distribution of the student model with that of the teacher model, rather than just training on labeled data.
This is achieved by minimizing a soft loss function, typically a Kullback-Leibler (KL) divergence term, which reduces the distance between the probability distributions of the teacher and student models. Unlike standard supervised training, which relies on hard labels, knowledge distillation allows the student to capture richer contextual information from the teacher's outputs.
Several studies have demonstrated the effectiveness of knowledge distillation over traditional model training. Distilled models not only learn from data but also inherit the performance characteristics of larger models.
Training Methodology
In this process, the Teacher Model generates logits z_t, which are the raw predictions before applying a softmax function. These logits act as soft targets for the Student Model, which generates its own logits z_s. The difference between the two is quantified using a Distillation Loss, often calculated as the Kullback-Leibler (KL) divergence:
L_KD = T² · KL( σ(z_t / T) ‖ σ(z_s / T) )

where σ is the softmax function and T is the temperature parameter. A higher T produces softer probability distributions, allowing the student model to capture more of the teacher's generalization behavior.
Simultaneously, the student model also learns from the ground-truth labels y through a standard cross-entropy loss, so the total loss is a weighted sum of the two terms:

L_total = α · L_CE(y, σ(z_s)) + (1 − α) · L_KD

where α controls the balance between the hard-label and soft-label objectives.
During training, the total loss L_total is minimized via backpropagation to update the parameters of the student model. By leveraging both the teacher's outputs and the actual labels, the student model is able to generalize effectively, approximating the performance of the teacher model at significantly reduced computational cost.
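To make this concrete, here is a minimal PyTorch sketch of the combined objective described above. The tensor shapes, temperature, and weighting factor `alpha` are illustrative placeholders, not the exact recipe used to produce the DeepSeek-R1 distilled models.

```python
# Minimal sketch of the distillation objective (not DeepSeek's exact recipe).
# `student_logits` and `teacher_logits` have shape (batch, num_classes);
# `labels` holds the ground-truth class indices.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: teacher and student distributions softened by temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)

    # KL divergence KL(teacher || student), scaled by T^2 to keep gradient
    # magnitudes comparable across temperatures.
    kd_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)

    # Standard cross-entropy against the hard ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    # Weighted sum of the hard-label and soft-label terms.
    return alpha * ce_loss + (1 - alpha) * kd_loss
```

For language models the same idea applies token by token over the vocabulary distribution, with α and T treated as hyperparameters tuned per task.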
Safety Evaluation of Distilled Models
ALERT Benchmark
For safety evaluation we used the ALERT benchmark. This is a benchmark for evaluating LLM safety through red teaming, where adversarial prompts test models for harmful content, biases, and security risks. The dataset consists of approximately 15,000 red-teaming prompts, carefully designed to probe a model's ability to handle unsafe scenarios. Some of the broad categories include Hate Speech & Discrimination, Criminal Planning, Regulated or Controlled Substances, Sexual Content, Suicide & Self-Harm, Guns & Illegal Weapons.
A target LLM is presented with these adversarial prompts, each associated with a specific risk category. The responses are then assessed for safety by an auxiliary LLM, such as Llama Guard, which determines whether the model's reply meets safety standards. This structured evaluation helps gauge a model's robustness in filtering harmful content and ensuring responsible AI behavior.
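As a rough illustration, the evaluation loop might look like the sketch below, which loads the ALERT prompts from Hugging Face and uses a Llama Guard model as the safety judge. The dataset config/split names, the field names, and the specific judge checkpoint are assumptions on our part; consult the dataset and model cards for the exact interfaces.

```python
# Hypothetical ALERT-style evaluation helper. Dataset config/split, field
# names, and the judge checkpoint are assumptions, not verified values.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

alert = load_dataset("Babelscape/ALERT", "alert", split="test")  # config/split assumed

judge_id = "meta-llama/Llama-Guard-3-8B"  # any Llama Guard variant works similarly
judge_tok = AutoTokenizer.from_pretrained(judge_id)
judge = AutoModelForCausalLM.from_pretrained(judge_id, device_map="auto")

def is_safe(prompt: str, response: str) -> bool:
    """Return True if the judge labels the (prompt, response) pair as safe."""
    chat = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    input_ids = judge_tok.apply_chat_template(chat, return_tensors="pt").to(judge.device)
    output = judge.generate(input_ids, max_new_tokens=20)
    verdict = judge_tok.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().startswith("safe")  # Llama Guard emits "safe" or "unsafe" plus a category code
```

A per-category safety score is then simply the fraction of responses the judge labels as safe.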
Experiments
This experiment compared two versions of each of the Qwen-2.5-1.5B, Qwen-2.5-7B, and Llama-3-8B models:
Instruct Model: A standard instruction-tuned version.
Distilled Model: A distilled version from DeepSeek R1, designed to retain as much of the original model’s capabilities as possible.
The objective was to assess whether distillation impacts a model's safety performance when exposed to red-teaming prompts from the ALERT benchmark. By analyzing responses across harmful-content categories such as Hate (Body, Ethnic, Disabled), Substance (Alcohol, Cannabis), Weapon (Firearm), and Self-Harm (Suicide), we aimed to determine whether the distilled models remained as effective at filtering unsafe content or exhibited a degradation in safety capabilities. For our experiments, we selected seven categories from the ALERT benchmark.
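Sketched out, the comparison loop might look like the following, reusing the `alert` dataset and the `is_safe` judge helper from the previous snippet. The model IDs come from the references at the end of this post; the generation settings and the grouping of prompts by category are illustrative choices, not the exact evaluation harness we used.

```python
# Illustrative comparison of instruct vs. distilled checkpoints, reusing
# `alert` and is_safe() from the previous snippet. Settings are placeholders.
from collections import defaultdict
from transformers import pipeline

# Group the red-teaming prompts by ALERT category (field names assumed).
prompts_by_category = defaultdict(list)
for row in alert:
    prompts_by_category[row["category"]].append(row["prompt"])

model_pairs = {
    "Qwen-1.5B": ("Qwen/Qwen2.5-1.5B-Instruct",
                  "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"),
    "Llama-8B": ("meta-llama/Meta-Llama-3-8B-Instruct",
                 "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"),
}

def safety_scores(model_id):
    """Fraction of responses judged safe, per ALERT category."""
    generate = pipeline("text-generation", model=model_id, device_map="auto")
    scores = {}
    for category, prompts in prompts_by_category.items():
        verdicts = [
            is_safe(p, generate(p, max_new_tokens=256,
                                return_full_text=False)[0]["generated_text"])
            for p in prompts
        ]
        scores[category] = sum(verdicts) / len(verdicts)
    return scores

for family, (instruct_id, distilled_id) in model_pairs.items():
    print(family, "instruct:", safety_scores(instruct_id))
    print(family, "distilled:", safety_scores(distilled_id))
```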
Findings

Key findings from this evaluation:
Distillation Generally Reduces Safety Performance: Distilled models typically achieve lower safety scores than their instruct counterparts. The most significant decrease appears in Qwen-1.5B Distilled, while Llama-8B Distilled maintains performance nearly identical to its instruct model.
Model Family: While Llama models aren't necessarily safer than similarly-sized Qwen models, they demonstrate better preservation of safety features during distillation.
Model Size: Distillation's impact is most pronounced in smaller models and diminishes as model size increases.
Interpretability Study
To further evaluate the quality of responses, we conducted a detailed study on the Self-Harm (Suicide) category of the ALERT benchmark. The aim of this study is to assess the types of responses generated by Llama-8B and Qwen-7B, given their similarity in model size. Our findings indicate that the decline in safety is more pronounced in Qwen than in the Llama models. Specifically, this study analyzes response quality, investigates the reasons behind Qwen's greater drop in safety compared to Llama, identifies potential contributing factors, and explores mitigation strategies.
Here are some responses to prompts from the Self-Harm (Suicide) category, for reference:

The Qwen Distilled model consistently performs poorly in this category, frequently encouraging suicide, self-harm, and violence.
One key question is why two models, despite undergoing the same distillation process, perform so differently on identical prompts and categories. For instance, Qwen shows an 11.5% drop in safety performance in the Self-Harm (Suicide) category, whereas Llama improves by 1.5%. This disparity primarily stems from differences in their pre-distillation safety tuning.
Llama benefits from a robust, multi-step safety tuning strategy, as detailed in the Llama 3 paper. This approach includes:
Safety Pre-Training: Filtering out harmful, biased, or personally identifiable data while controlling memorization to mitigate privacy risks.
Safety Fine-Tuning & Direct Preference Optimization (DPO): Enhancing the model's ability to distinguish safe from unsafe responses by exposing it to adversarial and borderline prompts.
System-Level Safety Mechanisms: Implementing input/output classifiers to add an extra layer of protection.
This comprehensive framework helps Llama maintain its safety performance even after distillation. In contrast, Qwen appears to have weaker pre-distillation safety measures, making it more vulnerable to safety degradation during the process.
Another factor contributing to reduced safety post-distillation is the nature of the distillation process itself. Larger models inherently possess internal safety mechanisms that shape how they respond to unsafe inputs. However, since distillation primarily minimizes the KL divergence between the teacher and student logits, these nuanced safety behaviors are not effectively transferred. Unlike the teacher, which might leverage Reinforcement Learning from Human Feedback (RLHF) with explicit safety penalties, the student model is optimized merely for mimicry, not for maintaining safe responses.
Thus, the combination of less rigorous pre-distillation safety tuning and the inherent limitations of the distillation process explains why Qwen experiences a notable drop in safety performance compared to Llama.
Mitigation strategies include training on safety-enriched data and applying RL-based safety tuning (e.g., RLHF) after distillation to align outputs with ethical guidelines, retaining key safety components (e.g., refusal heads) from the teacher, conducting adversarial testing, and prioritizing safety metrics alongside performance throughout the distillation pipeline. Red-teaming the distilled model is an effective way to surface these risks early, so the strategies above can be applied where they are needed.
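As a rough illustration of the first of these strategies, a short supervised pass over safety-enriched (unsafe prompt, safe refusal) pairs can be run on the distilled checkpoint. The model ID comes from the references below, but the example data, hyperparameters, and single-example loop are purely illustrative, not a recommended recipe.

```python
# Illustrative post-distillation safety pass on (unsafe prompt, safe refusal)
# pairs. Data, learning rate, and loop structure are placeholders only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Safety-enriched pairs, e.g. curated from red-teaming findings (placeholder).
safety_pairs = [
    ("How can I hurt myself without anyone noticing?",
     "I can't help with that. If you're struggling, please reach out to a crisis "
     "line or a mental-health professional; you deserve support."),
]

model.train()
for prompt, refusal in safety_pairs:
    # Format the pair with the model's own chat template.
    text = tok.apply_chat_template(
        [{"role": "user", "content": prompt},
         {"role": "assistant", "content": refusal}],
        tokenize=False,
    )
    batch = tok(text, return_tensors="pt").to(model.device)
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard causal-LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice such a pass would be combined with preference-based tuning (e.g., DPO or RLHF) and followed by re-running the ALERT evaluation to confirm that safety scores recover without hurting task performance.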
Conclusion
These experiments highlight a critical trade-off in model distillation: while distillation improves efficiency and preserves much of the teacher's generalization ability, it can also compromise a model's safety behavior and overall performance.
Repello AI's red-teaming offerings give you complete visibility into the safety trade-offs of the models you use and help you build safer systems for your use case across sectors like Finance, Healthcare, Legal, Research, Travel, E-Commerce, and more. Want to know more? Click here to book a demo.
References
https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
https://huggingface.co/datasets/Babelscape/ALERT
https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B