
Distilling the Knowledge in a Neural Network

Overview

Knowledge Distillation is a method of training a small network using the knowledge in a trained large network; that is, distilling the knowledge out of the large network.

When trained directly on the data and labels, a large model with regularization or an ensemble of models (using dropout) generalizes better than a small model. However, with the help of a large model, a small model can be trained to generalize better. Smaller models are better in production: they are faster, need less compute, and use less memory.

The output probabilities of a trained model carry more information than the labels, because the model also assigns non-zero probabilities to incorrect classes. These probabilities tell us how likely a sample is to belong to certain classes. For example, when classifying digits, given an image of the digit 7, a well-generalized model gives a high probability to 7 and a small but non-zero probability to 2, while assigning almost zero probability to the other digits. Distillation uses this information to train the small model better.

Main Content

Neural networks typically use a softmax to produce class probabilities. We introduce a distillation temperature \(T\) to produce a softer probability distribution over the classes:

\[ q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} \]
  • \(T=1\) gives the ordinary softmax class probabilities. When \(T<1\), the distribution is sharper than the original, and as \(T \to 0\) the softmax output approaches a hard target; when \(T>1\), the distribution is softer than the original (see the numerical sketch after this list).

  • As \(T \to +\infty\), the softened output approaches a uniform distribution over the classes; in this high-temperature limit, matching the softened probabilities amounts to matching the network's output logits directly.
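
A minimal numerical sketch of the temperature's effect; the logits and temperatures below are arbitrary illustrative values, not taken from the paper:

import torch
import torch.nn.functional as F

logits = torch.tensor([5.0, 2.0, -1.0])  # arbitrary example logits

for T in [0.5, 1.0, 5.0, 20.0]:
    print(T, F.softmax(logits / T, dim=0))

# Approximate output:
# T=0.5  -> [0.998, 0.002, 0.000]   nearly a hard target
# T=1.0  -> [0.950, 0.047, 0.002]   ordinary softmax
# T=5.0  -> [0.541, 0.297, 0.163]   softer distribution
# T=20.0 -> [0.384, 0.331, 0.285]   approaching uniform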

With a distillation temperature \(T\) greater than 1, the softmax output distribution becomes smoother and its entropy increases, so during training the student model pays more attention to the negative labels, especially those whose probabilities are noticeably higher than the average. Concretely:

  • When the student model is small, the temperature can be lowered so that interference from the negative labels is reduced.

  • When we want the student to learn information from the negative labels, the temperature \(T\) can be raised.

For example, in handwritten digit recognition on MNIST, suppose one input "2" looks somewhat like a "3"; then the softmax output assigns a higher probability to "3" than to the other negative classes. Another "2" may look more like a "7", so that sample gets a higher probability on "7" than on the other negative classes. The hard targets of these two "2"s are identical, but their soft targets differ, which shows that soft targets carry more information than hard targets.
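
A toy numerical illustration of this point (the teacher logits below are invented for illustration, not from a trained model): two samples share the same hard target but produce different soft targets.

import torch
import torch.nn.functional as F

T = 3.0  # an arbitrary temperature for the illustration

# Hypothetical teacher logits for two images of the digit "2" (classes 0-9)
two_like_3 = torch.tensor([0., 0., 6., 4., 0., 0., 0., 0., 0., 0.])  # resembles a "3"
two_like_7 = torch.tensor([0., 0., 6., 0., 0., 0., 0., 4., 0., 0.])  # resembles a "7"

# Identical hard predictions (argmax is class 2 for both) ...
print(two_like_3.argmax().item(), two_like_7.argmax().item())
# ... but different soft targets: the leaked probability goes mostly to 3 vs. 7
print(F.softmax(two_like_3 / T, dim=0))
print(F.softmax(two_like_7 / T, dim=0))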

The teacher model is a large, complex model with strong performance, while the student model is a lightweight model. The goal is for the trained student model to match the teacher's performance, or even exceed it. The loss for training the student has two parts: a soft loss, computed against the teacher's outputs softened with the distillation temperature \(T\), and a hard loss, the ordinary training loss against the true labels. The total loss is a weighted combination of the two:

\[ Loss = (1-\alpha) T^2 L_{soft} + \alpha L_{hard} \]

We found that the best results are obtained with a relatively low weight on the second (hard-target) objective, i.e. a small \(\alpha\). Since the gradients produced by the soft targets are scaled by \(1/T^{2}\), it is important to multiply the soft-target term by \(T^{2}\) when using both soft and hard targets. This ensures that the relative contributions of the hard and soft targets stay roughly unchanged if the distillation temperature is changed during experimentation.
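
To see where the \(1/T^{2}\) scaling comes from, following the derivation in Section 2.1 of the paper, the gradient of the soft-target loss with respect to a student logit \(z_i\) is

\[ \frac{\partial L_{soft}}{\partial z_i} = \frac{1}{T}(q_i - p_i) \approx \frac{1}{NT^{2}}(z_i - v_i) \]

where \(v_i\) are the teacher logits, \(N\) is the number of classes, and the approximation holds when \(T\) is large compared with the magnitude of the (zero-mean) logits. Because this gradient scales as \(1/T^{2}\), multiplying \(L_{soft}\) by \(T^{2}\) keeps the soft and hard contributions balanced as \(T\) changes.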

Let \(P\) be the teacher's soft targets and \(Q\) the student's predictions; the KL divergence between them is used as the loss:

\[ D_{KL}(P||Q) = \sum_{i} P_{i} \log (\frac{P_{i}}{Q_{i}}) = \sum_{i} P_{i} (\log(P_i) - \log(Q_i)) \]
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistillKL(nn.Module):
    """Distilling the Knowledge in a Neural Network"""
    def __init__(self, T):
        super(DistillKL, self).__init__()
        self.T = T  # distillation temperature

    def forward(self, y_s, y_t):
        '''
        y_s: student model output (logits)
        y_t: teacher model output (logits)
        '''
        # Temperature-softened log-probabilities of the student and
        # probabilities of the teacher
        p_s = F.log_softmax(y_s / self.T, dim=1)
        p_t = F.softmax(y_t / self.T, dim=1)
        # Sum the KL divergence over classes, scale by T^2 to keep gradient
        # magnitudes comparable across temperatures, and average over the batch
        loss = F.kl_div(p_s, p_t, reduction='sum') * (self.T ** 2) / y_s.shape[0]
        return loss
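
A quick sanity check of the module above; the batch size, class count, and temperature are arbitrary values for illustration:

criterion = DistillKL(T=4.0)            # example temperature
y_student = torch.randn(32, 10)         # random logits: batch of 32 samples, 10 classes
y_teacher = torch.randn(32, 10)
print(criterion(y_student, y_teacher))  # scalar tensor holding the distillation loss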

Concise Implementation (nn.labml.ai)

import torch
import torch.nn.functional
from torch import nn

# Imports from the labml framework used by this experiment
# (module paths assumed to follow the labml_nn repository layout)
from labml import experiment, tracker
from labml.configs import option
from labml_helpers.train_valid import BatchIndex
from labml_nn.distillation.large import LargeModel
from labml_nn.distillation.small import SmallModel
from labml_nn.experiments.cifar10 import CIFAR10Configs

class Configs(CIFAR10Configs):
    """
    ## Configurations

    This extends from [`CIFAR10Configs`](../experiments/cifar10.html) which defines all the
    dataset related configurations, optimizer, and a training loop.
    """
    # The small model
    model: SmallModel
    # The large model
    large: LargeModel
    # KL Divergence loss for soft targets
    kl_div_loss = nn.KLDivLoss(log_target=True)
    # Cross entropy loss for true label loss
    loss_func = nn.CrossEntropyLoss()
    # Temperature, $T$
    temperature: float = 5.
    # Weight for soft targets loss.
    #
    # The gradients produced by soft targets get scaled by $\frac{1}{T^2}$.
    # To compensate for this the paper suggests scaling the soft targets loss
    # by a factor of $T^2$
    soft_targets_weight: float = 100.
    # Weight for true label cross entropy loss
    label_loss_weight: float = 0.5

    def step(self, batch: any, batch_idx: BatchIndex):
        """
        ### Training/validation step

        We define a custom training/validation step to include the distillation
        """

        # Training/Evaluation mode for the small model
        self.model.train(self.mode.is_train)
        # Large model in evaluation mode
        self.large.eval()

        # Move data to the device
        data, target = batch[0].to(self.device), batch[1].to(self.device)

        # Update global step (number of samples processed) when in training mode
        if self.mode.is_train:
            tracker.add_global_step(len(data))

        # Get the output logits, $v_i$, from the large model
        with torch.no_grad():
            large_logits = self.large(data)

        # Get the output logits, $z_i$, from the small model
        output = self.model(data)

        # Soft targets
        # $$p_i = \frac{\exp (\frac{v_i}{T})}{\sum_j \exp (\frac{v_j}{T})}$$
        soft_targets = nn.functional.log_softmax(large_logits / self.temperature, dim=-1)
        # Temperature adjusted probabilities of the small model
        # $$q_i = \frac{\exp (\frac{z_i}{T})}{\sum_j \exp (\frac{z_j}{T})}$$
        soft_prob = nn.functional.log_softmax(output / self.temperature, dim=-1)

        # Calculate the soft targets loss
        soft_targets_loss = self.kl_div_loss(soft_prob, soft_targets)
        # Calculate the true label loss
        label_loss = self.loss_func(output, target)
        # Weighted sum of the two losses
        loss = self.soft_targets_weight * soft_targets_loss + self.label_loss_weight * label_loss
        # Log the losses
        tracker.add({"loss.kl_div.": soft_targets_loss,
                     "loss.nll": label_loss,
                     "loss.": loss})

        # Calculate and log accuracy
        self.accuracy(output, target)
        self.accuracy.track()

        # Train the model
        if self.mode.is_train:
            # Calculate gradients
            loss.backward() 
            # Take optimizer step
            self.optimizer.step()
            # Log the model parameters and gradients on last batch of every epoch
            if batch_idx.is_last:
                tracker.add('model', self.model)
            # Clear the gradients
            self.optimizer.zero_grad()

        # Save the tracked metrics
        tracker.save()


@option(Configs.large)
def _large_model(c: Configs):
    """
    ### Create large model
    """
    return LargeModel().to(c.device)


@option(Configs.model)
def _small_student_model(c: Configs):
    """
    ### Create small model
    """
    return SmallModel().to(c.device)


def get_saved_model(run_uuid: str, checkpoint: int):
    """
    ### Load [trained large model](large.html)
    """

    from labml_nn.distillation.large import Configs as LargeConfigs

    # In evaluation mode (no recording)
    experiment.evaluate()
    # Initialize configs of the large model training experiment
    conf = LargeConfigs()
    # Load saved configs
    experiment.configs(conf, experiment.load_configs(run_uuid))
    # Set models for saving/loading
    experiment.add_pytorch_models({'model': conf.model})
    # Set which run and checkpoint to load
    experiment.load(run_uuid, checkpoint)
    # Start the experiment - this will load the model, and prepare everything
    experiment.start()

    # Return the model
    return conf.model


def main(run_uuid: str, checkpoint: int):
    """
    Train a small model with distillation
    """
    # Load saved model
    large_model = get_saved_model(run_uuid, checkpoint)
    # Create experiment
    experiment.create(name='distillation', comment='cifar10')
    # Create configurations
    conf = Configs()
    # Set the loaded large model
    conf.large = large_model
    # Load configurations
    experiment.configs(conf, {
        'optimizer.optimizer': 'Adam',
        'optimizer.learning_rate': 2.5e-4,
        'model': '_small_student_model',
    })
    # Set model for saving/loading
    experiment.add_pytorch_models({'model': conf.model})
    # Start experiment from scratch
    experiment.load(None, None)
    # Start the experiment and run the training loop
    with experiment.start():
        conf.run()


#
if __name__ == '__main__':
    main('d46cd53edaec11eb93c38d6538aee7d6', 1_000_000)