Gradient Inversion Attack

Published March 24, 2025


This is a replication of the paper *Deep Leakage from Gradients* (Zhu et al., NeurIPS 2019).

The gist of the paper is that it is possible to reverse-engineer the training data from the gradients a neural network produces during training.

Given a machine learning model $f(w, x)$ with weights $w$, and a ground-truth training example $(x, y)$, we can use the gradients of the loss with respect to the weights to reconstruct that example.
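To make this concrete, here is a minimal PyTorch sketch of the setup. The architecture, image size, and class count are my assumptions, not the post's actual model; the sigmoid activations follow the paper's requirement that the model be twice differentiable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# A small model f(w, x); this tiny sigmoid network is an assumption.
model = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=5, padding=2),
    nn.Sigmoid(),  # sigmoid keeps the model twice differentiable
    nn.Flatten(),
    nn.Linear(12 * 32 * 32, 10),
)

# One ground-truth training example (x, y); random stand-ins here.
x = torch.randn(1, 3, 32, 32)
y = torch.tensor([3])

# The "leaked" gradient of the training loss w.r.t. the weights, ∇W.
loss = F.cross_entropy(model(x), y)
grad_w = [g.detach() for g in torch.autograd.grad(loss, model.parameters())]
```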

To do this, we initialize a random image $x_i$ and a random label $y_i$.
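The dummy variables might be initialized like so. Relaxing the label to a logit vector, so it can be optimized by gradient descent, follows the paper's approach, though the shapes here are my assumptions:

```python
import torch

# Randomly initialized dummy data, optimized in place of the real example.
x_i = torch.randn(1, 3, 32, 32, requires_grad=True)
# The label is relaxed to a vector of logits so it stays differentiable.
y_i = torch.randn(1, 10, requires_grad=True)
```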

We are then given the gradient of the loss with respect to the model's weights, $\nabla W$, computed on the real data. (In the paper's setting, this is what a participant shares during distributed training.)

We then run a forward and backward pass on the dummy data $(x_i, y_i)$ to get its gradient with respect to the same weights, $\nabla W_i$.

We then define a gradient-matching loss between the dummy gradients and the real ones, $L = \|\nabla W_i - \nabla W\|^2$.
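Continuing the sketch, a hypothetical helper `gradient_matching_loss` can compute $\nabla W_i$ and the matching loss in one place. The key detail is `create_graph=True`, which keeps the dummy gradients inside the autograd graph:

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, grad_w, x_i, y_i):
    # Forward pass on the dummy data; softmax turns the dummy logits y_i
    # into a soft label so the cross-entropy stays differentiable in y_i.
    pred = model(x_i)
    dummy_loss = torch.sum(F.softmax(y_i, dim=-1) * -F.log_softmax(pred, dim=-1))
    # Dummy gradients ∇W_i; create_graph=True keeps them differentiable
    # so that L itself can later be differentiated w.r.t. x_i and y_i.
    grad_w_i = torch.autograd.grad(dummy_loss, model.parameters(), create_graph=True)
    # L = ||∇W_i − ∇W||², summed over every weight tensor.
    return sum(((gi - g) ** 2).sum() for gi, g in zip(grad_w_i, grad_w))
```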

We then compute $\frac{\partial L}{\partial x_i}$ and $\frac{\partial L}{\partial y_i}$ and use them to update the dummy image and label. Note that $L$ is itself a function of gradients, so this step differentiates through a gradient, i.e. it requires second-order autodiff.
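The outer optimization loop, again as a sketch continuing the snippets above. The paper uses L-BFGS; the learning rate and iteration count below are my guesses, not its reported settings:

```python
import torch

optimizer = torch.optim.LBFGS([x_i, y_i], lr=1.0)

for step in range(300):
    def closure():
        optimizer.zero_grad()
        loss = gradient_matching_loss(model, grad_w, x_i, y_i)
        loss.backward()  # fills x_i.grad and y_i.grad via second-order autodiff
        return loss
    loss = optimizer.step(closure)
    if step % 50 == 0:
        print(f"step {step:3d}  matching loss {loss.item():.6f}")
```

Because L-BFGS re-evaluates the closure several times per step, each outer iteration performs multiple of these second-order backward passes.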

For this replication, I only care about recovering the image, and it works beautifully!

Reference image: (figure)

Reconstructed image: (figure)

For some reason, the reconstruction shows a 3x3 tiling artifact that I haven't been able to explain.