# Inference in Variational Autoencoders with Different Monte Carlo Sample Sizes

In a previous post, I demonstrated how to leverage Keras' modular design to implement variational autoencoders in a way that makes it easy to tweak hyperparameters, adapt to it to other related models, and extend it to the more sophisticated methods proposed in the current research.

Recall that we optimize the generally intractable evidence lower bound (ELBO) using reparameterization gradients, which approximates the expectation of gradients with Monte Carlo (MC) samples. In their original paper, Kingma and Welling (2014) [1] remark that an MC sample size of 1 is adequate for a sufficiently large batch size (~100). Obviously, this is highly dependent on the problem (more specifically the likelihood). In general, it is important to experiment with different MC sample sizes and observe the various effects it has on training stability. In this short post, we demonstrate how to tweak the MC sample size under our basic framework.

Thanks to broadcasting, the changes required are very minimal, but have vast implications. The change we make is as simple as:

eps = Input(shape=(mc_samples, latent_dim))


That is, we make the shape of the noise from the base distribution (batch_size, mc_samples, latent_dim). This is equivalent to drawing mc_samples of noise vectors for each observation in the batch, rather than just a single sample.

Everything else in our model specification can remain exactly the same, since the Multiply layer will automatically broadcast:

• eps of shape (batch_size, mc_samples, latent_dim) with
• sigma of shape (batch_size, latent_dim)

and thereby output tensor of shape (batch_size, mc_samples, latent_dim). Similarly the Add layer will automatically broadcast:

• the previous output of shape (batch_size, mc_samples, latent_dim) with
• mu of shape (batch_size, latent_dim)

to finally output latent variables z with shape (batch_size, mc_samples, latent_dim), corresponding to mc_samples of latent vectors with length latent_dim for every observation in the batch. This is illustrated in the figure below.

Reparameterization with latent_dim=2, mc_samples=25.

Exactly as before, we specify the output of the variational autoencoder as the output of the latent variable z fed through some generative model (a deep latent Gaussian model),

decoder = Sequential([
Dense(intermediate_dim, input_dim=latent_dim, activation='relu'),
Dense(original_dim, activation='sigmoid')
])

x_mean = decoder(z)

vae = Model(inputs=[x, eps], outputs=x_mean)
vae.compile(optimizer='rmsprop', loss=nll)


Note the specification of input_dim=latent_dim. It tells this and all subsequent layers to operate only on this dimension. Hence, for each observation, we sample mc_samples latent variables, and propagate these through the generative model to obtain mc_samples predictions/observations. Please see the figure below.

Reparameterization with latent_dim=2, mc_samples=25. For each input observation, we output mc_samples reconstructions.

In particular, notice that the input shape for each observation x in the batch is original_dim = 784 (28 * 28), and that the output for each observation in the batch has shape (25, 784), corresponding to mc_samples = 25 samples from the predictive distribution. Lastly, observe that until the Multiply layer, all inputs and outputs were rank 2 tensors, consisting of a variable batch_size dimension, and a feature dimension. The MC sample dimension is introduced by the eps noise input layer, which has shape (mc_samples, latent_dim) = (25, 2), and is propagated throughout all subsequent layers.

## Model fitting

At this stage, it is important to recognize the distinction between the log likelihood of the mean output, versus the mean of the log likelihood over the outputs. Since we are interested in estimating the expected log likelihood over the approximate posterior distribution, we require the latter.

Now, because the output of our model is now a rank 3 tensor, to use methods like fit and evaluate, we must ensure the targets are of a shape that can broadcast with the shape of our output, namely (n_samples, mc_samples, original_dim). This is easily achieved by adding a dimension to the target array with

np.expand_dims(x_train, axis=1)


which has shape (n_samples, 1, original_dim). Now the loss function can broadcast this with the model output to yield (n_samples, mc_samples) loss values. Methods like fit and evaluate will automatically aggregate this into a single scalar loss value, e.g.

>>> vae.evaluate(x_test,
...              np.expand_dims(x_test, axis=1),
...              batch_size=batch_size)
10000/10000 [==============================] - 0s 43us/step
543.99742309570308


Finally, fitting the model simply consists of:

vae.fit(
x_train,
np.expand_dims(x_train, axis=1),
shuffle=True,
epochs=epochs,
batch_size=batch_size,
validation_data=(
x_test,
np.expand_dims(x_test, axis=1)
)
)


Warning

Keras 2.1.0 introduced breaking changes which tightens the constraint on the targets and the predicted outputs to have exactly the same shape. This is not a showstopper, since we can just tile the array across the MC sample dimension/channel,

np.tile(np.expand_dims(x_test, axis=1),
reps=(1, mc_samples, 1))


or equivalently,

np.rollaxis(np.tile(x_test, reps=(mc_samples, 1, 1)), axis=1)


This is neither as slick nor as space efficient, but it gets the job done.

## Distribution over Reconstructions

Let's choose an arbitrary observation from the test set and feed it through our autoencoder model vae. This yields mc_samples=25 samples from the predictive distribution over reconstructions.

>>> x = x_test[0] # choose arbitrary observation
>>> recons = np.squeeze(vae.predict(np.atleast_2d(x)))
>>> recons.shape
(25, 784)


We can visualize these:

plt.figure(figsize=(6, 4))
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.imshow(np.block(list(map(list, recons.reshape(5, 5, 28, 28)))),
cmap='gray')
plt.show()


The output of this is shown in the figure below. You may need to squint closely to see that the sampled reconstructions are different to each other.

5x5 grid reconstructions for a given observation.

As a sanity check,

>>> np.all(recons[0] == recons[-1])
False
>>> np.all(recons[1:] == recons[:-1], axis=1)
array([False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False], dtype=bool)


## Summary

In this post, we demonstrated how simple it is to extend our basic framework to allow for specification of arbitrary Monte Carlo samples sizes. We simply leveraged Keras' ability to broadcast inputs with its layers and let it propagate the additional MC sample channel/dimension to the final output. Next, we applied a simple trick so that the target array broadcasts with the final output, which allows us to approximate the expected log likelihood using the Monte Carlo samples. Finally, we demonstrated how we can use our fitted model to obtain a distribution over reconstructions. This approach is appealing not only for its simplicity, but its applicability to large class of problems with various likelihoods.

In a future post, we will use methods discussed here to implement and explore Importance Weighted Autoencoders [2], which uses importance sampling to approximate the ELBO.

## References

 [1] D. P. Kingma and M. Welling, "Auto-Encoding Variational Bayes," in Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
 [2] Y. Burda, R. Grosse, and R. Salakhutdinov, "Importance Weighted Autoencoders," in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.

## Appendix

Below you can find: