Working with Pandas MultiIndex DataFrames: Reading and Writing to CSV and HDF5
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Rationale
Certain loss functions, such as the negative evidence lower bound (NELBO) in variational inference, are generally analytically intractable and thus unavailable in closed form. As such, we might need to resort to stochastic estimates of the loss function. In these situations, it is very important to study and understand the robustness of the estimates we are making, particularly in terms of bias and variance. When proposing a new estimator, we may be interested in evaluating the loss at a fine-grained level - not only per batch, but perhaps even per data point.
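As a refresher (the notation here is standard and not tied to anything computed later in this notebook), for a latent variable model $p_\theta(\mathbf{x}, \mathbf{z})$ with approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$, the NELBO for a single data point $\mathbf{x}$ is

$$
\mathcal{L}(\theta, \phi; \mathbf{x}) =
- \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}
\left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} \right],
$$

and in practice the expectation is replaced by a Monte Carlo estimate based on samples $\mathbf{z}^{(s)} \sim q_\phi(\mathbf{z} \mid \mathbf{x})$,

$$
\hat{\mathcal{L}}(\theta, \phi; \mathbf{x}) =
- \frac{1}{S} \sum_{s=1}^{S}
\log \frac{p_\theta(\mathbf{x}, \mathbf{z}^{(s)})}{q_\phi(\mathbf{z}^{(s)} \mid \mathbf{x})},
$$

whose statistical properties (variance, and bias for more elaborate estimators) are exactly the kinds of things we would like to record and inspect per epoch, per batch, and per data point.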
This notebook explores storing the recorded losses in Pandas DataFrames. The recorded losses are 3-dimensional, with dimensions corresponding to epochs, batches, and data points; specifically, they are of shape (n_epochs, n_batches, batch_size). Instead of using the deprecated Panel functionality from Pandas, we explore the preferred MultiIndex DataFrame.
Lastly, we play around with various data serialization formats supported out-of-the-box by Pandas. This might be useful if the training is GPU-intensive: the script runs and records the losses remotely on a supercomputer, and we must write the results to file, download them, and finally analyze them locally. This is usually trivial, but it is unclear what the behaviour is for more complex MultiIndex DataFrames. We restrict our attention to the CSV format, which is human-friendly but very slow and inefficient, and to HDF5, which is diametrically opposed - essentially inscrutable to humans, but very fast and taking up far less space.
Synthetic Data
# create some noise of shape (n_epochs, n_batches, batch_size)
a = np.random.randn(50, 600, 100)
a.shape
# create some noise with higher variance and add bias.
b = 2. * np.random.randn(*a.shape) + 1.
b.shape
# manufacture some loss function
# there are n_epochs * n_batches * batch_size
# recorded values of the loss
loss = 10 / np.linspace(1, 100, a.size)
loss.shape
MultiIndex DataFrame
# we will create the indices from the
# product of these iterators
list(map(range, a.shape))
# create the MultiIndex
index = pd.MultiIndex.from_product(
    list(map(range, a.shape)),
    names=['epoch', 'batch', 'datapoint']
)
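As a quick sanity check (a trivial sketch), the product index should have exactly one entry per recorded loss value:
# one (epoch, batch, datapoint) tuple per entry of the loss array
len(index) == a.size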
# create the dataframe that records the two losses
df = pd.DataFrame(
    dict(loss1=loss + np.ravel(a),
         loss2=loss + np.ravel(b)),
    index=index
)
df
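With the MultiIndex in place, slicing out portions of the losses is straightforward using pandas' standard indexing; the snippets below are purely illustrative and not used further on.
# all losses recorded during epoch 0
df.loc[0]

# losses for epoch 0, batch 3 (one row per data point)
df.loc[(0, 3)]

# cross-section: data point 0 across every epoch and batch
df.xs(0, level='datapoint')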
Visualization
In this contrived scenario, loss2 is more biased and has higher variance.
# some basic plotting
fig, ax = plt.subplots()
df.groupby(['epoch', 'batch']).mean().plot(ax=ax)
plt.show()
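The plot only shows the per-batch means; to put rough numbers on the bias and spread, we can also aggregate directly (a small sketch using standard pandas aggregation):
# overall mean and standard deviation of each loss
df.agg(['mean', 'std'])

# per-batch spread, analogous to the means plotted above
df.groupby(['epoch', 'batch']).std()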
CSV Read/Write
%%time
df.to_csv('losses.csv')
!ls -lh losses.csv
%%time
df_from_csv = pd.read_csv('losses.csv', index_col=['epoch', 'batch', 'datapoint'], float_precision='high')
# does not recover exactly due to insufficient floating point precision
df_from_csv.equals(df)
# but it has recovered it up to some tiny epsilon
((df-df_from_csv)**2 < 1e-25).all()
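One possible remedy, not benchmarked here, is pandas' float_precision='round_trip' parser option, which may recover the values exactly depending on how to_csv formatted the floats; the variable name below is just for illustration.
df_from_csv_rt = pd.read_csv('losses.csv',
                             index_col=['epoch', 'batch', 'datapoint'],
                             float_precision='round_trip')
df_from_csv_rt.equals(df)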
HDF5 Read/Write
HDF5 writing is orders of magnitude faster.
%%time
df.to_hdf('store.h5', key='losses')
Furthermore, the file sizes are significantly smaller.
!ls -lh store.h5
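If disk space matters even more, to_hdf also accepts PyTables compression settings via complevel and complib; the following is a sketch, not benchmarked here, writing to a separate illustrative file.
# write with zlib compression at the maximum level
df.to_hdf('store_compressed.h5', key='losses', complevel=9, complib='zlib')
!ls -lh store_compressed.h5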
%%time
df_from_hdf = pd.read_hdf('store.h5', key='losses')
Lastly, it is far more numerically precise; in fact, the DataFrame is recovered exactly.
df.equals(df_from_hdf)
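As a final sanity check, the contents of the HDF5 file can also be inspected directly through pd.HDFStore (a minimal sketch):
# list the keys stored in the file
with pd.HDFStore('store.h5') as store:
    print(store.keys())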