Tuning (Variational) Autoencoder architectures with Optuna

Max Joas
12 min read · Jul 29, 2023


TL;DR

You can use Optuna to fine-tune the architecture of your autoencoder models. The code is here.

Why should you care?

In the image below, you see three different autoencoder architectures. None of them is inherently superior to the others; it depends on the task at hand. However, you often don't know beforehand which architecture to use. Of course, you can go with your gut and experience, but there is also a more structured way.

Figure 1: Schematic autoencoder architectures

You can use hyperparameter tuning to find your model architecture.

It is common to tune hyperparameters such as the learning rate or the dropout rate, but tuning the model architecture itself is less common. In this article, I'll quickly show how to use the Optuna framework to tune the model architecture of autoencoders and variational autoencoders.

This can get tricky for autoencoders in a few places, but we'll cover that. So let's start.

What is Optuna?

[Optuna](https://optuna.org) is a Python library for hyperparameter optimization. You can use it with any deep-learning framework. I'll use PyTorch for this example, but the choice of framework does not matter.

You can install Optuna by running:

pip install optuna
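
If you have never used Optuna before, here is a minimal toy sketch of the basic workflow: Optuna repeatedly calls an objective function, and every call receives a trial object that suggests parameter values. The quadratic objective is just a stand-in for the loss we will minimize later.

import optuna

# Toy objective: Optuna searches for the x that minimizes (x - 2)^2
def toy_objective(trial):
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(toy_objective, n_trials=20)
print(study.best_params)  # should be close to {'x': 2.0}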

What are autoencoders?

Since you probably googled something like 'hyperparameter tuning for autoencoder', I assume you're familiar with autoencoders. So I'll skip this part and leave you with a great video by Luis Serrano, which gives an intuitive understanding of the topic.

High-level structure

Objective function and trials

The building blocks of Optuna are an objective function and a trial. The objective function tries to minimize a criterion (usually some kind of loss) by trying out different configurations, called trials.

The trial object can suggest different values for hyperparameters within a given range:

trial.suggest_float("lr", 0.00001, 0.001)

We can also use this principle to suggest the number of layers and the size of each layer in our autoencoder network.

When constructing our model architecture, we loop over the suggested number of layers and add them to the model. In code, this looks as follows:

Classes

First, we define a basic configuration dictionary for the trial object to suggest from, and show the class structure *(full code at the end of the article)*:

import torch.nn as nn

# This could come from a YAML file in a production environment.
# In the config dict we set the range of values for the trial object to suggest from.
cfg = {"LAYERS_LOWER_LIMIT": 2,
       "LAYERS_UPPER_LIMIT": 6,
       "DROPOUT_LOWER_LIMIT": 0.05,
       "DROPOUT_UPPER_LIMIT": 0.2,
       "LR_LOWER_LIMIT": 0.001,
       "LR_UPPER_LIMIT": 0.05,
       "EPOCHS": 50}


class AETune(nn.Module):
    def __init__(self, trial, input_dim, cfg):
        super(AETune, self).__init__()
        self.trial = trial          # Optuna trial object
        self.cfg = cfg              # configuration dict
        self.input_dim = input_dim  # number of features of the input data
        self.n_layers = self.trial.suggest_int('n_layers',
                                               cfg['LAYERS_LOWER_LIMIT'],
                                               cfg['LAYERS_UPPER_LIMIT'])
        encoder_layers = []               # here we collect all layers of the encoder
        prev_dim_size = [self.input_dim]  # size of the previous layer, initialized with input_dim

Encoder and decoder logic

The setup above should look familiar. Now we will implement the logic for tuning the number and size of the layers. I'll add extensive comments to the code snippet for better comprehension. We are still in the `__init__` method of the class above:

# ENCODER ------------------------------------------------------------
for i in range(self.n_layers):
    prev_layer_size = prev_dim_size[-1]  # equals the number of input features in the first iteration
    # Next we want the trial object to suggest the size of the next layer.
    # Therefore, we need to set a range for the object to choose from.
    # We make the next layer size dependent on the previous layer size,
    # so we set the upper limit to 1/4 of the previous size and the lower limit to 1/128.
    # These limits were chosen arbitrarily.
    # Lastly, we ensure that the layer size never gets below 4 (also arbitrary and should depend on the data).
    lower = max(int(prev_layer_size / 128), 4)
    upper = max(int(prev_layer_size / 4), 4)
    next_layer_size = self.trial.suggest_int(
        f'n_units_l{i}', lower, upper)
    # We collect the layers in a list to later combine them into one Sequential module after the loop.
    encoder_layers.append(nn.Linear(prev_layer_size, next_layer_size))
    prev_dim_size.append(next_layer_size)  # reference for the size of the next layer
    # We don't want a ReLU or dropout layer right before the latent space.
    if not i + 1 == self.n_layers:
        encoder_layers.append(nn.ReLU())
        p = self.trial.suggest_float(f'dropout_l{i}',
                                     cfg['DROPOUT_LOWER_LIMIT'],
                                     cfg['DROPOUT_UPPER_LIMIT'])
        encoder_layers.append(nn.Dropout(p))
self.encoder = nn.Sequential(*encoder_layers)

Now we have our encoder structure in place. For the decoder, we need to build the layers in the reverse order of the encoder:

# DECODER -------------------------------------------------------------
decoder_layers = []  # again, we collect all layers to put them into a Sequential module later
# We iterate over the encoder layers in reverse order.
for layer in encoder_layers[::-1]:
    if not isinstance(layer, nn.Linear):
        # We skip non-linear layers because below we assume that each layer
        # has the attributes `out_features` and `in_features`.
        # To mirror the encoder structure, we add activation and dropout after adding the Linear layer.
        continue  # skipping dropout and activation layers
    if layer.in_features == self.input_dim:  # sigmoid for the last layer
        decoder_layers.append(
            nn.Linear(layer.out_features, layer.in_features))
        decoder_layers.append(nn.Sigmoid())
        continue
    # Because of the x-shaped architecture of the autoencoder,
    # the in-features of an encoder layer are the out-features of the mirrored decoder layer and vice versa.
    decoder_layers.append(
        nn.Linear(layer.out_features, layer.in_features))
    decoder_layers.append(nn.ReLU())
    # Note: `i` still holds its last value from the encoder loop,
    # so all decoder dropout layers share a single suggested rate.
    p = self.trial.suggest_float(f'dropout_l{i}',
                                 cfg['DROPOUT_LOWER_LIMIT'],
                                 cfg['DROPOUT_UPPER_LIMIT'])
    decoder_layers.append(nn.Dropout(p))
self.decoder = nn.Sequential(*decoder_layers)

Now we've built the encoder and the decoder architectures by letting the trial object suggest how many layers we should use and how many nodes each layer should have. We only specified a range of values to suggest from. Then we used loops to generate all layers and put them together with the Sequential class. We'll add the logic to actually run the encoder and decoder to the class and finally get:

import torch.nn.functional as F


class AETune(nn.Module):
    def __init__(self, trial, input_dim, cfg):
        super(AETune, self).__init__()
        self.trial = trial          # Optuna trial object
        self.cfg = cfg              # configuration dict
        self.input_dim = input_dim  # number of features of the input data
        self.n_layers = self.trial.suggest_int('n_layers',
                                               cfg['LAYERS_LOWER_LIMIT'],
                                               cfg['LAYERS_UPPER_LIMIT'])
        print(self.n_layers)
        encoder_layers = []               # here we collect all layers of the encoder
        prev_dim_size = [self.input_dim]  # size of the previous layer, initialized with input_dim
        # ENCODER ------------------------------------------------------------
        for i in range(self.n_layers):
            prev_layer_size = prev_dim_size[-1]  # equals the number of input features in the first iteration
            # The trial object suggests the size of the next layer.
            # We make the next layer size dependent on the previous layer size:
            # the upper limit is 1/4 of the previous size, the lower limit 1/128.
            # These limits were chosen arbitrarily.
            # Lastly, we ensure that the layer size never gets below 4 (also arbitrary).
            lower = max(int(prev_layer_size / 128), 4)
            upper = max(int(prev_layer_size / 4), 4)
            next_layer_size = self.trial.suggest_int(
                f'n_units_l{i}', lower, upper)
            # We collect the layers in a list to later combine them into one Sequential module.
            encoder_layers.append(nn.Linear(prev_layer_size, next_layer_size))
            prev_dim_size.append(next_layer_size)  # reference for the size of the next layer
            # We don't want a ReLU or dropout layer right before the latent space.
            if not i + 1 == self.n_layers:
                encoder_layers.append(nn.ReLU())
                p = self.trial.suggest_float(f'dropout_l{i}',
                                             cfg['DROPOUT_LOWER_LIMIT'],
                                             cfg['DROPOUT_UPPER_LIMIT'])
                encoder_layers.append(nn.Dropout(p))
        self.encoder = nn.Sequential(*encoder_layers)
        # DECODER -------------------------------------------------------------
        decoder_layers = []  # again, we collect all layers to put them into a Sequential module later
        # We iterate over the encoder layers in reverse order.
        for layer in encoder_layers[::-1]:
            if not isinstance(layer, nn.Linear):
                # We skip non-linear layers because below we assume that each layer
                # has the attributes `out_features` and `in_features`.
                continue  # skipping dropout and activation layers
            if layer.in_features == self.input_dim:  # sigmoid for the last layer
                decoder_layers.append(
                    nn.Linear(layer.out_features, layer.in_features))
                decoder_layers.append(nn.Sigmoid())
                continue
            # Because of the x-shaped architecture of the autoencoder,
            # the in-features of an encoder layer are the out-features of the mirrored decoder layer and vice versa.
            decoder_layers.append(
                nn.Linear(layer.out_features, layer.in_features))
            decoder_layers.append(nn.ReLU())
            # `i` still holds its last value from the encoder loop,
            # so all decoder dropout layers share a single suggested rate.
            p = self.trial.suggest_float(f'dropout_l{i}',
                                         cfg['DROPOUT_LOWER_LIMIT'],
                                         cfg['DROPOUT_UPPER_LIMIT'])
            decoder_layers.append(nn.Dropout(p))
        self.decoder = nn.Sequential(*decoder_layers)

    def encode(self, x):
        latent = F.relu(x)
        return self.encoder(latent)

    def decode(self, x):
        return self.decoder(x)

    def forward(self, x):
        latent = self.encode(x)
        recon = self.decode(latent)
        # return None so we can use the same training loop as for the VAE
        return recon, latent, None
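
As a side note, if you want to smoke-test such a class without running a whole study, Optuna's FixedTrial accepts the same suggest_* calls with fixed values. The parameter values below are just an arbitrary example, not tuned in any way:

import optuna

# Minimal sketch: build the model with fixed "suggested" values for debugging
fixed = optuna.trial.FixedTrial({
    'n_layers': 2,
    'n_units_l0': 64,
    'n_units_l1': 16,
    'dropout_l0': 0.1,
    'dropout_l1': 0.1,
})
model = AETune(fixed, input_dim=500, cfg=cfg)
print(model)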

Before we get to the objective function, I want to extend the autoencoder class to a variational autoencoder. The code for this looks similar to the autoencoder above, so we won’t cover all the details, only the differences between them.

For the variational autoencoder, we implement the standard case of fitting the parameters of a normal distribution: the mean and the variance (we use the log variance). So the encoder ends in two layers of the same shape, one for the mean and one for the log variance.
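
As a quick refresher: the KL term that appears later in the objective function is the closed-form KL divergence between the approximate posterior N(μ, σ²) and the standard normal prior,

D_KL( N(μ, σ²) || N(0, 1) ) = -0.5 · Σ_j (1 + log σ_j² - μ_j² - σ_j²)

which is exactly what -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) computes in the training code.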

import torch
import torch.nn as nn


class VAETune(nn.Module):
    def __init__(self, trial, input_dim, cfg):
        super(VAETune, self).__init__()
        self.trial = trial          # Optuna trial object
        self.cfg = cfg              # configuration dict
        self.input_dim = input_dim  # number of features of the input data
        self.n_layers = self.trial.suggest_int('n_layers',
                                               cfg['LAYERS_LOWER_LIMIT'],
                                               cfg['LAYERS_UPPER_LIMIT'])
        print(self.n_layers)
        encoder_layers = []               # here we collect all layers of the encoder
        prev_dim_size = [self.input_dim]  # size of the previous layer, initialized with input_dim
        # ENCODER ------------------------------------------------------------
        for i in range(self.n_layers):
            prev_layer_size = prev_dim_size[-1]  # equals the number of input features in the first iteration
            # The trial object suggests the size of the next layer.
            # We make the next layer size dependent on the previous layer size:
            # the upper limit is 1/4 of the previous size, the lower limit 1/128.
            # These limits were chosen arbitrarily.
            # Lastly, we ensure that the layer size never gets below 4 (also arbitrary).
            lower = max(int(prev_layer_size / 128), 4)
            upper = max(int(prev_layer_size / 4), 4)
            next_layer_size = self.trial.suggest_int(
                f'n_units_l{i}', lower, upper)
            # We collect the layers in a list to later combine them into one Sequential module.
            encoder_layers.append(nn.Linear(prev_layer_size, next_layer_size))
            prev_dim_size.append(next_layer_size)  # reference for the size of the next layer
            # We don't want a ReLU or dropout layer right before the latent space.
            if not i + 1 == self.n_layers:
                encoder_layers.append(nn.ReLU())
                p = self.trial.suggest_float(f'dropout_l{i}',
                                             cfg['DROPOUT_LOWER_LIMIT'],
                                             cfg['DROPOUT_UPPER_LIMIT'])
                encoder_layers.append(nn.Dropout(p))
        # Split off the last suggested layer into two heads (mean and log variance for the VAE).
        last_linear = encoder_layers.pop()  # remove it from the list so it is not used twice in self.encode
        self.mu = last_linear
        # A separate Linear layer of the same shape for the log variance, so the two heads do not share weights.
        self.logvar = nn.Linear(last_linear.in_features, last_linear.out_features)
        encoder_layers.append(nn.ReLU())
        self.encoder = nn.Sequential(*encoder_layers)
        # For the decoder, append the last layer again to reverse-construct the encoder.
        encoder_layers.append(self.mu)
        # DECODER -------------------------------------------------------------
        decoder_layers = []  # again, we collect all layers to put them into a Sequential module later
        # We iterate over the encoder layers in reverse order.
        for layer in encoder_layers[::-1]:
            if not isinstance(layer, nn.Linear):
                # We skip non-linear layers because below we assume that each layer
                # has the attributes `out_features` and `in_features`.
                continue  # skipping dropout and activation layers
            if layer.in_features == self.input_dim:  # sigmoid for the last layer
                decoder_layers.append(
                    nn.Linear(layer.out_features, layer.in_features))
                decoder_layers.append(nn.Sigmoid())
                continue
            # Because of the x-shaped architecture of the autoencoder,
            # the in-features of an encoder layer are the out-features of the mirrored decoder layer and vice versa.
            decoder_layers.append(
                nn.Linear(layer.out_features, layer.in_features))
            decoder_layers.append(nn.ReLU())
            # `i` still holds its last value from the encoder loop,
            # so all decoder dropout layers share a single suggested rate.
            p = self.trial.suggest_float(f'dropout_l{i}',
                                         cfg['DROPOUT_LOWER_LIMIT'],
                                         cfg['DROPOUT_UPPER_LIMIT'])
            decoder_layers.append(nn.Dropout(p))
        self.decoder = nn.Sequential(*decoder_layers)

    def encode(self, x):
        latent = self.encoder(x)
        return self.mu(latent), self.logvar(latent)

    def decode(self, z):
        return self.decoder(z)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, self.input_dim))
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

Objective function

Now that we have the structure in place, we need to define the objective function and an Optuna study that runs all trials and finds the best one. First, we create the study and tell Optuna whether the loss should be minimized or maximized:

# standard optuna procedure
study = optuna.create_study(direction="minimize")

Next, we run the optimization step and lastly select the best trial:

study.optimize(
    lambda trial: objective(cfg, trial, dataloader, input_dim, model_name),
    n_trials=30
)
# get the best trial
best_trial = study.best_trial

We can pass the best_trial to the autoencoder classes defined above. But first, you may have noticed that we use a lambda expression to call the objective function. The objective function is basically a training-loop wrapper in which the trial object suggests the hyperparameters.

import optuna
import torch
import torch.optim as optim
import torch.nn.functional as F


def objective(cfg, trial, dataloader, input_dim, model_name):
    # Define the range of hyperparameters to search over
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD"])
    lr = trial.suggest_float("lr", cfg["LR_LOWER_LIMIT"], cfg["LR_UPPER_LIMIT"], log=False)
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.1)

    # Instantiate the model
    model = AETune(trial, input_dim, cfg) if model_name == 'AE' else VAETune(trial, input_dim, cfg)

    # Get the optimizer object based on the selected optimizer name
    optimizer = getattr(optim, optimizer_name)(model.parameters(), lr=lr, weight_decay=weight_decay)

    # Standard PyTorch training loop
    for epoch in range(cfg["EPOCHS"]):
        total_loss = 0.0

        for batch, _ in dataloader:
            # Forward pass
            reconstr_x, mu, logvar = model(batch)

            # Compute the batch loss (reconstruction loss plus KL term for the VAE)
            kl_loss = (-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())) if model_name == 'VAE' else torch.tensor(0)
            batch_loss = F.binary_cross_entropy(reconstr_x, batch.view(-1, input_dim), reduction='sum') + kl_loss

            # Perform backpropagation and an optimization step
            optimizer.zero_grad()
            batch_loss.backward()

            # Apply gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

            optimizer.step()

            # Accumulate the batch loss
            total_loss += batch_loss.item()

        # Report the total loss of the epoch to Optuna
        trial.report(total_loss, epoch)

        # Handle pruning based on the intermediate value
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return total_loss
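
One thing to keep in mind: trial.report and trial.should_prune only lead to pruning in combination with a pruner. Optuna uses a default pruner if you don't pass one (a median pruner at the time of writing), but you can also set it explicitly; the arguments below are only illustrative.

import optuna

# Minimal sketch: create the study with an explicit pruner so that
# trial.report() / trial.should_prune() can stop unpromising trials early.
pruner = optuna.pruners.MedianPruner(n_warmup_steps=5)
study = optuna.create_study(direction="minimize", pruner=pruner)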

Short practical example

With that, you hopefully have a concept of how to tune the architecture of your autoencoder model. Lastly, I’ll share a simple example. We will do the following:

1. Create a fake dataset

2. Tune the model architecture and other hyperparameters

3. Visualize the result

First, we define a function to generate a fake dataset. We use the function `make_blobs` for this, because we want to show that our autoencoder reduces the dimensionality of the data while keeping its essence.

In plain English: we want to see the blobs that we create show up in the latent space as well.

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split
# import make_blobs from sklearn
from sklearn.datasets import make_blobs


def create_fake_data(input_dim, n_samples):
    """Creates a fake dataset with scikit-learn's make_blobs and puts it into torch DataLoaders.
    ARGS:
        input_dim (int): number of features
        n_samples (int): number of samples
    RETURNS:
        validloader (torch.utils.data.DataLoader)
        testloader (torch.utils.data.DataLoader)
        trainloader (torch.utils.data.DataLoader)
    """
    X, y = make_blobs(n_samples=n_samples,
                      n_features=input_dim,
                      centers=5,
                      cluster_std=5,
                      random_state=42)
    # min-max scaling so the data lies in [0, 1] (matches the Sigmoid output and the BCE loss)
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # split into train, test and validation sets
    train_ratio = 0.6
    test_ratio = 0.2
    valid_ratio = 0.2
    dataset = TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(y))
    trainset, testset, validset = random_split(
        dataset,
        [int(n_samples * train_ratio), int(n_samples * test_ratio), int(n_samples * valid_ratio)])
    trainloader = DataLoader(trainset, batch_size=32)
    testloader = DataLoader(testset, batch_size=32)
    validloader = DataLoader(validset, batch_size=32)
    return validloader, testloader, trainloader

Next, we create the data and use the validation dataloader to tune the model architecture with Optuna.

input_dim = 500
n_samples = 10000
model_name = 'VAE'
validloader, testloader, trainloader = create_fake_data(input_dim, n_samples)
study = optuna.create_study(direction="minimize")
study.optimize(lambda trial: objective(cfg, trial, validloader, input_dim, model_name), n_trials=5)

Best Model

Let's take a look at the architecture that was found:

best_trial = study.best_trial
best_model = VAETune(best_trial, input_dim, cfg) if model_name == 'VAE' else AETune(best_trial, input_dim, cfg)
print(best_model)
### OUTPUT ### --------------------------------------------------------------
VAETune(
  (encoder): Sequential(
    (0): Linear(in_features=500, out_features=116, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.059983165269794206, inplace=False)
    (3): Linear(in_features=116, out_features=22, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.18001386479475678, inplace=False)
    (6): ReLU()
  )
  (mu): Linear(in_features=22, out_features=4, bias=True)
  (logvar): Linear(in_features=22, out_features=4, bias=True)
  (decoder): Sequential(
    (0): Linear(in_features=4, out_features=22, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.17589635676117815, inplace=False)
    (3): Linear(in_features=22, out_features=116, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.17589635676117815, inplace=False)
    (6): Linear(in_features=116, out_features=500, bias=True)
    (7): Sigmoid()
  )
)

We’ve found our model architecture, so we will use our training data to train the model weights:

import torch
import torch.optim as optim
import torch.nn.functional as F

# Train the model weights with the best hyperparameters and the training data
optimizer = getattr(optim, best_trial.params['optimizer'])(
    best_model.parameters(), lr=best_trial.params['lr'], weight_decay=0.05)
for epoch in range(100):
    best_model.train()
    loss = 0
    batchcount = 0
    for batch, _ in trainloader:
        batchcount += 1

        optimizer.zero_grad()
        recon, mu, logvar = best_model(batch)
        beta = 0.1  # weight of the KL term
        kl_loss = beta * (-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())) if model_name == 'VAE' else torch.tensor(0)
        batch_loss = F.binary_cross_entropy(recon, batch.view(-1, input_dim), reduction='sum') + kl_loss
        batch_loss.backward()
        torch.nn.utils.clip_grad_norm_(best_model.parameters(), 1)
        optimizer.step()
        loss += batch_loss.detach()
    if epoch % 20 == 0:
        print(f'loss: {loss}')
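
As a quick sanity check before looking at the latent space, you can evaluate the same loss on the held-out test set. This is just a minimal sketch reusing the loss from the training loop above:

# Minimal sketch: evaluate reconstruction (+ KL) loss on the test set
best_model.eval()  # switch off dropout for evaluation
test_loss = 0.0
with torch.no_grad():
    for batch, _ in testloader:
        recon, mu, logvar = best_model(batch)
        kl_loss = (-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())) if model_name == 'VAE' else torch.tensor(0)
        test_loss += (F.binary_cross_entropy(recon, batch.view(-1, input_dim), reduction='sum') + kl_loss).item()
print(f'test loss per sample: {test_loss / len(testloader.dataset)}')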

Visual proof

Lastly, we want to check whether the dimensionality reduction into the latent space preserves the structure of the data. To do this, we perform a PCA on the input data and on the latent space. Since we used `make_blobs` to generate the data, we should see the clusters.

If the dimensionality reduction preserves the information, we should see similar clusters when we perform PCA on the latent dimensions. In this example, we use the variational autoencoder, so we expect the latent space to be denser.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# PCA on the input data
pca = PCA(n_components=2)
pca_data = pca.fit_transform(testloader.dataset[:][0].detach().numpy())
# recover the blob labels of the test subset for coloring the scatterplot
labels = testloader.dataset.dataset[:][1][testloader.dataset.indices].detach().numpy()
print(pca_data.shape)
# visualize the PCA of the input data
fig, ax = plt.subplots()
ax.scatter(pca_data[:, 0], pca_data[:, 1], c=labels)
plt.show()

Figure 2: scatterplot of the first two principal components of the input data

# run the model on the test data and perform PCA on the latent space
best_model.eval()  # switch off dropout for inference
latent_space = best_model.encode(testloader.dataset[:][0])
pca = PCA(n_components=2)
pca_data = pca.fit_transform(latent_space[0].detach().numpy())  # [0] selects the mean vectors of the VAE
# visualize the PCA of the latent space
fig, ax = plt.subplots()
ax.scatter(pca_data[:, 0], pca_data[:, 1], c=labels)
plt.show()

Figure 3: scatterplot of the first two principal components of the latent space

We see that samples from the same cluster are still in spatial proximity to each other. This indicates that the information about cluster membership is still contained in the latent space.

Acknowledgments

When I first wanted to know how to tune autoencoders with Optuna, I found this article by Maksim Denisov very helpful. It did not cover a few things I needed, so I wrote this article. Hopefully, someone finds it helpful.
