Generative AI using Python


Module 1


Generative AI (GenAI) is the latest subtype of AI that broadly describes Machine Learning (ML) models or algorithms capable of generating new content.

Difference Between Traditional AI and Generative AI


Traditional AI −

  • It is used to create intelligent systems that can perform tasks which generally require human intelligence.
  • The purpose of AI algorithms or models is to mimic human intelligence across a wide range of applications.

Generative AI −

  • It generates new text, audio, video, or any other type of content by learning patterns from existing training data.
  • The purpose of generative AI algorithms or models is to generate new data having characteristics similar to the data in the original dataset.

Overview of Generative Adversarial Network


    GAN stands for Generative Adversarial Network, and it is a class of artificial intelligence algorithms used in machine learning and deep learning for generating data. GANs were introduced by Ian Goodfellow and his colleagues in 2014 and have since become a popular and powerful tool in various applications, including image generation, text generation, and more.




How does a GAN work?

GANs train by having two networks, the Generator (G) and the Discriminator (D), compete and improve together. Here's the step-by-step process:

1. Generator's First Move

The generator starts with a random noise vector (simply a set of random numbers). It uses this noise as a starting point to create a fake data sample, such as a generated image. The generator’s internal layers transform this noise into something that looks like real data.

2. Discriminator's Turn

The discriminator receives two types of data:

  • Real samples from the actual training dataset.
  • Fake samples created by the generator.

D's job is to analyze each input and determine whether it's real data or something G cooked up. It outputs a probability score between 0 and 1: a score near 1 means the data is likely real, while a score near 0 suggests it's fake.

3. Adversarial Learning

  • If the discriminator correctly classifies real and fake data, it gets better at its job.
  • If the generator fools the discriminator by creating realistic fake data, it receives a positive update and the discriminator is penalized for making a wrong decision.

4. Generator's Improvement

  • Each time the discriminator mistakes fake data for real, the generator learns from this success.
  • Through many iterations, the generator improves and creates more convincing fake samples.

5. Discriminator's Adaptation

  • The discriminator also learns continuously by updating itself to better spot fake data.
  • This constant back-and-forth makes both networks stronger over time.

6. Training Progression

  • As training continues, the generator becomes highly proficient at producing realistic data.
  • Eventually, the discriminator struggles to distinguish real from fake, which shows that the GAN has reached a well-trained state.
  • At this point, the generator can produce high-quality synthetic data that can be used for different applications. A minimal training-loop sketch follows.
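
The following is a minimal PyTorch sketch of this adversarial loop on a toy task (learning a 1-D Gaussian). The network sizes, learning rates, and step count are illustrative assumptions, not tuned values.

import torch
import torch.nn as nn

# Toy "real" data: samples from a 1-D Gaussian with mean 4 and std 1.25
def real_batch(n=64):
    return 4 + 1.25 * torch.randn(n, 1)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # Generator
D = nn.Sequential(nn.Linear(1, 16), nn.LeakyReLU(0.2),
                  nn.Linear(16, 1), nn.Sigmoid())                  # Discriminator

bce = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    # Discriminator's turn: label real samples 1 and fake samples 0
    real = real_batch()
    fake = G(torch.randn(64, 8)).detach()        # detach: don't update G here
    loss_D = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator's turn: try to make D assign probability 1 to fakes
    fake = G(torch.randn(64, 8))
    loss_G = bce(D(fake), torch.ones(64, 1))     # fool the discriminator
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

print(G(torch.randn(5, 8)))   # samples should cluster around 4 after training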

Discriminative vs Generative Models


What are Discriminative Models?

Discriminative models are ML models that concentrate on modeling the decision boundary between several classes of data using probability estimates and maximum likelihood. These models, mainly used for supervised learning, are also known as conditional models.

Discriminative models are not much affected by outliers. Although this makes them a better choice than generative models for many tasks, they can still suffer from misclassification problems, which is a notable drawback.

Popular Discriminative Models

Logistic Regression

Support Vector Machines

K-nearest Neighbor (KNN)

What are Generative Models?

Generative models are ML models that, as the name suggests, aim to capture the underlying distribution of data and generate new data comparable to the original training data. These models, mainly used for unsupervised learning, are categorized as a class of statistical models capable of generating new data instances.

The only drawback of generative models, when compared to discriminative models, is that they are prone to outliers.

Popular Generative Models

Bayesian Network

Generative Adversarial Network (GAN)

Variational Autoencoders (VAEs)

Autoregressive models, Naïve Bayes, Markov random fields, Hidden Markov Models (HMM), and Latent Dirichlet Allocation (LDA) are a few other examples of commonly used generative models.

Difference Between Discriminative and Generative Models


Objective − Discriminative models focus on learning the boundary between different classes directly from the data; their primary objective is to classify input data accurately based on the learned decision boundary. Generative models aim to understand the underlying data distribution and generate new data points that resemble the training data; they model the process of data generation, which allows them to create synthetic data instances.

Probability Distribution − Discriminative models estimate the parameters of the conditional probability P(Y|X) directly from the training dataset. Generative models model the joint distribution P(X, Y) and calculate the posterior probability P(Y|X) using Bayes' theorem.

Handling Outliers − Discriminative models are relatively robust to outliers; generative models are prone to outliers.

Property − Discriminative models do not possess generative properties; generative models also possess discriminative properties.

Applications − Discriminative models are commonly used in classification tasks, such as image recognition and sentiment analysis. Generative models are commonly used in tasks like data generation, anomaly detection, and data augmentation, beyond traditional classification tasks.

Examples − Discriminative: Logistic Regression, Support Vector Machines, Decision Trees, neural networks, etc. Generative: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Naïve Bayes, etc.


The Role of Probability Distribution in Generative Models


What is Probability Distribution?

Probability Distribution is a mathematical function that represents the probability of different possible values of a random variable within a given range.

A probability distribution is a theoretical counterpart of a frequency distribution (FD). In statistics, an FD describes the number of occurrences of a variable in a dataset; a probability distribution, in addition to describing the frequencies of occurrences, also assigns probabilities to them.

Types of Probability Distributions

There are two types of probability distributions −

  • Discrete Probability Distributions
  • Continuous Probability Distributions

Discrete Probability Distributions

Discrete probability distributions are mathematical functions that describe the probabilities of different outcomes of a discrete or categorical random variable.

A discrete probability distribution includes only those values with a possible (non-zero) probability. In simple words, it does not include any value with zero probability. For example, 5.5 is not a possible outcome of a dice roll, hence it is not included in the probability distribution of dice rolls.

The total of the probabilities of all possible values in a discrete probability distribution is always one.

Common discrete probability distributions −

  • Bernoulli Distribution − Describes the probability of success (1) or failure (0) in a single experiment. Example: the outcome of a single coin flip.
  • Binomial Distribution − Models the number of successes in a fixed number of trials n, each with success probability p. Example: the number of heads when you toss a coin 10 times.
  • Poisson Distribution − Predicts the number of events k occurring in a fixed interval of time or space. Example: the number of email messages received per day.
  • Geometric Distribution − Represents the number of trials needed to achieve the first success in a sequence of trials. Example: the number of times a coin is flipped until it lands on heads.
  • Hypergeometric Distribution − Calculates the probability of drawing a specific number of successes from a finite population. Example: the number of red balls drawn from a bag of mixed colored balls.

Continuous Probability Distributions

Continuous probability distributions are mathematical functions that describe the probabilities of different occurrences within a continuous range of values.

Such a distribution covers an infinite number of possible values. For example, the interval [4, 5] contains infinitely many values between 4 and 5.

Common continuous probability distributions −

  • Continuous Uniform Distribution − Assigns equal probability to all values within a given interval. Example: the height of a person, assumed equally likely anywhere between 5 and 6 feet.
  • Normal (Gaussian) Distribution − Forms a bell-shaped curve, describing data clustered around the mean with symmetrical tails. Example: IQ scores.
  • Exponential Distribution − Models the time between events in a Poisson process, where events occur at a constant rate. Example: the time until the next customer arrives.
  • Log-normal Distribution − Represents right-skewed data whose logarithm is normally distributed. Example: stock prices, income distributions, etc.
  • Beta Distribution − Describes random variables constrained to a finite interval; often used in Bayesian statistics. Example: the probability of success in a binomial trial.
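
As a quick illustration, both kinds of distribution can be represented with PyTorch's torch.distributions module; the parameters below (coin bias, email rate, IQ mean and standard deviation, and so on) are arbitrary example values.

import torch
from torch import distributions as dist

coin = dist.Bernoulli(probs=0.5)                   # discrete: one coin flip
tosses = dist.Binomial(total_count=10, probs=0.5)  # discrete: heads in 10 tosses
emails = dist.Poisson(rate=20.0)                   # discrete: emails per day
iq = dist.Normal(loc=100.0, scale=15.0)            # continuous: IQ scores
wait = dist.Exponential(rate=0.5)                  # continuous: time to next arrival

print(coin.sample((5,)))   # e.g. tensor([1., 0., 1., 1., 0.])
print(iq.sample((3,)))     # three draws clustered around 100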

Use of Probability Distributions in Generative Modeling

Probability distributions play a crucial role in generative modeling. 

  • Data Distribution − Generative models aim to capture the underlying probability distribution of the data from which the samples are taken.
  • Generating New Samples − Once the data distribution is understood, generative models can generate new data comparable to the original dataset.
  • Evaluation and Training − Probability distributions are used to evaluate and train generative models. Evaluation metrics such as likelihood, perplexity, and Wasserstein distance are used to assess the quality of generated samples compared to the original dataset.
  • Variability and Uncertainty − Probability distributions capture the variability and uncertainty present in the data. Generative models can use this information to generate distinct and realistic samples. The sketch below illustrates these roles on the simplest possible model.
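
A minimal sketch of these ideas, under the strong assumption that the data really is Gaussian: estimate the distribution from samples, draw new synthetic samples from it, and score held-out points by log-likelihood.

import torch
from torch import distributions as dist

data = 4 + 1.25 * torch.randn(1000)          # observed training samples

# Data Distribution: estimate the parameters of the underlying distribution
model = dist.Normal(loc=data.mean(), scale=data.std())

# Generating New Samples: draw synthetic data resembling the original
new_samples = model.sample((10,))

# Evaluation: average log-likelihood of held-out points under the model
held_out = 4 + 1.25 * torch.randn(100)
print(model.log_prob(held_out).mean())       # higher means a better fit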

Introduction to PyTorch framework for deep learning

PyTorch is an open-source machine learning library for Python. It is used for applications such as natural language processing.

Features

The major features of PyTorch are mentioned below −

Easy Interface − PyTorch offers an easy-to-use API; hence it is considered very simple to operate and runs on Python. Code execution in this framework is quite straightforward.

Python usage − This library is considered Pythonic and smoothly integrates with the Python data science stack. Thus, it can leverage all the services and functionality offered by the Python environment.

Computational graphs − PyTorch provides an excellent platform which offers dynamic computational graphs. Thus a user can change them during runtime. This is highly useful when a developer has no idea of how much memory is required for creating a neural network model.

PyTorch is known for having three levels of abstraction as given below −

  • Tensor − Imperative n-dimensional array which runs on GPU.
  • Variable − Node in computational graph. This stores data and gradient.
  • Module − Neural network layer which will store state or learnable weights. (A short sketch of all three abstractions follows this list.)
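
A short sketch of these three abstractions in code; note that in modern PyTorch the Variable API has been merged into Tensor, so gradient tracking is shown with requires_grad.

import torch
import torch.nn as nn

# Tensor: imperative n-dimensional array; moves to the GPU if one is available
x = torch.randn(3, 4)
if torch.cuda.is_available():
    x = x.to("cuda")

# Gradient tracking (the role the old Variable played): stores data and gradient
w = torch.randn(4, 2, requires_grad=True)
loss = (torch.randn(3, 4) @ w).sum()
loss.backward()            # w.grad now holds d(loss)/dw

# Module: a neural network layer that stores learnable weights
layer = nn.Linear(4, 2)
print(layer.weight.shape)  # torch.Size([2, 4])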

The following are the advantages of PyTorch −

  • It is easy to debug and understand the code.
  • It includes many of the same layers as Torch.
  • It includes a lot of loss functions.
  • It can be considered a NumPy extension for GPUs.
  • It allows building networks whose structure depends on the computation itself.
Pytorch - Implementing First Neural Network

We will create a simple neural network with one hidden layer and a single output unit.

Step 1

Import the PyTorch library using the commands below −

import torch 
import torch.nn as nn

Step 2

Define all the layers and the batch size to start executing the neural network as shown below −

# Defining input size, hidden layer size, output size and batch size respectively

n_in, n_h, n_out, batch_size = 10, 5, 1, 10

Step 3

Since the neural network maps input data to the corresponding output data, we create dummy input and target tensors as shown below −

# Create dummy input and target tensors (data)
x = torch.randn(batch_size, n_in)
y = torch.tensor([[1.0], [0.0], [0.0], [1.0], [1.0], [1.0], [0.0], [0.0], [1.0], [1.0]])

Step 4

Create a sequential model with the help of in-built functions. Using the below lines of code, create a sequential model −

# Create a model

model = nn.Sequential(nn.Linear(n_in, n_h), nn.ReLU(), nn.Linear(n_h, n_out), nn.Sigmoid())

Step 5

Construct the loss function and the optimizer (gradient descent) as shown below −

# Construct the loss function
criterion = torch.nn.MSELoss()

# Construct the optimizer (Stochastic Gradient Descent in this case)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

Step 6

Implement gradient descent with an iterative training loop using the lines of code below −

# Gradient Descent

for epoch in range(50):
    # Forward pass: compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print('epoch: ', epoch, ' loss: ', loss.item())

    # Zero gradients, perform a backward pass, and update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Step 7

The output generated is as follows −

epoch: 0 loss: 0.2545787990093231
epoch: 1 loss: 0.2545052170753479
epoch: 2 loss: 0.254431813955307
epoch: 3 loss: 0.25435858964920044
epoch: 4 loss: 0.2542854845523834
epoch: 5 loss: 0.25421255826950073
epoch: 6 loss: 0.25413978099823
epoch: 7 loss: 0.25406715273857117
epoch: 8 loss: 0.2539947032928467
epoch: 9 loss: 0.25392240285873413
epoch: 10 loss: 0.25385022163391113
epoch: 11 loss: 0.25377824902534485
....


Module 2

Architecture of GAN

GANs consist of two main models that work together to create realistic synthetic data, as follows:

1. Generator Model

The generator is a deep neural network that takes random noise as input to generate realistic data samples like images or text. It learns the underlying data patterns by adjusting its internal parameters during training through backpropagation. Its objective is to produce samples that the discriminator classifies as real.

The Role of Generator in GAN Architecture

The first primary part of GAN architecture is the Generator. 

Generator: Function and Structure

The primary goal of the generator is to generate new data samples that are intended to resemble real data from the dataset. It begins with a random noise vector and transforms it through a series of layers, such as fully connected (Dense) or convolutional layers, to generate a synthetic data sample.

Generator: Layers and Components

Listed below are the layers and components of the generator neural network −

  • Input Layer − The generator receives a low-dimensional random noise vector as input.
  • Fully Connected Layers − These are used to increase the dimensionality of the input noise vector.
  • Transposed Convolutional Layers − These layers, also known as deconvolutional layers, are used for upsampling, i.e., to generate an output feature map with greater spatial dimensions than the input feature map.
  • Activation Functions − Two commonly used activation functions are Leaky ReLU and Tanh. The Leaky ReLU activation function helps mitigate the dying ReLU problem, while the Tanh activation function ensures that the output stays within a specific range.
  • Output Layer − It produces the final data output like an image of a certain resolution.

Generator Loss Function: The generator tries to minimize this loss:

                            J_G = -\frac{1}{m} \sum_{i=1}^{m} \log D(G(z_i))

where

  • J_G measures how well the generator is fooling the discriminator.
  • G(z_i) is the generated sample from random noise z_i.
  • D(G(z_i)) is the discriminator’s estimated probability that the generated sample is real.

The generator aims to maximize D(G(z_i)), meaning it wants the discriminator to classify its fake data as real (probability close to 1).

The goal of the generator network is to create data that the discriminator cannot distinguish from real data. This is achieved by minimizing the generator's loss function, as the short snippet below shows.
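
In PyTorch, this generator objective is commonly implemented with binary cross-entropy against "real" labels; a minimal sketch, assuming the discriminator ends in a Sigmoid so its outputs are probabilities (a random placeholder stands in for D(G(z)) here).

import torch
import torch.nn as nn

bce = nn.BCELoss()
d_fake = torch.rand(16, 1)                       # placeholder for D(G(z))
loss_G = bce(d_fake, torch.ones_like(d_fake))    # equals -(1/m) Σ log D(G(z_i))
print(loss_G)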


2. Discriminator Model

The discriminator acts as a binary classifier that distinguishes between real and generated data. It learns to improve its classification ability through training, refining its parameters to detect fake samples more accurately. When dealing with image data, the discriminator uses convolutional layers or other relevant architectures that help extract features and enhance the model’s ability.

The Role of Discriminator in GAN Architecture

The second part of GAN architecture is the Discriminator. 

Discriminator: Function and Structure

The primary goal of the discriminator is to classify the input data as real or generated by the generator. It takes a data sample as input and gives a probability as output that indicates whether the sample is real or fake.

Discriminator: Layers and Components

Listed below are the layers and components of the discriminator neural network −

  • Input Layer − The discriminator receives a data sample from either the real dataset or the generator as input.
  • Convolutional Layers − These are used for downsampling the input data to extract relevant features.
  • Fully Connected Layers − These are used to process the extracted features and make the final classification.
  • Activation Functions − The discriminator uses the Leaky ReLU activation function to address the vanishing gradient problem and introduce non-linearity.
  • Output Layer − As the name implies, it gives a single probability value between 0 and 1 as output, indicating whether the sample is real or fake.

Discriminator Loss Function: The discriminator tries to minimize this loss:

J_D = -\frac{1}{m} \sum_{i=1}^{m} \log D(x_i) - \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z_i)))

  • J_D measures how well the discriminator classifies real and fake samples.
  • x_i is a real data sample.
  • G(z_i) is a fake sample from the generator.
  • D(x_i) is the discriminator’s probability that x_i is real.
  • D(G(z_i)) is the discriminator’s probability that the fake sample is real.

The discriminator wants to correctly classify real data as real (maximize log D(x_i)) and fake data as fake (maximize log(1 - D(G(z_i)))).

The goal of the discriminator network is to maximize its ability to correctly distinguish real data from generated data. This is achieved by minimizing the discriminator's loss function, as the snippet below shows.
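
The same BCE trick implements the discriminator loss: real samples are compared against a target of 1 and fake samples against a target of 0 (random placeholders again stand in for the discriminator's outputs).

import torch
import torch.nn as nn

bce = nn.BCELoss()
d_real = torch.rand(16, 1)   # placeholder for D(x) on real samples
d_fake = torch.rand(16, 1)   # placeholder for D(G(z)) on fake samples

# J_D = -(1/m) Σ log D(x_i) - (1/m) Σ log(1 - D(G(z_i)))
loss_D = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
print(loss_D)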

MinMax Loss

GANs are trained using a MinMax Loss between the generator and discriminator:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

where,

  • G is the generator network and D is the discriminator network
  • p_data(x) = true data distribution
  • p_z(z) = distribution of random noise (usually normal or uniform)
  • D(x) = discriminator’s estimate that real data is real
  • D(G(z)) = discriminator’s estimate that generated data is real

The generator tries to minimize this loss (to fool the discriminator) and the discriminator tries to maximize it (to detect fakes accurately).

Types of GANs

There are several types of GANs each designed for different purposes. Here are some important types:

1. Deep Convolutional GAN (DCGAN)

Deep Convolutional GANs (DCGANs) are among the most popular types of GANs used for image generation.

They are important because they:

  • They use Convolutional Neural Networks (CNNs) instead of simple multi-layer perceptrons (MLPs).
  • Max pooling layers are replaced with strided convolutions, which makes the model more efficient.
  • Fully connected layers are removed, which allows for better spatial understanding of images.

DCGANs are successful because they generate high-quality, realistic images.

Need for DCGANs:

    DCGANs were introduced to reduce the problem of mode collapse. Mode collapse occurs when the generator becomes biased towards a few outputs and cannot produce outputs covering every variation in the dataset. For example, take the case of the MNIST digits dataset (digits from 0 to 9): we want the generator to generate all types of digits, but sometimes the generator becomes biased towards two or three digits and produces only those. Because of that, the discriminator also gets optimized towards those particular digits only, and this state is known as mode collapse. This problem can be mitigated by using DCGANs.
    
    The generator of the DCGAN architecture takes a 100-dimensional noise vector, sampled from a normal distribution, as input. First it projects and reshapes the input to 4x4x1024, then performs a fractionally-strided convolution 4 times with a stride of 1/2 (meaning that each time it is applied, the image dimensions double while the number of output channels is reduced). The generated output has dimensions of (64, 64, 3). The generator also incorporates architectural changes such as the removal of all fully connected layers and the use of Batch Normalization, which helps in stabilizing training. The ReLU activation function is used in all layers of the generator except the output layer, which uses Tanh.
    The role of the discriminator is to determine whether an image comes from the real dataset or from the generator. It can be designed much like a convolutional neural network performing an image classification task. Instead of fully connected layers, it uses only strided convolutions with LeakyReLU as the activation function. The input of the discriminator is a single image, either from the dataset or generated, and the output is a score that determines whether the image is real or generated. A sketch of the generator follows.
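
A sketch of the DCGAN generator just described, assuming a 100-dimensional noise input and a 64x64x3 output; the channel progression follows the common DCGAN recipe but is illustrative rather than prescriptive.

import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, nz=100):
        super().__init__()
        self.net = nn.Sequential(
            # Project the noise vector to a 4x4x1024 feature map
            nn.ConvTranspose2d(nz, 1024, kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(1024), nn.ReLU(True),
            # Each block doubles the spatial size and halves the channels:
            # 4 -> 8 -> 16 -> 32 -> 64
            nn.ConvTranspose2d(1024, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 3, 4, 2, 1), nn.Tanh(),   # output in [-1, 1]
        )

    def forward(self, z):                        # z has shape (batch, nz, 1, 1)
        return self.net(z)

g = DCGANGenerator()
print(g(torch.randn(2, 100, 1, 1)).shape)        # torch.Size([2, 3, 64, 64])
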
2. Wasserstein GAN (WGANs):
    Wasserstein Generative Adversarial Network (WGAN) is a variation of the GAN with a slight modification to the algorithm. Martin Arjovsky, Soumith Chintala, and Léon Bottou developed this network in 2017. It is widely used to produce realistic images.
    WGAN's architecture uses deep neural networks for both the generator and the discriminator (called the critic). The key differences between GANs and WGANs are the loss function and, in later variants, the gradient penalty. WGANs were introduced as a solution to mode collapse issues.

WGAN architecture


WGANs use the Wasserstein distance, which provides a more meaningful and smoother measure of distance between distributions.

W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert]

  • γ(x, y) denotes the mass transported from x to y in order to transform the distribution P_r into P_g.
  • Π(P_r, P_g) denotes the set of all joint distributions γ(x, y) whose marginals are respectively P_r and P_g.

Benefits of WGAN algorithm over GAN

  • WGAN is more stable due to the Wasserstein distance, which is continuous and differentiable everywhere, allowing gradient descent to be performed.
  • It allows the critic to be trained to optimality.
  • It shows far less evidence of mode collapse.
  • It does not get stuck in local minima during gradient descent.
  • WGANs provide more flexibility in the choice of network architectures; with weight clipping in place, generator architectures can be changed freely. A minimal sketch of the critic update follows.
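
A minimal sketch of the original WGAN training rules: the critic has no Sigmoid, its loss is the (negated) difference of mean scores, and weights are clipped after each update. The clip value of 0.01 and RMSprop follow the 2017 paper; the tiny networks and placeholder batches are assumptions.

import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(1, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))
opt_C = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

real = 4 + torch.randn(64, 1)      # placeholder batch of real samples
fake = torch.randn(64, 1)          # placeholder batch of generator outputs

# Critic maximizes E[D(real)] - E[D(fake)], so we minimize the negative
loss_C = -(critic(real).mean() - critic(fake).mean())
opt_C.zero_grad(); loss_C.backward(); opt_C.step()

# Weight clipping keeps the critic approximately Lipschitz-constrained
for p in critic.parameters():
    p.data.clamp_(-0.01, 0.01)

# The generator would minimize -E[D(fake)] on a fresh batch of fakes
loss_G = -critic(torch.randn(64, 1)).mean()
print(loss_C.item(), loss_G.item())
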
3. Conditional GAN (CGANs):

Conditional GAN (cGAN) extends the GAN framework by incorporating conditioning information, such as class labels, attributes, or even other data samples, into both the generator and the discriminator networks.

With the help of this conditioning information, Conditional GANs give us control over the characteristics of the generated output.

Architecture of Conditional GANs

Like traditional GANs, the architecture of a Conditional GAN consists of two main components: a generator network and a discriminator network.

The only difference is that in Conditional GANs, both the generator network and the discriminator network receive additional conditioning information y along with their respective inputs.

The Generator Network

The generator network takes two inputs: a random noise vector sampled from a predefined distribution, and the conditioning information "y". It transforms these into synthetic data samples. The goal of the generator is to produce data that not only resembles real data but also aligns with the provided conditioning information.

The Discriminator Network

The discriminator network receives both real data samples and fake samples generated by the generator, along with the conditioning information "y".

The goal of the discriminator network is to evaluate the input data and distinguish real data samples from the dataset from fake data samples produced by the generator, while taking the provided conditioning information into account.

Conditional Information

Conditional information, often denoted by "y", is additional information provided to both the generator network and the discriminator network to condition the generation process. Depending on the application and the required control over the generated output, conditional information can take various forms.

Types of Conditional Information

Some of the common types of conditional information are as follows (a code sketch follows this list) −

  • Class Labels − In image classification tasks, conditional information "y" may represent the class labels corresponding to different categories. For example, in handwritten digits dataset, "y" could indicate the digit class (0-9) that the generator network should produce.
  • Attributes − In image generation tasks, conditional information "y" may represent specific attributes or features of the desired output, such as the color of objects, the style of clothing, or the pose of a person.
  • Textual Descriptions − For text-to-image synthesis tasks, conditional information "y" may consist of textual descriptions or captions describing the desired characteristics of the generated image.
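
A sketch of how class-label conditioning is commonly wired in: embed the label y and concatenate it with the noise vector before the generator's layers. The layer sizes, embedding width, and the flat 784-dimensional output (e.g. a 28x28 image) are illustrative assumptions.

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, nz=100, n_classes=10, out_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, 32)   # turn label y into a vector
        self.net = nn.Sequential(
            nn.Linear(nz + 32, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),
        )

    def forward(self, z, y):
        # Condition the generation by concatenating the label embedding with z
        return self.net(torch.cat([z, self.embed(y)], dim=1))

g = ConditionalGenerator()
z = torch.randn(4, 100)
y = torch.tensor([0, 3, 3, 7])       # requested digit classes
print(g(z, y).shape)                 # torch.Size([4, 784])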

Applications of Conditional GANs

Listed below are some of the fields where Conditional GANs find their applications −

Image-to-Image Translation

Conditional GANs are well suited for tasks that translate images from one domain to another, such as converting satellite images to maps, transforming sketches into realistic images, or converting day-time scenes to night-time scenes.

Semantic Image Synthesis

Conditional GANs can condition on semantic labels, hence they can generate realistic images based on textual descriptions or semantic layouts.

Super-Resolution and Inpainting

Conditional GANs can also be used for image super-resolution tasks in which low-resolution images are transformed into similar high-resolution images. They can also be used for inpainting tasks in which, based on contextual information, missing parts of an image are filled in.

Style Transfer and Editing

Conditional GANs allow us to manipulate specific attributes like color, texture, or artistic style while preserving other aspects of the image.

Challenges in using Conditional GANs

Conditional GANs offer significant advances in generative modeling, but they also pose some challenges. Let's see what kinds of challenges you may face while using Conditional GANs −

Mode Collapse

Like traditional GANs, Conditional GANs can also experience mode collapse. In mode collapse, the generator learns to produce limited varieties of samples and fails to capture the entire data distribution.

Conditioning Information Quality

The effectiveness of Conditional GANs depends on the quality and relevance of the provided conditioning information. Noisy or irrelevant conditioning information can lead to poor generation outputs.

Training Instability

The training instability issues observed in traditional GANs can also affect Conditional GANs. To avoid this, cGANs require careful architecture design and training approaches.

Scalability

As the complexity of the conditioning information increases, Conditional GANs become more difficult to handle and require more computational resources.

Evaluation Metrics for GANs

    Evaluating the output of a Generative Adversarial Network isn't as straightforward as calculating accuracy or loss in supervised learning. Since the generator's goal is to produce realistic and diverse samples mimicking a target distribution, we need metrics that assess both the quality (fidelity) of individual generated images and the variety (diversity) of the entire generated set. Simply looking at samples can be subjective and doesn't scale well, while the generator and discriminator losses during training often don't correlate strongly with the perceived quality of the final output. Therefore, specialized quantitative metrics are necessary to provide objective comparisons between different GAN models or training checkpoints.

    The core challenge lies in comparing probability distributions: the distribution of real data, p_data, and the distribution implicitly defined by the generator, p_g. We want to measure how "close" p_g is to p_data.

Two prominent metrics have emerged as standards in the field: 

1. Inception Score (IS) and 

2. Fréchet Inception Distance (FID). 

Inception Score (IS)

The Inception Score aims to capture both fidelity and diversity using a pre-trained image classification model, typically Inception V3 trained on ImageNet. The intuition is twofold:

  1. Fidelity: Images generated by a good GAN should be clearly recognizable and contain meaningful objects. When passed through the Inception classifier, the conditional probability distribution p(y|x) (the probability of image x belonging to class y) should have low entropy. This means the classifier is confident about assigning the image to a specific class.
  2. Diversity: The generator should produce images covering a wide variety of classes present in the dataset. Therefore, the marginal probability distribution p(y) = ∫ p(y|x) p_g(x) dx (the overall distribution of classes across all generated images) should have high entropy. This indicates that the generator isn't stuck producing images of only a few classes (mode collapse).

These two ideas are combined using the Kullback-Leibler (KL) divergence between the conditional and marginal distributions, averaged over all generated samples x ~ p_g:

IS = \exp\left(\mathbb{E}_{x \sim p_g}\left[D_{KL}\big(p(y|x) \,\|\, p(y)\big)\right]\right)

A higher Inception Score is generally considered better. However, IS has limitations. It primarily measures whether generated images look like any of the ImageNet classes, not necessarily the specific classes in the target dataset if it's different from ImageNet. It also doesn't directly compare the generated images to real images from the target distribution and can be susceptible to adversarial examples within classes. Furthermore, it has been shown that IS doesn't always correlate well with human perception of image quality, especially regarding diversity within a class.
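
Given a matrix of classifier softmax outputs p(y|x) for a set of generated images, the score follows directly from the formula. A minimal sketch, using random placeholder predictions instead of real Inception V3 outputs:

import numpy as np

def inception_score(p_yx, eps=1e-12):
    # p_yx: (n_images, n_classes) softmax outputs of a pre-trained classifier
    p_y = p_yx.mean(axis=0, keepdims=True)                   # marginal p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                          # exp of mean KL

p_yx = np.random.dirichlet(np.ones(1000), size=5000)         # placeholder p(y|x)
print(inception_score(p_yx))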

Fréchet Inception Distance (FID)

The Fréchet Inception Distance has become a more popular and widely adopted metric because it addresses some of the shortcomings of the IS. FID compares the statistics of generated images directly to the statistics of real images from the target dataset. It operates in the feature space of a pre-trained Inception V3 model.

Here's how FID is calculated:

  1. Feature Extraction: Select a specific layer from the pre-trained Inception V3 network (commonly the final average pooling layer before the classification head). Pass a large number of real images (x_r) and generated images (x_g) through the network up to this layer to obtain feature vectors for each image.

  2. Distribution Modeling: Assume the extracted feature vectors for the real images and the generated images follow multivariate Gaussian distributions. Calculate the mean vectors (μ_r, μ_g) and the covariance matrices (Σ_r, Σ_g) for the feature vectors of the real and generated sets, respectively.

  3. Distance Calculation: Compute the Fréchet distance (also known as the Wasserstein-2 distance for Gaussian distributions) between the two modeled distributions N(μ_r, Σ_r) and N(μ_g, Σ_g). The formula is:

    FID = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)

    Here, ‖μ_r - μ_g‖₂² denotes the squared Euclidean distance between the mean vectors, Tr is the trace of a matrix (the sum of its diagonal elements), and (Σ_r Σ_g)^{1/2} is the matrix square root of the product of the covariance matrices.

A lower FID score indicates that the statistics of the generated image features are more similar to the statistics of the real image features, implying that the generated distribution p_g is closer to the real data distribution p_data. Lower FID generally corresponds to better image quality and diversity.

    FID is more robust to noise than IS, is sensitive to mode collapse (since collapse affects both the mean and the covariance), and correlates better with human judgment of image quality. However, it requires a significant number of samples (typically 10,000 to 50,000) from both the real and generated distributions to reliably estimate the means and covariance matrices, and its computation is more intensive than IS.
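
Given two sets of Inception feature vectors, FID follows the formula directly; scipy.linalg.sqrtm provides the matrix square root. The random arrays below are placeholders for real extracted features.

import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):            # drop tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(sigma_r + sigma_g - 2 * covmean))

real = np.random.randn(2048, 64)            # placeholder feature vectors
gen = np.random.randn(2048, 64) + 0.1
print(fid(real, gen))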

Other Metrics and Considerations

  • Precision and Recall for Distributions: These metrics adapt concepts from information retrieval to GAN evaluation. Precision measures the fraction of generated samples that are considered realistic (fidelity), while Recall measures the fraction of real samples that the generator can produce (diversity).
  • Perceptual Path Length (PPL): Used primarily for style-based generators (like StyleGAN), PPL measures the smoothness of the generator's latent space. Small changes in the latent input vector should ideally lead to small, perceptually smooth changes in the output image.

Module 3

Introduction to Autoencoders

    Autoencoders are an essential tool in the field of machine learning and deep learning. They are a special type of unsupervised feedforward neural network designed to learn efficient representations of data for the purposes of dimensionality reduction, feature extraction, and generating new data.
    An autoencoder consists of two components: an encoder network and a decoder network. The encoder network works as a compression unit that compresses the input data into a lower-dimensional representation. The decoder network, on the other hand, decompresses the compressed input data by reconstructing it.

What are Autoencoders?
    Autoencoders, designed for unsupervised learning, are a class of artificial neural networks. Like any other neural network, an autoencoder consists of three different types of layers: input, hidden, and output. The number of input units in the input layer is exactly equal to the number of output units in the output layer, but the middle layer, i.e., the hidden layer, has fewer units than the input and output layers.

It first compresses the input data into a lower-dimensional representation. Since the hidden layer has fewer units, it holds this lower-dimensional representation. Finally, at the output layer, the output is rebuilt from this reduced representation of the input.

Autoencoders are also called self-supervised ML models because they are trained like supervised ML models, but when deployed they work as unsupervised ML models.

Architecture of Autoencoders
The core architecture of an autoencoder is divided into an encoder, a decoder, and a bottleneck layer −


  • Encoder − The encoder is a fully connected feedforward neural network (FFNN) that compresses the input data into a lower-dimensional representation.
  • Bottleneck layer − The bottleneck layer contains the lower-dimensional representation of the input, which is fed into the decoder.
  • Decoder − The decoder is a fully connected feedforward neural network (FFNN) that reconstructs the input back to its original dimensions.
Working of Autoencoder
    The principle behind an autoencoder is to train the neural network to reconstruct its input data from a lower-dimensional representation. This involves two main components: the encoder network and the decoder network.

The Encoder Network
The encoder network compresses the input into a lower-dimensional representation. This process involves the following steps −

  • Input Layer − The input data is fed into the network through the input layer.
  • Hidden Layers − The input data now passes through several hidden layers where each layer first applies a linear transformation and then a non-linear activation function. Each layer has fewer neurons than the previous one which gradually reduces the dimensionality of the input data.
  • Bottleneck Layer (Latent Space Representation) − Bottleneck layer, the final layer of the encoder network, stores the compressed representation of the input. This layer helps the network to learn the most essential features of the input because it has a much lower dimensionality than the input data.
The Decoder Network
The decoder network reconstructs the original input data from the lower-dimensional representation. This process is essentially the reverse of the encoding process. It involves the following steps −

  • Bottleneck Layer (Latent Space Representation) − The compressed data stored by the bottleneck layer is used as the input for the decoder network.
  • Hidden Layers − The data now passes through several hidden layers, where each layer first applies a linear transformation and then a non-linear activation function. Each layer has more neurons than the previous one, gradually expanding the dimensionality back to the original input size.
  • Output Layer − Output layer, the final layer of the decoder network, reconstructs the data to match the original input dimensions.
The Training Process
    Training the network to reconstruct its input data from a lower-dimensional representation involves the steps given below −

  • Initialization − First, the weights of the network are initialized randomly.
  • Forward Propagation − The input data is passed through the encoder to compress it to lower dimensions and then through the decoder to reconstruct the original input.
  • Loss Calculation − The loss function measures the difference between the original input data and its reconstructed output. Common loss functions are Mean Squared Error (MSE) for continuous data and Binary Cross-Entropy for binary data.
  • Backward Propagation − To minimize the loss function, the network adjusts its weights using gradient descent or another optimization algorithm. These steps map directly onto the sketch below.
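
A minimal PyTorch sketch of this training loop; the 784-dimensional input (e.g. flattened 28x28 images), layer sizes, and the random placeholder batch are illustrative assumptions.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
model = nn.Sequential(encoder, decoder)      # weights initialized randomly

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                      # placeholder input batch

for epoch in range(10):
    x_hat = model(x)                         # forward: encode, then decode
    loss = criterion(x_hat, x)               # reconstruction error vs. input
    optimizer.zero_grad()
    loss.backward()                          # backward propagation
    optimizer.step()                         # adjust weights to reduce loss
    print('epoch:', epoch, 'loss:', loss.item())
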
Hyperparameter Tuning
    Hyperparameter tuning in autoencoders is the process of selecting the best set of parameters that control how an autoencoder works. Proper hyperparameter tuning can improve the efficiency and accuracy of an autoencoder.

Listed below are a set of key hyperparameters to be considered −
  • Learning Rate − Determines the step size used by the optimization algorithm when minimizing the loss function. A higher learning rate can lead to faster convergence but less stability; a lower learning rate leads to slower convergence but more stability.
  • Batch Size − Specifies the number of training examples used per iteration. A larger batch size provides a more accurate estimate of the gradient but requires more memory and computational resources.
  • Number of Layers − Specifies the depth of the autoencoder architecture. More layers can capture more complex features but may lead to overfitting.
  • Number of Neurons per Layer − Determines the number of units in each layer. More neurons per layer can learn more detail but increase the complexity of the model.
  • Activation Functions − The mathematical functions applied to the outputs of each layer. Different activation functions (like ReLU, Sigmoid, Tanh) can affect the performance of the model.

Autoencoders Types and Applications


1. Vanilla Autoencoder

Vanilla autoencoders are the simplest form of autoencoders. They are also known as standard autoencoders. A vanilla autoencoder consists of two main components: an encoder and a decoder. The role of the encoder is to compress the input into a lower-dimensional representation, while the role of the decoder is to reconstruct the original input from this compressed representation. The main objective of a vanilla autoencoder is to minimize the error between the original input and the reconstructed output.

Applications of Vanilla Autoencoder

Vanilla autoencoders are simple yet powerful tools for machine learning tasks. Below are its applications −

Feature Extraction

Vanilla autoencoders can extract meaningful features from the input data, and these features can be used as input for other ML tasks. For example, in NLP, autoencoders can learn word embeddings that capture semantic similarities between words. These embeddings can then be used to improve text classification and sentiment analysis tasks.

Anomaly Detection

The ability of vanilla autoencoders to learn normal patterns in the data and identify deviations from these patterns makes them suitable for anomaly detection tasks. When the reconstruction error for a new input is significantly higher than the error observed on training data, the input is likely an anomaly. For example, autoencoders can be used in network security to detect unusual patterns in network traffic.

2. Sparse Autoencoder

Sparse autoencoders are specialized autoencoders designed to impose sparsity constraints on the hidden units or latent representation. Unlike vanilla autoencoders, which learn a dense representation of the input data, sparse autoencoders activate only a small number of neurons in the hidden layer. This yields a sparse, efficient representation of the data that focuses on the most relevant features.

The structure of a sparse autoencoder is like that of a vanilla autoencoder, but the key difference lies in the training process, where a sparsity constraint is added on the hidden layer. This constraint can be applied either by using a regularization technique like L1, which penalizes the activation of hidden neurons, or by explicitly limiting the number of active neurons, as sketched below.
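
One common way to impose the sparsity constraint in code is an L1 penalty on the bottleneck activations; the penalty weight of 1e-3, the layer sizes, and the placeholder batch below are assumed values.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)

x = torch.rand(32, 784)                     # placeholder batch
code = encoder(x)                           # hidden (latent) activations
x_hat = decoder(code)

sparsity_weight = 1e-3                      # assumed hyperparameter
# Reconstruction error plus an L1 penalty pushing activations toward zero
loss = F.mse_loss(x_hat, x) + sparsity_weight * code.abs().mean()
print(loss)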

Applications of Sparse Autoencoder

Sparse autoencoders have applications that leverage their ability to learn sparse representations −

Medical Imaging Analysis

Sparse autoencoders can be used to analyze medical images like MRI or CT scans. For example, by learning sparse representations that highlight critical regions of interest, they can help in detecting anomalies or specific structures like tumors or lesions within the images. This application is important as it helps identify diseases at an early stage.

Text Clustering and Topic Modeling

Sparse autoencoders can be used in NLP for text clustering and topic modeling tasks. For example, by learning sparse representations of text data these models can identify and group together documents with similar themes or topics.

3. Denoising Autoencoder

Denoising autoencoders (DAEs), as the name implies, are a special type of neural network designed to learn efficient representations of data by removing noise from the input. During training, noise is added to the input data, and the network reconstructs clean, noise-free data from this corrupted input, as sketched below.
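
The training trick is simple to express in code: corrupt the input, but compute the loss against the clean original. The noise level of 0.2 and the tiny network are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

autoencoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))

x_clean = torch.rand(32, 784)                          # original data
x_noisy = x_clean + 0.2 * torch.randn_like(x_clean)    # add Gaussian noise

x_hat = autoencoder(x_noisy)                 # reconstruct from the noisy input
loss = F.mse_loss(x_hat, x_clean)            # compare against the CLEAN target
print(loss)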

Applications of Denoising Autoencoder

Denoising autoencoders are useful in various applications where data quality can be affected by noise. Lets check out some of its applications −

Image Denoising

DAEs are used in image processing tasks to remove noise such as Gaussian noise, salt-and-pepper noise, and motion blur from photographs and visual data. For example, DAEs can improve the quality of MRI, CT scan, or X-ray images by removing noise.

Speech Enhancement

DAEs can be used in the field of audio processing to improve the clarity of speech recordings and enhance the quality of audio signal by removing the background noise. For example, in speech recognition systems, DAEs can improve the accuracy of speech-to-text conversion.

4. Contractive Autoencoder

Contractive autoencoders (CAEs) are designed to learn stable and reliable features from input data. During training, they add a special penalty to the learning process to ensure that small changes in the input do not cause big changes in the learned features. The advantage is that the model focuses on the important patterns in the data and ignores noise.

Applications of Contractive Autoencoder

Below are some of the useful applications of Contractive autoencoders −

Robust Feature Learning

CAEs can be used to learn features that are robust to noise and minor changes in the input data. For example, they are useful in image recognition tasks where small changes in angle or other effects should not change the model's understanding of the image.

Data Compression

CAEs can be used to compress data while preserving the important features. This makes them suitable for applications where bandwidth and storage are limited, such as mobile and IoT devices.

5. Convolutional Autoencoder

The convolutional autoencoder is one of the most powerful variants of autoencoders. It is specially designed for processing and generating images due to its ability to capture the spatial dependencies and hierarchical patterns present in visual data.

The structure of a convolutional autoencoder consists of an encoder and a decoder. The encoder consists of convolutional layers followed by pooling layers and reduces the spatial dimensions of the input image. The decoder takes the latent representation from the encoder and reconstructs the original input image using transposed convolutional layers, as sketched below.
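
A small sketch of this encoder/decoder pairing for 28x28 grayscale images; the channel counts and the use of strided convolutions in place of explicit pooling layers are illustrative choices.

import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),    # 28x28 -> 14x14
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),   # 14x14 -> 7x7
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
    nn.ReLU(),                                              # 7x7 -> 14x14
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
    nn.Sigmoid(),                                           # 14x14 -> 28x28
)

x = torch.rand(8, 1, 28, 28)                 # placeholder batch of images
print(decoder(encoder(x)).shape)             # torch.Size([8, 1, 28, 28])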

Applications of Convolutional Autoencoder

Below are the applications of Convolutional autoencoders −

Image Reconstruction

Convolutional autoencoders can be used to reconstruct high-resolution images from compressed latent representations. This makes them useful in image editing and restoration tasks.

Image Compression

Convolutional autoencoders can be used to compress high-resolution images into a lower-dimensional representation. This makes them useful in tasks that require reducing storage space while maintaining image quality.
