Module1

Generative AI (GenAI) is the latest subtype of AI that broadly describes Machine Learning (ML) models or algorithms.

Difference Between Traditional AI and Generative AI

Traditional AI Generative AI
AI is used to create intelligent systems that can perform those tasks which generally require human intelligence. It generates new text, audio, video, or any other type of content by learning patterns from existing training data.
The purpose of AI algorithms or models are to mimic human intelligence across wide range of applications. The purpose of generative AI algorithms or models is to generate new data having similar characteristics as data from the original dataset.

Traditional AI	Generative AI
AI is used to create intelligent systems that can perform those tasks which generally require human intelligence.	It generates new text, audio, video, or any other type of content by learning patterns from existing training data.
The purpose of AI algorithms or models are to mimic human intelligence across wide range of applications.	The purpose of generative AI algorithms or models is to generate new data having similar characteristics as data from the original dataset.

Overview of Generative Adversarial Network

GAN stands for Generative Adversarial Network, and it is a class of artificial intelligence algorithms used in machine learning and deep learning for generating data. GANs were introduced by Ian Goodfellow and his colleagues in 2014 and have since become a popular and powerful tool in various applications, including image generation, text generation, and more.

How does a GAN work?

GANs train by having two networks the Generator (G) and the Discriminator (D) compete and improve together. Here's the step-by-step process

1. Generator's First Move

The generator starts with a random noise vector like random numbers. It uses this noise as a starting point to create a fake data sample such as a generated image. The generator’s internal layers transform this noise into something that looks like real data.

2. Discriminator's Turn

The discriminator receives two types of data:

Real samples from the actual training dataset.
Fake samples created by the generator.

D's job is to analyze each input and find whether it's real data or something G cooked up. It outputs a probability score between 0 and 1. A score of 1 shows the data is likely real and 0 suggests it's fake.

3. Adversarial Learning

If the discriminator correctly classifies real and fake data it gets better at its job.
If the generator fools the discriminator by creating realistic fake data, it receives a positive update and the discriminator is penalized for making a wrong decision.

4. Generator's Improvement

Each time the discriminator mistakes fake data for real, the generator learns from this success.
Through many iterations, the generator improves and creates more convincing fake samples.

5. Discriminator's Adaptation

The discriminator also learns continuously by updating itself to better spot fake data.
This constant back-and-forth makes both networks stronger over time.

6. Training Progression

As training continues, the generator becomes highly proficient at producing realistic data.
Eventually the discriminator struggles to distinguish real from fake shows that the GAN has reached a well-trained state.
At this point, the generator can produce high-quality synthetic data that can be used for different applications.

Discriminative vs Generative Models

What are Discriminative Models?

Discriminative models are ML models and, concentrate on modeling the decision boundary between several classes of data using probability estimates and maximum likelihood. These types of models, mainly used for supervised learning, are also known as conditional models.

Discriminative models are not much affected by the outliers. Although this makes them a better choice than generative models, it also leads to misclassification problem which can be a big drawback.

Popular Discriminative Models

Logistic Regression

Support Vector Machines

K-nearest Neighbor (KNN)

What are Generative Models?

Generative models are ML models and, as the name suggests, aim to capture the underlying distribution of data, and generate new data comparable to the original training data. These types of models, mainly used for unsupervised learning, are categorized as a class of statistical models capable of generating new data instances.

The only drawback of generative models, when compared to discriminative models, is that they are prone to outliers.

Popular Generative Models

Bayesian Network

Generative Adversarial Network (GAN)

Variational Autoencoders (VAEs)

Autoregressive model, Nave Bayes, Markov random field, Hidden Markov model (HMM), Latent Dirichlet Allocation (LDA) are few other examples of the commonly used generative models.

Difference Between Discriminative and Generative Models

Characteristic	Discriminative Models	Generative Models
Objective	Focus on learning the boundary between different classes directly from the data. Their primary objective is to classify input data accurately based on the learned decision boundary.	Aim to understand the underlying data distribution and generate new data points that resemble the training data. They focus on modeling the process of data generation, allowing them to create synthetic data instances.
Probability Distribution	Estimates the parameters of probability P(Y\|X) from the training dataset.	Calculates the posterior probability P(Y\|X) using the Bayes Theorem.
Handling Outliers	Relatively robust to outliers	Prone to outliers
Property	They do not possess generative properties.	They possess discriminative properties.
Applications	Commonly used in classification tasks, such as image recognition and sentiment analysis.	Commonly used in tasks like data generation, anomaly detection, and data augmentation, beyond traditional classification tasks.
Examples	Logistic regression, Support vector machines, Decision trees, neural nets etc.	Variational Autoencoders (VAEs), Generative adversarial network (GAN), Nave Bayes etc.

The Role of Probability Distribution in Generative Models

What is Probability Distribution?

Probability Distribution is a mathematical function that represents the probability of different possible values of a random variable within a given range.

A probability distribution is a theoretical representation of frequency distribution (FD). In statistics, FD describes the number of occurrences of a variable in a dataset. On the other hand, probability distribution, along with the frequencies of number of occurrences, also assigns probabilities to them.

Types of Probability Distributions

There are two types of probability distributions −

Discrete Probability Distributions
Continuous Probability Distributions

Discrete Probability Distributions

Discrete probability distributions are mathematical functions that describe the probabilities of different occurrences from a discrete or categorial random variables.

Discrete probability distribution includes only those values with a possible probability. In simple words, it does not include any value with zero probability. For example, 5.5 is not a possible outcome of dice rolls, hence it does not include as a probability distribution of dice rolls.

The total of the probabilities of all possible values in a discrete probability distribution is always one.

common discrete probability distributions

Discrete Probability Distribution	Explanation	Example
Bernoulli Distribution	It describes the probability of success (1) or failure (0) in a single experiment.	The outcome of a single coin flip.
Binomial Distribution	It models the number of successes in a fixed number of trials n with p probability.	The number of times it comes heads when you toss a coin 10 times.
Poisson Distribution	It predicts the k number of events occurring in a fixed interval of time or space.	The number of emails messages received per day.
Geometric Distribution	It represents the number of trials needed to achieve the first success in a sequence of trials.	The number of times a coin is flipped until it lands on heads.
Hypergeometric Distribution	It calculates the probability of drawing a specific number of successes from a finite population.	The number of red balls drawn from a bag of mixed colored balls.

Continuous Probability Distributions

Continuous probability distributions are mathematical functions that describe the probabilities of different occurrences within a continuous range of values.

This includes an infinite number of possible values. For example, in the interval [4, 5] there are infinite values between 4 and 5.

common continuous probability distributions

Continuous Probability Distribution	Explanation	Example
Continuous Uniform Distribution	It assigns equal probability to all values within equal-sized interval.	The height of a person between 5 to 6 feet.
Normal (Gaussian) Distribution	It forms a bell-shaped curve and describes the data clustered around the mean and symmetrical tails.	IQ scores
Exponential Distribution	It models the time between events in a Poisson process, where events occur at a constant rate.	The time until the next customer arrives.
Log-normal Distribution	It represents the right-skewed data when plotted on a logarithmic scale.	Stock prices, income distributions, etc.
Beta Distribution	It describes the random variables constrained to a finite interval. It is often used in Bayesian statistics.	The probability of success in a binomial trial.

Use of Probability Distributions in Generative Modeling

Probability distributions play a crucial role in generative modeling.

Data Distribution − Generative Models aim to capture the underlying probability distribution of data from which the samples are taken.
Generating New Samples − Once understanding the data distribution is done, generative models can generate new data comparable to the original dataset.
Evaluation and Training − Probability distributions are used to evaluate and train generative models. Evaluation metrics such as likelihood, perplexity, and Wasserstein distance are used to evaluate the quality of generated samples compared to the original dataset.
Variability and Uncertainty − Probability distributions are used to find the variability and uncertainty present in the data. Generative models can use this information to generate distinct and realistic samples.

Introduction to PyTorch framework for deep learning

PyTorch is defined as an open source machine learning library for Python. It is used for applications such as natural language processing.

Features

The major features of PyTorch are mentioned below −

Easy Interface − PyTorch offers easy to use API; hence it is considered to be very simple to operate and runs on Python. The code execution in this framework is quite easy.

Python usage − This library is considered to be Pythonic which smoothly integrates with the Python data science stack. Thus, it can leverage all the services and functionalities offered by the Python environment.

Computational graphs − PyTorch provides an excellent platform which offers dynamic computational graphs. Thus a user can change them during runtime. This is highly useful when a developer has no idea of how much memory is required for creating a neural network model.

PyTorch is known for having three levels of abstraction as given below −

Tensor − Imperative n-dimensional array which runs on GPU.
Variable − Node in computational graph. This stores data and gradient.
Module − Neural network layer which will store state or learnable weights.

The following are the advantages of PyTorch −

It is easy to debug and understand the code.
It includes many layers as Torch.
It includes lot of loss functions.
It can be considered as NumPy extension to GPUs.
It allows building networks whose structure is dependent on computation itself.

Pytorch - Implementing First Neural Network

To create a simple neural network with one hidden layer developing a single output unit.

Step 1

import the PyTorch library using the below command −

import torch 
import torch.nn as nn

Step 2

Define all the layers and the batch size to start executing the neural network as shown below −

# Defining input size, hidden layer size, output size and batch size respectively

n_in, n_h, n_out, batch_size = 10, 5, 1, 10

Step 3

As neural network includes a combination of input data to get the respective output data, we will be following the same procedure as given below −

# Create dummy input and target tensors (data)

x = torch.randn(batch_size, n_in) y = torch.tensor([[1.0], [0.0], [0.0], [1.0], [1.0], [1.0], [0.0], [0.0], [1.0], [1.0]])

Step 4

Create a sequential model with the help of in-built functions. Using the below lines of code, create a sequential model −

# Create a model

model = nn.Sequential(nn.Linear(n_in, n_h), nn.ReLU(), nn.Linear(n_h, n_out), nn.Sigmoid())

Step 5

Construct the loss function with the help of Gradient Descent optimizer as shown below −

#Construct the loss function

criterion = torch.nn.MSELoss()

# Construct the optimizer (Stochastic Gradient Descent in this case) optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)

Step 6

Implement the gradient descent model with the iterating loop with the given lines of code −

# Gradient Descent

for epoch in range(50): # Forward pass: Compute predicted y by passing x to the model y_pred = model(x) # Compute and print loss loss = criterion(y_pred, y) print('epoch: ', epoch,' loss: ', loss.item()) # Zero gradients, perform a backward pass, and update the weights. optimizer.zero_grad() # perform a backward pass (backpropagation) loss.backward() # Update the parameters optimizer.step()

Step 7

The output generated is as follows −

epoch: 0 loss: 0.2545787990093231 epoch: 1 loss: 0.2545052170753479 epoch: 2 loss: 0.254431813955307 epoch: 3 loss: 0.25435858964920044 epoch: 4 loss: 0.2542854845523834 epoch: 5 loss: 0.25421255826950073 epoch: 6 loss: 0.25413978099823 epoch: 7 loss: 0.25406715273857117 epoch: 8 loss: 0.2539947032928467 epoch: 9 loss: 0.25392240285873413 epoch: 10 loss: 0.25385022163391113 epoch: 11 loss: 0.25377824902534485 ....

Module 2

Architecture of GAN

GANs consist of two main models that work together to create realistic synthetic data which are as follows:

1. Generator Model

The generator is a deep neural network that takes random noise as input to generate realistic data samples like images or text. It learns the underlying data patterns by adjusting its internal parameters during training through backpropagation. Its objective is to produce samples that the discriminator classifies as real.

The Role of Generator in GAN Architecture

The first primary part of GAN architecture is the Generator.

Generator: Function and Structure

The primary goal of the generator is to generate new data samples that are intended to resemble real data from the dataset. It begins with a random noise vector and transforms it through fully connected layers like Dense or Convolutional layers to generate synthetic data sample.

Generator: Layers and Components

Listed below are the layers and components of the generator neural network −

Input Layer − The generator receives a low dimensionality random noise vector or input data as input.
Fully Connected Layers − The FLC is used to increase the input noise vector dimensionality.
Transposed Convolutional Layers − These layers are also known as deconvolutional layers. It is used for upsampling i.e., to generate an output feature map having greater spatial dimension than the input feature map.
Activation Functions − Two commonly used activations functions are: Leaky ReLU and Tanh. The Leaky ReLU activation function helps in decreasing the dying ReLU problem, while the Tanh activation function makes sure that the output is within a specific range.
Output Layer − It produces the final data output like an image of a certain resolution.

Generator Loss Function: The generator tries to minimize this loss:

JG=−m1Σi=1mlogD(G(zi))

where

$J_{G}$ measure how well the generator is fooling the discriminator.
$G (z_{i})$ is the generated sample from random noise $z_{i}$
$D (G (z_{i}))$ is the discriminator’s estimated probability that the generated sample is real.

The generator aims to maximize $D (G (z_{i}))$ meaning it wants the discriminator to classify its fake data as real (probability close to 1).

The goal of generator neural network is to create data that the discriminator cannot distinguish from real data. This can be achieved by minimizing the generators loss function.

2. Discriminator Model

The discriminator acts as a binary classifier helps in distinguishing between real and generated data. It learns to improve its classification ability through training, refining its parameters to detect fake samples more accurately. When dealing with image data, the discriminator uses convolutional layers or other relevant architectures which help to extract features and enhance the model’s ability.

The Role of Discriminator in GAN Architecture

The second part of GAN architecture is the Discriminator.

Discriminator: Function and Structure

The primary goal of the discriminator is to classify the input data as real or generated by the generator. It takes a data sample as input and gives a probability as output that indicates whether the sample is real or fake.

Discriminator: Layers and Components

Listed below are the layers and components of the discriminator neural network −

Input Layer − The discriminator receives a data sample from either the real dataset or the generator as input.
Convolutional Layers − It is used for downsampling the input data to extract relevant features.
Fully Connected Layers − The FLC is used to process the extracted features and make a final classification.
Activation Functions − It uses Leaky ReLU activation function to address the vanishing gradient problem. It also introduces non-linearity.
Output Layer − As name implies, it gives a single probability value between 0 and 1 as output that indicates whether the sample is real or fake.

Discriminator Loss Function: The discriminator tries to minimize this loss:

$J_{D} = - \frac{1}{m} Σ_{i = 1}^{m} l o g D (x_{i}) - \frac{1}{m} Σ_{i = 1}^{m} l o g (1 - D (G (z_{i}))$

$J_{D}$ measures how well the discriminator classifies real and fake samples.
$x_{i}$ is a real data sample.
$G (z_{i})$ is a fake sample from the generator.
$D (x_{i})$ is the discriminator’s probability that $x_{i}$ is real.
$D (G (z_{i}))$ is the discriminator’s probability that the fake sample is real.

The discriminator wants to correctly classify real data as real (maximize $l o g D (x_{i})$ and fake data as fake (maximize $l o g (1 - D (G (z_{i}))$ ).

The goal of discriminator neural network is to maximize its ability to correctly distinguish real data from generated data. This is achieved by minimizing the discriminators loss function.

MinMax Loss

GANs are trained using a MinMax Loss between the generator and discriminator:

$m i n_{G} m a x_{D} (G, D) = [E_{x \sim p_{d a t a}} [l o g D (x)] + E_{z \sim p_{z} (z)} [l o g (1 - D (g (z)))]$

where,

$G$ is generator network and is $D$ is the discriminator network
$p_{d a t a} (x)$ = true data distribution
$p_{z} (z)$ = distribution of random noise (usually normal or uniform)
$D (x)$ = discriminator’s estimate of real data
$D (G (z))$ = discriminator’s estimate of generated data

The generator tries to minimize this loss (to fool the discriminator) and the discriminator tries to maximize it (to detect fakes accurately).

Types of GANs

There are several types of GANs each designed for different purposes. Here are some important types:

1. Deep Convolutional GAN (DCGAN)

Deep Convolutional GANs (DCGANs) are among the most popular types of GANs used for image generation.

They are important because they:

Uses Convolutional Neural Networks (CNNs) instead of simple multi-layer perceptrons (MLPs).
Max pooling layers are replaced with convolutional stride helps in making the model more efficient.
Fully connected layers are removed, which allows for better spatial understanding of images.

DCGANs are successful because they generate high-quality, realistic images.

Need for DCGANs:

DCGANs are introduced to reduce the problem of mode collapse. Mode collapse occurs when the generator got biased towards a few outputs and can't able to produce outputs of every variation from the dataset. For example- take the case of mnist digits dataset (digits from 0 to 9) , we want the generator should generate all type of digits but sometimes our generator got biased towards two to three digits and produce them only. Because of that the discriminator also got optimized towards that particular digits only, and this state is known as mode collapse. But this problem can be overcome by using DCGANs.

The generator of the DCGAN architecture takes 100 uniform generated values using normal distribution as an input. First, it changes the dimension to 4x4x1024 and performed a fractionally stridden convolution 4 times with a stride of 1/2 (this means every time when applied, it doubles the image dimension while reducing the number of output channels). The generated output has dimensions of (64, 64, 3). There are some architectural changes proposed in the generator such as the removal of all fully connected layers, and the use of Batch Normalization which helps in stabilizing training. ReLU activation function is used in all layers of the generator, except for the output layers.

The role of the discriminator here is to determine that the image comes from either a real dataset or a generator. The discriminator can be simply designed similar to a convolution neural network that performs an image classification task. Instead of fully connected layers, they used only strided-convolutions with LeakyReLU as an activation function, the input of the generator is a single image from the dataset or generated image and the output is a score that determines whether the image is real or generated.

2. Wasserstein GAN (WGANs):

Wasserstein Generative Adversarial Network (WGANs) is a variation of Deep Learning GAN with little modification in the algorithm. Generative Adversarial Network (GAN) is a method for constructing an efficient generative model. Martin Arjovsky, Soumith Chintala, and Léon Bottou developed this network in 2017. This is used widely to produce real images.

WGAN's architecture uses deep neural networks for both generator and discriminator. The key difference between GANs and WGANs is the loss function and the gradient penalty. WGANs were introduced as the solution to mode collapse issues.

WGAN architecture

WGANs use the Wasserstein distance, which provides a more meaningful and smoother measure of distance between distributions.

$W (P_{r}, P_{g}) = \inf_{γ ϵ \prod (P_{r}, P_{g})} E_{(x, y) \sim γ)} [∣ ∣ x - y ∣ ∣]$

γ denotes the mass transported from x to y in order to transform the distribution Pr to Pg.
denotes the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pg.

Benefits of WGAN algorithm over GAN

WGAN is more stable due to the Wasserstein Distance which is continuous and differentiable everywhere allowing to perform gradient descent.
It allows to train the critic till optimality.
There is still no evidence of model collapse.
Not struck in local minima in gradient descent.
WGANs provide more flexibility in the choice of network architectures. The weight clipping, generators architectures can be changed according to choose.

3. Conditional GAN (CGANs):

Conditional GAN (cGAN) extends the GAN framework by including the condition information like class labels, attributes, or even other data samples, into both the generator and the discriminator networks.

With the help of these conditioning information, Conditional GANs provide us the control over the characteristic of the generated output.

Architecture of Conditional GANs

Like traditional GANs, the architecture of a Conditional GAN consists of two main components: a generator network and a discriminative network.

The only difference is that in Conditional GANs, both the generator network and discriminative network receive additional conditioning information y along with their respective inputs. Lets understand it with the help of this diagram −

The Generator Network

The generator networks, as shown in the above diagram, takes two inputs: a random noise vector which is sampled from a predefined distribution and the conditioning information "y". It now transforms it into synthetic data samples. Once transformed, the goal of the generator is to not only produce data that is identical to real data but also align with the provided conditional information.

The Discriminator Network

The discriminator network receives both real data samples and fake samples generated by the generator, along with the conditioning information "y".

The goal of the discriminator network is to evaluate the input data and tries to distinguish between real data samples from the dataset and fake data samples generated by the generator model while considering the provided conditioning information.

Conditional Information

Conditional information often denoted by "y" is an additional information which is provided to both generator network and discriminator network to condition the generation process. Based on the application and the required control over the generated output, conditional information can take various forms.

Types of Conditional Information

Some of the common types of conditional information are as follows −

Class Labels − In image classification tasks, conditional information "y" may represent the class labels corresponding to different categories. For example, in handwritten digits dataset, "y" could indicate the digit class (0-9) that the generator network should produce.
Attributes − In image generation tasks, conditional information "y" may represent specific attributes or features of the desired output, such as the color of objects, the style of clothing, or the pose of a person.
Textual Descriptions − For text-to-image synthesis tasks, conditional information "y" may consist of textual descriptions or captions describing the desired characteristics of the generated image.

Applications of Conditional GANs

Listed below are some of the fields where Conditional GANs find its applications −

Image-to-Image Translation

Conditional GANs are best suited for tasks like translating images from one domain to another. Translating images includes converting satellite images to maps, transforming sketches into realistic images, or converting day-time scenes to night-time scenes etc.

Semantic Image Synthesis

Conditional GANs can condition on semantic labels, hence they can generate realistic images based on textual descriptions or semantic layouts.

Super-Resolution and Inpainting

Conditional GANs can also be used for image super-resolution tasks in which low-resolution images are transformed into similar high-resolution images. They can also be used for inpainting tasks in which, based on contextual information, missing parts of an image are filled in.

Style Transfer and Editing

Conditional GANs allow us to manipulate specific attributes like color, texture, or artistic style while preserving other aspects of the image.

Challenges in using Conditional GANs

Conditional GANs offer significant advancements in generative modeling but they also have some challenges. Lets see which kind of challenges you can face while using Conditional GANs −

Mode Collapse

Like traditional GANs, Conditional GANs can also experience mode collapse. In mode collapse, the generator learns to produce limited varieties of samples and fails to capture the entire data distribution.

Conditioning Information Quality

The effectiveness of Conditional GANs depends on the quality and relevance of the provided conditioning information. Noisy or irrelevant conditioning information can lead to poor generation outputs.

Training Instability

The training instability issues observed in traditional GANs can also be faced by Conditioning GANs. To avoid this, CGANs require careful architecture design and training approaches.

Scalability

With the increased complexity of conditioning information, it becomes difficult to handle Conditional GANs. It then requires more computational resources.

Evaluation Metrics for GANs

Evaluating the output of a Generative Adversarial Network isn't as straightforward as calculating accuracy or loss in supervised learning. Since the generator's goal is to produce realistic and diverse samples mimicking a target distribution, we need metrics that assess both the quality (fidelity) of individual generated images and the variety (diversity) of the entire generated set. Simply looking at samples can be subjective and doesn't scale well, while the generator and discriminator losses during training often don't correlate strongly with the perceived quality of the final output. Therefore, specialized quantitative metrics are necessary to provide objective comparisons between different GAN models or training checkpoints.

The core challenge lies in comparing probability distributions: the distribution of real data, $p_{d a t a}$ , and the distribution implicitly defined by the generator, $p_{g}$ . We want to measure how "close" $p_{g}$ is to $p_{d a t a}$ .

Two prominent metrics have emerged as standards in the field:

1. Inception Score (IS) and

2. Fréchet Inception Distance (FID).

Inception Score (IS)

The Inception Score aims to capture both fidelity and diversity using a pre-trained image classification model, typically Inception V3 trained on ImageNet. The intuition is twofold:

Fidelity: Images generated by a good GAN should be clearly recognizable and contain meaningful objects. When passed through the Inception classifier, the conditional probability distribution $p (y ∣ x)$ (the probability of image $x$ elonging to class $y$ ) should have low entropy. This means the classifier is confident about assigning the image to a specific class.
Diversity: The generator should produce images covering a wide variety of classes present in the dataset. Therefore, the marginal probability distribution $p (y) = \int p (y ∣ x) p_{g} (x) d x$ (the overall distribution of classes across all generated images) should have high entropy. This indicates that the generator isn't stuck producing images of only a few classes (mode collapse).

These two ideas are combined using the Kullback-Leibler (KL) divergence between the conditional and marginal distributions, averaged over all generated samples $x \sim p_{g}$ :

$I S = \exp (E_{x \sim p_{g}} [D_{K L} (p (y ∣ x) ∣ ∣ p (y))])$

A higher Inception Score is generally considered better. However, IS has limitations. It primarily measures whether generated images look like any of the ImageNet classes, not necessarily the specific classes in the target dataset if it's different from ImageNet. It also doesn't directly compare the generated images to real images from the target distribution and can be susceptible to adversarial examples within classes. Furthermore, it has been shown that IS doesn't always correlate well with human perception of image quality, especially regarding diversity within a class.

Fréchet Inception Distance (FID)

The Fréchet Inception Distance has become a more popular and widely adopted metric because it addresses some of the shortcomings of the IS. FID compares the statistics of generated images directly to the statistics of real images from the target dataset. It operates in the feature space of a pre-trained Inception V3 model.

Here's how FID is calculated:

Feature Extraction: Select a specific layer from the pre-trained Inception V3 network (commonly the final average pooling layer before the classification head). Pass a large number of real images ( $x_{r}$ ) and generated images ( $x_{g}$ ) through the network up to this layer to obtain feature vectors for each image.
Distribution Modeling: Assume the extracted feature vectors for the real images and the generated images follow multivariate Gaussian distributions. Calculate the mean vector ( $μ_{r}$ , $μ_{g}$ ) and the covariance matrix ( $Σ_{r}$ , $Σ_{g}$ ) for the feature vectors of the real and generated sets, respectively.
Distance Calculation: Compute the Fréchet distance (also known as the Wasserstein-2 distance for Gaussian distributions) between the two modeled distributions ( $N (μ_{r}, Σ_{r})$ and $N (μ_{g}, Σ_{g})$ ). The formula is:
$F I D = ∣ ∣ μ_{r} - μ_{g} ∣ ∣_{2}^{2} + Tr (Σ_{r} + Σ_{g} - 2 (Σ_{r} Σ_{g})^{1 / 2})$
Here, $∣ ∣ \cdot ∣ ∣_{2}^{2}$ denotes the squared Euclidean distance between the mean vectors, $Tr$ is the trace of a matrix (sum of diagonal elements), and $(Σ_{r} Σ_{g})^{1 / 2}$ is the matrix square root of the product of the covariance matrices.

A lower FID score indicates that the statistics of the generated image features are more similar to the statistics of the real image features, implying that the generated distribution $p_{g}$ is closer to the real data distribution $p_{d a t a}$ . Lower FID generally corresponds to better image quality and diversity.

FID is more sensitive to noise, sensitive to mode collapse (as it affects both mean and covariance), and correlates better with human judgment of image quality than IS. However, it requires a significant number of samples (typically 10,000 to 50,000) from both real and generated distributions to reliably estimate the means and covariance matrices. Its computation is also more intensive than IS.

Other Metrics and Considerations

Precision and Recall for Distributions: These metrics adapt concepts from information retrieval to GAN evaluation. Precision measures the fraction of generated samples that are considered realistic (fidelity), while Recall measures the fraction of real samples that the generator can produce (diversity).
Perceptual Path Length (PPL): Used primarily for style-based generators (like StyleGAN), PPL measures the smoothness of the generator's latent space. Small changes in the latent input vector should ideally lead to small, perceptually smooth changes in the output image.

Module 3

Introduction to Autoencoders

Autoencoders are an essential tool in the field of machine learning and deep learning. They are a special type of unsupervised feedforward neural network designed to learn efficient representations of the data for the purpose of dimensionality reduction, feature extraction, and generating new data.
Autoencoders consists of two components an encoder network and a decoder network. The encoder network works as a compression unit that compresses the input data into a lower-dimensional representation. On the other hand, the decoder network decompresses the compressed input data by reconstructing it.

What are Autoencoders?

Autoencoders, designed for unsupervised learning, are a class of artificial neural networks. Like any other neural network, it consists of three different types of layers-Input, hidden and output. The number of input units in the input layer are exactly equal to the output units in the output layer. But the middle layer, i.e., the hidden layer in this network has a fewer number of units than that of input and output layers.

It first compresses the input data into a lower-dimensional representation. As the hidden layer has a lower number of units, it holds this lower-dimensional representations. Finally, at the output layer, the output is rebuilt from this reduced representation of the input.

Autoencoders are also called self-supervised ML models because they are trained as supervised ML models but while using, they work as unsupervised ML models.

Architecture of Autoencoders

The core architecture of an autoencoders is divided into encoder, decoder and bottleneck layer as shown in the below diagram −

Encoder − Encoder is a fully connected feed forward neural network (FFNN) that compresses the input data into a lower-dimensional representation.
Bottleneck layer − The bottleneck layer contains the lower-dimensional representation of the input which is to be fed into the decoder.
Decoder − Decoder is a fully connected feed forward neural network (FFNN) that reconstruct the input back to the original dimensions.

Working of Autoencoder

The principle behind the working of an autoencoders is to train the neural network to reconstruct its input data from a lower-dimensional representation. This involves two main components: the encoder network and the decoder network.

The Encoder Network

The encoder network compresses the input into a lower-dimensional representation. This process involves the following steps −

Input Layer − The input data is fed into the network through input layer.
Hidden Layers − The input data now passes through several hidden layers where each layer first applies a linear transformation and then a non-linear activation function. Each layer has fewer neurons than the previous one which gradually reduces the dimensionality of the input data.
Bottleneck Layer (Latent Space Representation) − Bottleneck layer, the final layer of the encoder network, stores the compressed representation of the input. This layer helps the network to learn the most essential features of the input because it has a much lower dimensionality than the input data.

The Decoder Network

The decoder network reconstructs reconstruct the original input data from the lower-dimensional representation. This process is basically the reverse of the encoding process. It involves the following steps −

Bottleneck Layer (Latent Space Representation) − The compressed data stored by the bottleneck layer is used as the input for the decoder network.
Hidden Layers − The input data now passes through several hidden layers where each layer first applies a linear transformation and then a non-linear activation function. Each layer has more neurons than the previous one which gradually expanding the dimensionality of the input data back to the original input size.
Output Layer − Output layer, the final layer of the decoder network, reconstructs the data to match the original input dimensions.

The Training Process

The training process of network to reconstruct its input data from a lower-dimensional representation involves the steps given below −

Initialization − First the weights of the network are initialized randomly.
Forward Propagation − In this step the input data is first passed through the encoder to convert it into lower dimensions and then passed through the decoder to reconstruct the input as original.
Loss Calculation − The loss function is used to measure the difference between the original input data and its reconstructed output. Some of the common loss functions are Mean Squared Error (MSE) for continuous data or Binary Cross-Entropy for binary data.
Backward Propagation − In this step, to minimize the loss function, the network adjusts its weights. You can use gradient descent or any other optimization algorithm.

Hyperparameter Tuning

Hyperparameter tuning in autoencoder is the process of selecting the best set of parameters that control how an autoencoder work. Proper hyperparameter tuning can improve the efficiency and accuracy of an autoencoder.

Listed below are a set of key hyperparameters to be considered −

Learning Rate − It determines the step size while using the optimization algorithm for minimizing the loss function. A higher learning rate can lead to faster convergence but with less stability. On the other hand, lower learning can lead to slow convergence but with more stability.
Batch Size − It specifies the number of training examples utilized per iteration. Larger batch size can provide more accurate estimate of the gradient but require more memory and computational resources.
Number of Layers − It specify the depth of the autoencoder architecture. More number of layers can capture more complex features, but they may lead to overfitting.
Number of Neurons per Layer − It determines the number of units in each layer. More number of neurons per layer can learn more details but it increases the complexity of the model.
Activation Functions − These are the mathematical functions applied to the outputs of each layer. Different activation functions (like ReLU, Sigmoid, Tanh) can affect the performance of model.

Autoencoders Types and Applications

1. Vanilla Autoencoder

Vanilla autoencoders are the simplest form of autoencoders. They are also known as standard autoencoders. It consists of two main components: an encoder and a decoder. The role of encoder is to compress the input into a lower-dimensional representation. On the other hand, the role of the decoder is to reconstruct the original input from this compressed representation. The main objective of a vanilla autoencoder is to minimize the error between the original input and the reconstructed output.

Applications of Vanilla Autoencoder

Vanilla autoencoders are simple yet powerful tools for machine learning tasks. Below are its applications −

Feature Extraction

Vanilla autoencoders can extract meaningful features from the input data. We can even use these features as input for other ML tasks. For example, in NLP, autoencoders can be used to learn word embeddings that obtain semantic similarities between words. These embeddings can also be used to improve text classification and sentiment analysis tasks.

Anomaly Detection

The ability of vanilla autoencoders to learn normal patterns in the data and identify deviations from these patterns makes them suitable for anomaly detection tasks. When the reconstruction error between new input data and training data is significantly higher than there is an anomaly. For example, autoencoders can be used in network security to detect unusual patterns of network traffic.

2. Sparse Autoencoder

Sparse autoencoders are specialized types of autoencoders that are designed to propose sparsity constraints within the hidden units or latent representation. Unlike vanilla autoencoders, which learn dense representation of input data, sparse autoencoders activate only a small number of neurons in the hidden layer. This approach helps in sparse, efficient representation of data and focusing on the most relevant features.

The structure of Sparse autoencoder is like vanilla autoencoder but the key difference lies in the training process where a sparsity constraint is added in the hidden layer. This constraint can be applied either by using regularization technique like L1 which penalizes the activation of hidden neurons or by explicitly limiting the number of active neurons.

Applications of Sparse Autoencoder

Sparse autoencoders has applications that leverage their ability to learn sparse representations −

Medical Imaging Analysis

Sparse autoencoders can be used to analyze medical images like MRI or CT scans. For example, by learning sparse representations that highlight critical regions of interest, they can help in detecting anomalies or specific structures like tumors or lesions within the images. This application is important as it helps identify diseases at an early stage.

Text Clustering and Topic Modeling

Sparse autoencoders can be used in NLP for text clustering and topic modeling tasks. For example, by learning sparse representations of text data these models can identify and group together documents with similar themes or topics.

3. Denoising Autoencoder

Denoising autoencoders (DAEs), as the name implies, are a special type of neural networks which are designed to learn efficient representation of data by removing noise from the input. During training, noise is added to the input data, and they reconstruct clean, noise-free data from this corrupted or noisy input.

Applications of Denoising Autoencoder

Denoising autoencoders are useful in various applications where data quality can be affected by noise. Lets check out some of its applications −

Image Denoising

DAEs are used in image processing tasks to remove noises like gaussian, salt-and-paper, and blur motion from photographs and visual data. For example, DAEs can improve the quality of MRI, CT-Scan or X-ray images by removing the noise.

Speech Enhancement

DAEs can be used in the field of audio processing to improve the clarity of speech recordings and enhance the quality of audio signal by removing the background noise. For example, in speech recognition systems, DAEs can improve the accuracy of speech-to-text conversion.

4. Contractive Autoencoder

Contractive autoencoders (CAEs) autoencoders are designed to learn stable and reliable features from input data. During training, they add a special penalty to the learning process to make sure that small changes in the input will not cause big changes in the learned features. Its advantage is that the model will focus on the important patterns in the data and ignores the noise.

Applications of Contractive Autoencoder

Below are some of the useful applications of Contractive autoencoders −

Robust Feature Learning

CAEs can be used to learn features that are robust to noise and some minor changes in the input data. For example, they are useful in image recognition tasks where small changes in angle or other effects should not change the models understanding about that image.

Data Compression

CAEs can be used to compress data while preserving the important features. This makes them suitable for applications where bandwidth and storage are limited, like in mobiles and IoT devices.

5. Convolutional Autoencoder

Convolutional autoencoder is one of the most powerful variants of autoencoders. It is specially designed for processing and generating images due to their ability to capture spatial dependencies and hierarchical patterns present in visual data.

The structure of convolutional autoencoder consists of an encoder and decoder. The encoder consists of convolutional layers followed by pooling layers. It reduces the spatial dimensions of the input image. The decoder, on the other hand, takes the latent representation from encoder and reconstructs the original input image by using transposed convolutional layers.

Applications of Convolutional Autoencoder

Below are the applications of Convolutional autoencoders −

Image Reconstruction

Convolutional autoencoders can be used to reconstruct high-resolution images from the compressed latent representations. It makes them useful in image editing and restoration tasks.

Image Compression

Convolutional autoencoders can be used to compress high-resolution images into a lower dimensional representation. It makes them useful in tasks that require reducing storage space while maintaining the quality of image.

Variational Autoencoders

Variational autoencoders are a type of neural network that extends the concept of traditional autoencoders by adding a probabilistic approach to it. While the traditional autoencoders are designed to compress and regenerate the input data from latent space, VAEs, by using the probabilistic approach, can regenerate input data as well as generate new data samples by learning the underlying patterns in the input data. This ability of VAEs makes them very useful for tasks like making realistic images or creating new data points.

Traditional Autoencoders vs Variational Autoencoders

Aspect	Autoencoders	Variational Autoencoders (VAEs)
Latent Space	Autoencoders encode the input data into a deterministic point in the latent space.	Variational Autoencoders encode the input data into a probability distribution in the latent space.
Encoder Output	The encoder in autoencoders produces a single vector representation of the input.	The encoder in VAEs produces two vectors-the mean and variance of the latent distribution.
Decoder Input	The decoder in autoencoders takes the single vector from the encoder as input to regenerate the input data from latent space.	The decoder in VAEs samples from the latent space using the mean and variance vectors as input.
Training Objective	Autoencoders aim to minimize the reconstruction error between the input and the output.	VAEs aim to minimize both the reconstruction error and the KL divergence between the learned and prior distributions.
Reconstruction Loss	Autoencoders typically use Mean Squared Error (MSE) or Binary Cross-Entropy for reconstruction loss.	VAEs also use Mean Squared Error (MSE) or Binary Cross-Entropy for reconstruction loss.
Regularization	Autoencoders do not inherently include any regularization in the latent space.	VAEs include a KL divergence term to regularize the latent space.
Generative Capability	Autoencoders cannot generate new data samples from the input data.	VAEs can generate new data samples similar to the input data.
Use of Prior Distribution	Autoencoders do not use a prior distribution in the latent space.	VAEs use a prior distribution, generally a standard normal distribution, in the latent space.
Complexity	Autoencoders are easy to implement.	VAEs are more complex due to the probabilistic components and the need for regularization.
Robustness to Overfitting	Autoencoders can be prone to overfitting without proper regularization.	VAEs are less prone to overfitting due to the regularizing effect of the KL divergence term.
Output Quality	Autoencoders can accurately reconstruct input data.	VAEs can generate new, realistic data samples.
Use Cases	Autoencoders are used for dimensionality reduction, feature extraction, denoising, and anomaly detection.	VAEs are used for generative modeling, data augmentation, semi-supervised learning, and image synthesis.

Architecture of Variational Autoencoder

VAE is a special kind of autoencoder that can generate new data instead of just compressing and reconstructing it. It has three main parts:

1. Encoder (Understanding the Input)

The encoder takes input data like images or text and learns its key features. Instead of outputting one fixed value, it produces two vectors for each feature:

Mean (μ): A central value representing the data.
Standard Deviation (σ): It is a measure of how much the values can vary.

These two values define a range of possibilities instead of a single number.

2. Latent Space (Adding Some Randomness)

Instead of encoding the input as one fixed point it pick a random point within the range given by the mean and standard deviation. This randomness lets the model create slightly different versions of data which is useful for generating new, realistic samples.

3. Decoder (Reconstructing or Creating New Data)

The decoder takes the random sample from the latent space and tries to reconstruct the original input. Since the encoder gives a range, the decoder can produce new data that is similar but not identical to what it has seen.

Variational Autoencoder Loss Function

The loss function of a variational autoencoder combines the following two components −

Reconstruction Loss

The reconstruction loss is used to make sure that the decoder can accurately reconstruct the input from the latent space representation received from hidden layer. It is calculated as the mean squared error (MSE) between the original input and reconstructed input.

KL Divergence

The KL divergence measures the deviation of the learned distribution from prior distribution. The prior distribution in VAE is generally a standard normal distribution. The term KL divergence regularizes the latent space representation and ensures it has properties that are useful for generative tasks.

Total VAE Loss

The total loss function for training a VAE is the sum of the two key components i.e., the reconstruction loss and the KL divergence. This total loss makes sure that the model accurately reconstructs the input from latent space along with maintaining the generative task properties.

Applications of VAEs

Generative modeling. The core advantage of VAEs is their ability to generate new data samples that are similar to the training data but not identical to any specific instance. For example, in image synthesis, VAEs can create new images that resemble the training set but with variations, making them useful for tasks like creating new artwork, generating realistic faces, or producing new designs in fashion and architecture.
Anomaly detection. By learning the distribution of normal data, VAEs can identify deviations from this distribution as anomalies. This is particularly useful in applications like fraud detection, network security, and predictive maintenance.
Data imputation and denoising. One of VAEs' strong points is reconstructing data with missing or noisy parts. By sampling from the learned latent distribution, they are able to predict and fill in missing values or remove noise from corrupted data. This makes them valuable in applications such as medical imaging, where accurate data reconstruction is essential, or in restoring corrupted audio and visual data.
Semi-supervised learning. In semi-supervised learning scenarios, VAEs can improve classifier performance by using the latent space to capture underlying data structures, thereby enhancing the learning process with limited labeled data.
Latent space manipulation. VAEs provide a structured and continuous latent space that can be manipulated for various applications. For instance, in image editing, specific features (like lighting or facial expressions) can be adjusted by navigating the latent space. This feature is particularly useful in creative industries for modifying and enhancing images and videos.

Applications of AEs and VAEs in image generation and anomaly detection

Autoencoders (AEs) and Variational Autoencoders (VAEs) are neural network architectures widely used in image generation and anomaly detection. AEs focus on learning a compact data representation for reconstruction, while VAEs introduce a probabilistic approach to generate diverse and novel data.

Autoencoders (AEs)

Image generation applications

Vanilla autoencoders are not typically used for generating novel images because they lack a regularized latent space that would allow for meaningful sampling. However, more advanced autoencoder variants have generative capabilities:

Image denoising and enhancement: Denoising Autoencoders (DAEs) are trained by feeding them corrupted or noisy images and having them reconstruct the original, clean versions. This teaches the model to separate noise from essential image content, which can be used to improve image quality.
Image inpainting: Autoencoders can learn to fill in missing or corrupted parts of an image by understanding its underlying structure. By training on masked images, the model learns to reconstruct the complete image from the remaining pixels.
Image compression: The encoding process can be used for high-quality data compression. The compact latent representation can be stored or transmitted more efficiently, and the decoder can reconstruct the original image when needed.

Anomaly detection applications

The core principle of AE-based anomaly detection is that a model trained exclusively on normal data will have high reconstruction errors when presented with anomalous data.

Industrial defect detection: An AE can be trained on a dataset of images showing defect-free products, such as screws, pills, or textiles. During testing, the model's high reconstruction error on a product image indicates a potential flaw, like a scratch or a dent.
Video surveillance: In security applications, an AE can learn the normal patterns of motion and scene composition. Any unusual activity that deviates significantly from these patterns will produce a high reconstruction error, triggering an alert.
Fraud detection: For transactional data, an AE can be trained on normal, non-fraudulent transactions. Suspicious transactions will be poorly reconstructed by the model, allowing analysts to flag them based on a high reconstruction error threshold.

Variational Autoencoders (VAEs)

Image generation applications

Unlike standard autoencoders, VAEs learn the underlying probability distribution of the data, which enables the generation of new, realistic samples by sampling from the latent space.

Generating new, realistic images: By sampling a vector from the latent space and passing it through the decoder, VAEs can generate entirely new images that share the characteristics of the training data. This has applications in creating realistic faces, animals, or game assets.
Latent space interpolation: VAEs create a smooth, continuous latent space where similar images are clustered close together. By interpolating between two points in this space, a VAE can generate a smooth, meaningful transition from one image to another, creating morphing effects.
Conditional image generation: Conditional VAEs (CVAEs) allow for more controlled image synthesis. By providing the model with additional information, such as a class label, it can generate new images with specific attributes. For example, generating new images of a handwritten digit '7'.
Text-to-image generation: VAEs can be combined with other architectures like transformers to generate images from textual descriptions, as seen in models like DALL·E.

Anomaly detection applications

VAEs are highly effective for anomaly detection due to their probabilistic approach, which provides a more robust measure of "normality" than standard autoencoders.

Multimodal data analysis: VAEs can model the complex, intricate structure of multimodal data, which is useful in medical imaging to spot anomalies like tumors in CT scans or MRIs.
Novelty detection in latent space: Rather than just using reconstruction error, VAEs can detect anomalies by analyzing where a data point falls within the latent distribution. Data points that are far from the clusters of normal data in the latent space are likely to be anomalous.
Network intrusion detection: VAEs can be trained on normal network traffic data. Anomalous patterns, like unusual network activity, will result in poor reconstruction and a significant deviation in the latent space, flagging a potential intrusion.
Energy system monitoring: In energy storage power stations, a VAE can learn the normal operational data of battery clusters. Deviations in voltage, current, or temperature from the learned distribution can signal a fault.
Module 4
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) differ from regular neural networks in how they process information. While standard neural networks pass information in one direction i.e from input to output, RNNs feed information back into the network at each step.
Imagine reading a sentence and you try to predict the next word, you don’t rely only on the current word but also remember the words that came before. RNNs work similarly by “remembering” past information and passing the output from one step as input to the next i.e it considers all the earlier words to choose the most likely next word. This memory of previous steps helps the network understand context and make better predictions.

Key Components of RNNs

There are mainly two components of RNNs.

1. Recurrent Neurons

The fundamental processing unit in RNN is a Recurrent Unit. They hold a hidden state that maintains information about previous inputs in a sequence. Recurrent units can "remember" information from prior steps by feeding back their hidden state, allowing them to capture dependencies across time.

2. RNN Unfolding

RNN unfolding or unrolling is the process of expanding the recurrent structure over time steps. During unfolding each step of the sequence is represented as a separate layer in a series illustrating how information flows across each time step.

This unrolling enables backpropagation through time (BPTT) a learning process where errors are propagated across time steps to adjust the network’s weights enhancing the RNN’s ability to learn dependencies within sequential data.

Recurrent Neural Network Architecture

RNNs share similarities in input and output structures with other deep learning architectures but differ significantly in how information flows from input to output. Unlike traditional deep neural networks where each dense layer has distinct weight matrices. RNNs use shared weights across time steps, allowing them to remember information over sequences.

In RNNs the hidden state $H_{i}$ is calculated for every input $X_{i}$ to retain sequential dependencies.

How does RNN work?

At each time step RNNs process units with a fixed activation function. These units have an internal hidden state that acts as memory that retains information from previous time steps. This memory allows the network to store past knowledge and adapt based on new inputs.

Updating the Hidden State in RNNs

The current hidden state $h_{t}$ depends on the previous state $h_{t - 1}$ and the current input $x_{t}$ and is calculated using the following relations:

1. State Update:

$h_{t} = f (h_{t - 1}, x_{t})$

where:

$h_{t}$ is the current state
$h_{t - 1}$ is the previous state
$x_{t}$ is the input at the current time step

2. Activation Function Application:

$h_{t} = \tanh (W_{h h} \cdot h_{t - 1} + W_{x h} \cdot x_{t})$

Here, $W_{h h}$ is the weight matrix for the recurrent neuron and $W_{x h}$ is the weight matrix for the input neuron.

3. Output Calculation:

$y_{t} = W_{h y} \cdot h_{t}$

where $y_{t}$ is the output and $W_{h y}$ is the weight at the output layer

These parameters are updated using backpropagation. However, since RNN works on sequential data here we use an updated backpropagation which is known as backpropagation through time.

Backpropagation Through Time (BPTT) in RNNs

Since RNNs process sequential data Backpropagation Through Time (BPTT) is used to update the network's parameters. The loss function L(θ) depends on the final hidden state $h_{3}$ and each hidden state relies on preceding ones forming a sequential dependency chain:

$h_{3}$ depends on $depends on h_{2}, h_{2} depends on h_{1}, \dots, h_{1} depends on h_{0}$ .

In BPTT, gradients are backpropagated through each time step. This is essential for updating network parameters based on temporal dependencies.

Types Of Recurrent Neural Networks

There are four types of RNNs based on the number of inputs and outputs in the network:

1. One-to-One RNN

This is the simplest type of neural network architecture where there is a single input and a single output. It is used for straightforward classification tasks such as binary classification where no sequential data is involved.

2. One-to-Many RNN

In a One-to-Many RNN the network processes a single input to produce multiple outputs over time. This is useful in tasks where one input triggers a sequence of predictions (outputs). For example in image captioning a single image can be used as input to generate a sequence of words as a caption.

3. Many-to-One RNN

The Many-to-One RNN receives a sequence of inputs and generates a single output. This type is useful when the overall context of the input sequence is needed to make one prediction. In sentiment analysis the model receives a sequence of words (like a sentence) and produces a single output like positive, negative or neutral.

4. Many-to-Many RNN

The Many-to-Many RNN type processes a sequence of inputs and generates a sequence of outputs. In language translation task a sequence of words in one language is given as input and a corresponding sequence in another language is generated as output.

Variants of Recurrent Neural Networks (RNNs)

There are several variations of RNNs, each designed to address specific challenges or optimize for certain tasks:

1. Vanilla RNN

This simplest form of RNN consists of a single hidden layer where weights are shared across time steps. Vanilla RNNs are suitable for learning short-term dependencies but are limited by the vanishing gradient problem, which hampers long-sequence learning.

2. Bidirectional RNNs

Bidirectional RNNs process inputs in both forward and backward directions, capturing both past and future context for each time step. This architecture is ideal for tasks where the entire sequence is available, such as named entity recognition and question answering.

3. Long Short-Term Memory Networks (LSTMs)

Long Short-Term Memory Networks (LSTMs) introduce a memory mechanism to overcome the vanishing gradient problem. Each LSTM cell has three gates:

Input Gate: Controls how much new information should be added to the cell state.
Forget Gate: Decides what past information should be discarded.
Output Gate: Regulates what information should be output at the current step. This selective memory enables LSTMs to handle long-term dependencies, making them ideal for tasks where earlier context is critical.

4. Gated Recurrent Units (GRUs)

Gated Recurrent Units (GRUs) simplify LSTMs by combining the input and forget gates into a single update gate and streamlining the output mechanism. This design is computationally efficient, often performing similarly to LSTMs and is useful in tasks where simplicity and faster training are beneficial.

How RNN Differs from Feedforward Neural Networks?

Feedforward Neural Networks (FNNs) process data in one direction from input to output without retaining information from previous inputs. This makes them suitable for tasks with independent inputs like image classification. However FNNs struggle with sequential data since they lack memory.

Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow information from previous steps to be fed back into the network. This feedback enables RNNs to remember prior inputs making them ideal for tasks where context is important.

What is LSTM - Long Short Term Memory?

Long Short-Term Memory (LSTM) is an enhanced version of the Recurrent Neural Network (RNN). LSTMs can capture long-term dependencies in sequential data making them ideal for tasks like language translation, speech recognition and time series forecasting. Unlike traditional RNNs which use a single hidden state passed through time LSTMs introduce a memory cell that holds information over extended periods addressing the challenge of learning long-term dependencies.

Problem with Long-Term Dependencies in RNN

Recurrent Neural Networks (RNNs) are designed to handle sequential data by maintaining a hidden state that captures information from previous time steps. However they often face challenges in learning long-term dependencies where information from distant time steps becomes crucial for making accurate predictions for current state. This problem is known as the vanishing gradient or exploding gradient problem.

Vanishing Gradient: When training a model over time, the gradients which help the model learn can shrink as they pass through many steps. This makes it hard for the model to learn long-term patterns since earlier information becomes almost irrelevant.
Exploding Gradient: Sometimes gradients can grow too large causing instability. This makes it difficult for the model to learn properly as the updates to the model become erratic and unpredictable.

Both of these issues make it challenging for standard RNNs to effectively capture long-term dependencies in sequential data.

LSTM Architecture

LSTM architectures involves the memory cell which is controlled by three gates:

Input gate: Controls what information is added to the memory cell.
Forget gate: Determines what information is removed from the memory cell.
Output gate: Controls what information is output from the memory cell.

This allows LSTM networks to selectively retain or discard information as it flows through the network which allows them to learn long-term dependencies. The network has a hidden state which is like its short-term memory. This memory is updated using the current input, the previous hidden state and the current state of the memory cell.

Working of LSTM

LSTM architecture has a chain structure that contains four neural networks and different memory blocks called cells.

Information is retained by the cells and the memory manipulations are done by the gates. There are three gates -

1. Forget Gate

The information that is no longer useful in the cell state is removed with the forget gate. Two inputs $x_{t}$ (input at the particular time) and $h_{t - 1}$ (previous cell output) are fed to the gate and multiplied with weight matrices followed by the addition of bias. The resultant is passed through sigmoid activation function which gives output in range of [0,1]. If for a particular cell state the output is 0 or near to 0, the piece of information is forgotten and for output of 1 or near to 1, the information is retained for future use.
The equation for the forget gate is:
$f_{t} = σ (W_{f} \cdot [h_{t - 1}, x_{t}] + b_{f})$
Where:
$W_{f}$ represents the weight matrix associated with the forget gate.
$[h_{t} - 1, x_{t}]$ denotes the concatenation of the current input and the previous hidden state.
$b_{f}$ is the bias with the forget gate.
$σ$ is the sigmoid activation function.

2. Input gate

The addition of useful information to the cell state is done by the input gate. First the information is regulated using the sigmoid function and filter the values to be remembered similar to the forget gate using inputs $h_{t - 1}$ and $x_{t}$ . Then, a vector is created using tanh function that gives an output from -1 to +1 which contains all the possible values from $h_{t - 1}$ and $x_{t}$ . At last the values of the vector and the regulated values are multiplied to obtain the useful information. The equation for the input gate is:
$i_{t} = σ (W_{i} \cdot [h_{t - 1}, x_{t}] + b_{i})$
${\hat{C}}_{t} = \tanh (W_{c} \cdot [h_{t - 1}, x_{t}] + b_{c})$
We multiply the previous state by $f_{t}$ effectively filtering out the information we had decided to ignore earlier. Then we add $i_{t} ⊙ C_{t}$ which represents the new candidate values scaled by how much we decided to update each state value.
$C_{t} = f_{t} ⊙ C_{t - 1} + i_{t} ⊙ {\hat{C}}_{t}$
where
$⊙$ denotes element-wise multiplication
tanh is activation function

3. Output gate

The output gate is responsible for deciding what part of the current cell state should be sent as the hidden state (output) for this time step.First, the gate uses a sigmoid function to determine which information from the current cell state will be output. This is done using the previous hidden state $h_{t - 1}$ and the current input $x_{t}$ :
$o_{t} = σ (W_{o} \cdot [h_{t - 1}, x_{t}] + b_{o})$
Next, the current cell state $C_{t}$ is passed through a tanh activation to scale its values between $- 1$ and $+ 1$ . Finally, this transformed cell state is multiplied element-wise with $o_{t}$ to produce the hidden state $h_{t}$ :
$h_{t} = o_{t} ⊙ \tanh (C_{t})$
Here:
$o_{t}$ is the output gate activation.
$C_{t}$ is the current cell state.
$⊙$ represents element-wise multiplication.
$σ$ is the sigmoid activation function.
This hidden state $h_{t}$ is then passed to the next time step and can also be used for generating the output of the network.

Applications

Some of the famous applications of LSTM includes:

Language Modeling: Used in tasks like language modeling, machine translation and text summarization. These networks learn the dependencies between words in a sentence to generate coherent and grammatically correct sentences.
Speech Recognition: Used in transcribing speech to text and recognizing spoken commands. By learning speech patterns they can match spoken words to corresponding text.
Time Series Forecasting: Used for predicting stock prices, weather and energy consumption. They learn patterns in time series data to predict future events.
Anomaly Detection: Used for detecting fraud or network intrusions. These networks can identify patterns in data that deviate drastically and flag them as potential anomalies.
Recommender Systems: In recommendation tasks like suggesting movies, music and books. They learn user behavior patterns to provide personalized suggestions.
Video Analysis: Applied in tasks such as object detection, activity recognition and action classification. When combined with Convolutional Neural Networks (CNNs) they help analyze video data and extract useful information.

Attention mechanism for LSTM used in a sequence-to-sequence task

The attention mechanism is a technique introduced in deep learning, particularly for sequence-to-sequence tasks, to allow the model to focus on different parts of the input sequence when producing an output. It helps to address the limitation of fixed-size internal memory (like the final state of an encoder in sequence-to-sequence models), by dynamically weighing the importance of different parts of the input for each step of the output.

Implementing attention mechanisms for LSTM

Define the LSTM: Let’s use an LSTM encoder-decoder structure. The encoder LSTM processes the input sequence and produces a sequence of hidden states. Each hidden state hₛ is typically the concatenation of the LSTM’s forward and backward cell states at time s.

Compute the Attention Scores: For each hidden state hₜ of the decoder LSTM, compute attention scores with all the encoder hidden states. One commonly used method is the dot-product scoring function:

Alternatively, other scoring functions like multiplicative or additive attention can be used. The multiplicative approach, for example, employs a weight matrix W:

Compute the Attention Weights: Normalize the scores to produce a probability distribution using the softmax function:

Here, aₜₛ represents the attention weight for encoder hidden state hₛ when decoding at time step t.

Compute the Context Vector: Calculate the context vector for the decoder time step t as a weighted sum of the encoder hidden states:

Concatenate or Combine the Context Vector: You can then combine the context vector cₜ with the decoder’s hidden state hₜ. Typically, this involves concatenating cₜ and hₜ and feeding this combination to a dense layer to generate either the input for the next LSTM layer or the output prediction for that time step.

Training: Train the model by backpropagating through both the LSTM and the attention mechanism. The goal is for the model to learn where to focus its attention in the input sequence to produce accurate outputs.

Decoding: During inference, utilize strategies like beam search, greedy decoding, etc., to produce the output sequence. The attention mechanism will provide a dynamic focus on different parts of the input sequence at each decoding step.

Architecture and Working of Transformers

Transformers are a type of deep learning model that utilizes self-attention mechanisms to process and generate sequences of data efficiently. They capture long-range dependencies and contextual relationships making them highly effective for tasks like language modeling, machine translation and text generation. Transformer model are built on encoder-decoder architecture where both the encoder and decoder are composed of a series of layers that utilize self-attention mechanisms and feed-forward neural networks. This architecture enables the model to process input data in parallel making it highly efficient and effective for tasks involving sequential data.

The encoder processes input sequences and creates meaningful representations.
The decoder generates outputs based on encoder representations and previously predicted tokens.

The encoder and decoder work together to transform the input into the desired output such as translating a sentence from one language to another or generating a response to a query.

1. Encoder

The primary function of the encoder is to create a high-dimensional representation of the input sequence that the decoder can use to generate the output. Encoder consists multiple layers and each layer is composed of two main sub-layers:

Self-Attention Mechanism: This sub-layer allows the encoder to weigh the importance of different parts of the input sequence differently to capture dependencies regardless of their distance within the sequence.
Feed-Forward Neural Network: This sub-layer consists of two linear transformations with a ReLU activation in between. It processes the output of the self-attention mechanism to generate a refined representation.

Layer normalization and residual connections are used around each of these sub-layers to ensure stability and improve convergence during training.

2. Decoder

Decoder in transformer also consists of multiple identical layers. Its primary function is to generate the output sequence based on the representations provided by the encoder and the previously generated tokens of the output.

Each decoder layer consists of three main sub-layers:

Masked Self-Attention Mechanism: Similar to the encoder's self-attention mechanism but its main purpose is to prevent attending to future tokens to maintain the autoregressive property (no cheating during generation).
Encoder-Decoder Attention Mechanism: This sub-layer allows the decoder to focus on relevant parts of the encoder's output representation. This allows the decoder to focus on relevant parts of the input, essential for tasks like translation.
Feed-Forward Neural Network: This sub-layer processes the combined output of the masked self-attention and encoder-decoder attention mechanisms.

In-Depth Analysis of Transformer Components

1. Multi-Head Self-Attention Mechanism

Multi-head attention extends the self-attention mechanism by applying it multiple times in parallel with each "head" learning different aspects of the input data. This allows the model to capture a richer set of relationships within the input sequence. The outputs of these heads are then concatenated and linearly transformed to produce the final output. The benefits include:

Improved ability to capture complex patterns in the data.
Enhanced model capacity without significant increase in computational complexity.

Mathematical Formulation:

Given an input sequence X the self-attention mechanism computes three matrices: queries Q, keys K and values V by multiplying X with learned weight matrices $W_{Q}$ , $W_{K}$ and $W_{V}$ .

$Q = X W_{Q}, K = X W_{K}, V = X W_{V}$

The attention scores are computed as:

$Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}})$

For multi-head attention, we apply self-attention multiple times:

$MultiHead (Q, K, V) = Concat ({head}_{1}, \dots, {head}_{h}) W_{O}$

where Where each head is computed as:

${head}_{i} = Attention (Q W_{Q}^{i}, K W_{K}^{i}, V W_{V}^{i})$

Where :

$W_{Q} i$ , $W_{K} i$ , $W_{V} i$ are learned projection matrices for the i-th head.

2. Position-wise Feed-Forward Networks

Each position in the sequence is independently processed using a feed-forward network:

$FFN (x) = \max (0, x W_{1} + b_{1}) W_{2} + b_{2}$

This helps the transformer learn complex representations of input features.

3. Positional Encoding

Transformers lack inherent information about the order of the input sequence due to their parallel processing nature. Positional encoding is introduced to provide the model with information about the position of each token in the sequence.
Positional encodings are added to the input embeddings to give the model a sense of token order. These encodings can be either learned or fixed.

4. Layer Normalization and Residual Connections

Layer Normalization: stabilizes training by normalizing inputs.
Residual Connections: help avoid vanishing gradients by adding input back to outputs

$Output = LayerNorm (x + SubLayer (x))$

This addition helps in preserving the original input information which is crucial for learning complex representations.

How Transformers Work

1. Input Representation

The first step in processing input data involves converting raw text into a format that the transformer model can understand. This involves tokenization and embedding.

Tokenization: The input text is split into smaller units called tokens, which can be words, sub words or characters. Tokenization ensures that the text is broken down into manageable pieces.
Embedding: Each token is then converted into a fixed-size vector using an embedding layer. This layer maps each token to a dense vector representation that captures its semantic meaning.
Positional encodings are added to these embeddings to provide information about the token positions within the sequence.

2. Encoder Process in Transformers

Input Embedding: The input sequence is tokenized and converted into embeddings with positional encodings added.
Self-Attention Mechanism: Each token in the input sequence attends to every other token to capture dependencies and contextual information.
Feed-Forward Network: The output from the self-attention mechanism is passed through a position-wise feed-forward network.
Layer Normalization and Residual Connections: Layer normalization and residual connections are applied.

3. Decoder Process

Input Embedding and Positional Encoding: The partially generated output sequence is tokenized and embedded with positional encodings added.
Masked Self-Attention Mechanism: The decoder uses masked self-attention to prevent attending to future tokens ensuring that the model generates the sequence step-by-step.
Encoder-Decoder Attention Mechanism: The decoder attends to the encoder's output allowing it to focus on relevant parts of the input sequence.
Feed-Forward Network: Similar to the encoder the output from the attention mechanisms is passed through a position-wise feed-forward network.
Layer Normalization and Residual Connections: Similar to the encoder Layer normalization and residual connections are applied.

4. Training and Inference

Transformers are trained with teacher forcing, where the correct previous tokens are provided during training to predict the next token. Their encoder-decoder architecture combined with multi-head attention and feed-forward networks enables highly effective handling of sequential data.
Transformers have transformed deep learning by using self-attention mechanisms to efficiently process and generate sequences capturing long-range dependencies and contextual relationships. Their encoder-decoder architecture combined with multi-head attention and feed-forward networks enables highly effective handling of sequential data.

Ethical considerations and societal impacts of Generative AI

1. Distribution of harmful content

Generative AI systems can create content automatically based on text prompts by humans. "These systems can generate enormous productivity improvements, but they can also be used for harm, either intentional or unintentional,". An AI-generated email sent on behalf of the company, for example, could inadvertently contain offensive language or issue harmful guidance to employees. GenAI should be used to augment but not replace humans or processes, Greenstein advised, to ensure content meets the company's ethical expectations and supports its brand values.

2. Copyright and legal exposure

Popular generative AI tools are trained on massive image and text databases from multiple sources, including the internet. When these tools create images or generate lines of code, the data's source could be unknown, which might be problematic for a bank handling financial transactions or a pharmaceutical company relying on a formula for a complex molecule in a drug. Reputational and financial risks could also be massive if one company's product is based on another company's intellectual property.

Generative AI large language models (LLMs) are trained on data sets that might include personally identifiable information (PII) about individuals. This data can sometimes be elicited with a simple text prompt.

4. Sensitive information disclosure

GenAI is democratizing AI capabilities and making them more accessible. This combination of democratization and accessibility, could potentially lead to a medical researcher inadvertently disclosing sensitive patient information or a consumer brand unwittingly exposing its product strategy to a third party. The consequences of unintended incidents like these could irrevocably breach patient or customer trust and carry legal ramifications.

5. Amplification of existing bias

Generative AI can potentially amplify existing bias. For example, there can be bias in the data used for training LLMs, which can be outside the control of companies that use these language models for specific applications. It's critically important for companies working on AI to have diverse leaders and subject matter experts to help identify bias in data and models, Greenstein said.

6. Workforce roles and morale

AI is being trained to do more of the daily tasks that knowledge workers do, including writing, coding, content creation, summarization and analysis, Greenstein said. Although worker displacement and replacement have been ongoing since the first AI and automation tools were deployed, the pace has accelerated as a result of the innovations in generative AI technologies.

Ethical responses have included investments in preparing certain parts of the workforce for the new roles created by generative AI applications. Businesses, for example, will need to help employees develop generative AI skills such as prompt engineering. "The truly existential ethical challenge for adoption of generative AI is its impact on organizational design, work and ultimately on individual workers," said Nick Kramer, vice president of applied solutions at consultancy SSA & Company. "This will not only minimize the negative impacts, but it will also prepare the companies for growth."

7. Data provenance

GenAI systems consume tremendous volumes of data that could be inadequately governed, of questionable origin, used without consent or biased. Additional levels of inaccuracy could be amplified by social influencers or the AI systems themselves.

"The accuracy of a generative AI system depends on the corpus of data it uses and its provenance,". "ChatGPT-4 is mining the internet for data, and a lot of it is truly garbage, presenting a basic accuracy problem on answers to questions to which we don't know the answer." FICO, has been using generative AI for more than a decade to simulate edge cases in training fraud detection algorithms. The generated data is always labeled as synthetic data, so Zoldi's team knows where the data is allowed to be used. "We treat it as walled-off data for the purposes of test and simulation only," he said. "Synthetic data produced by generative AI does not inform the model going forward in the future.

8. Lack of explainability and interpretability

Many generative AI systems group facts together probabilistically, going back to the way AI has learned to associate data elements with one another. But these details aren't always revealed when using applications like ChatGPT. Consequently, data trustworthiness is called into question.

When interrogating GenAI, analysts expect to arrive at a causal explanation for outcomes. But machine learning models and generative AI search for correlations, not causality. "That's where we humans need to insist on model interpretability -- the reason why the model gave the answer it did,". "And truly understand if an answer is a plausible explanation versus taking the outcome at face value."

Until that level of trustworthiness can be achieved, GenAI systems should not be relied upon to provide answers that could significantly affect lives and livelihoods.

9. AI hallucinations

Generative AI techniques all use various combinations of algorithms, including autoregressive models, autoencoders and other machine learning algorithms, to distill patterns and generate content. As good as these models are at identifying new patterns, they sometimes struggle with teasing out important distinctions relevant to human use cases.

This can include creating authoritative-sounding but inaccurate prose or producing pictures with realistic-looking imagery but misshapen representations of humans that contain extra fingers or eyes. With language models, these errors can show up as chatbots inaccurately representing corporate policies, such as in the case of an Air Canada chatbot that misrepresented corporate policies regarding bereavement benefits. Lawyers using these tools have also been fined for filing briefs that cited nonexistent court cases.

Newer techniques like retrieval augmented generation and agentic AI frameworks can help reduce these issues. However, it's important to keep humans in the loop to verify the accuracy of generative AI information to avoid customer backlash, sanctions or other problems.

10. Carbon footprint

Many AI vendors argue that bigger AI models can deliver better results. This is partly true, but it can often involve considerably more data center resources, either for training new AI models or running AI inference processes in production. The issue is hardly clear-cut. As some argue, improving an AI model that has the potential to reduce the carbon footprint of an employee traveling to work or the efficiency of a product could be a good thing. Conversely, developing that model could also exacerbate global warming or other environmental problems

11. Political impact

The political impact of GenAI technologies is a fraught topic. On the one hand, better GenAI tools have the potential to make the world a better place. At the same time, they could also enable various political actors -- voters, politicians, authoritarians -- to make communities worse. One example of generative AI's negative impact on politics can be found in social media platforms that algorithmically promote or create divisive comments as a strategy for increasing engagement (and profits) for their owners over comments that find common ground but might not have the same click-through and sharing numbers.

Comments

Python Full Stack DeveloperAugust 13, 2025 at 12:56 PM
It’s very useful information. For more details, visit Top Generative AI Online Training in Ameerpet . Thank you.
John data analystSeptember 9, 2025 at 4:05 PM
Great read! I agree your points about data and analytics service .
Elearning CompanyOctober 23, 2025 at 6:49 PM
Great article! I completely agree with your points about effective online training. If anyone is looking for a professional eLearning company to create interactive courses, gamified content, and microlearning programs, I highly recommend Mentop Learning. Their solutions really help boost employee engagement and skills.
WickOctober 30, 2025 at 3:32 PM
Impressive insights! The rise of Generative AI Development Services is transforming industries across the globe. With advanced Generative AI Solutions and Generative AI Services, businesses can now innovate faster. A reputed Generative AI Company or Generative AI Services Company helps enterprises integrate AI Analytics Services and even Generative AI for Healthcare. Partnering with a skilled Generative AI Development Company ensures access to next-gen Gen AI Solutions and expert teams when you Hire Generative AI Developers for scalable growth.

Generative AI using python