Direct and Semidirect Products of Groups

Direct and semidirect products are techniques for constructing larger, more complex groups from simpler, smaller ones. In this post we shall understand how direct and semidirect products of groups are performed and how the two constructions are related.

Read More

Normal Subgroups and Quotient Groups

Groups, just like composite numbers, can be composed from other simpler groups. These simpler groups are easier to study and reveal a wealth of information about the structure of the original group.

Normal subgroups are a mathematical tool that allows us to partition a group into cosets, which together form a smaller group called the quotient group. In this post, we shall delve into normal subgroups and quotient groups in detail.

Read More

Basics of Groups and Group Representations

The natural world around us is rife with examples of transformations applied to objects that preserve some underlying property of the object. For example, consider the rigid motions in 3D space that map a particular molecule onto itself. A simpler example is the rotation of a 2D square (in the \(XY\) plane) about the \(Z\) axis: a rotation by 0, 90, 180 or 270 degrees has no impact on the square's shape or orientation. Such transformations are called symmetries. Two symmetries can be combined to form another symmetry, and every symmetry has an opposite or inverse symmetry. For instance, the 90 and 180 degree rotations can be combined to produce the 270 degree rotation, which also leaves the square unchanged, and a rotation counterclockwise by 90 degrees can be undone by a rotation clockwise by 90 degrees. A group is a collection of transformations that preserve some underlying structure of the set to which they are applied.
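
To make this concrete, here is a minimal sketch (plain Python, written for this summary rather than taken from the post) of the rotation symmetries of a square, showing that composing two symmetries yields another symmetry and that every symmetry has an inverse:

```python
# Rotation symmetries of a square: 0, 90, 180, 270 degrees about the Z axis.
# Composition is addition of angles modulo 360; the inverse undoes the rotation.
ROTATIONS = [0, 90, 180, 270]

def compose(r1, r2):
    """Apply rotation r1 followed by rotation r2."""
    return (r1 + r2) % 360

def inverse(r):
    """The rotation that undoes r."""
    return (360 - r) % 360

# Combining the 90 and 180 degree rotations gives the 270 degree rotation.
assert compose(90, 180) == 270
# A 90 degree counterclockwise rotation is undone by its inverse (270, i.e. 90 clockwise).
assert compose(90, inverse(90)) == 0
# Closure: composing any two symmetries yields another symmetry in the set.
assert all(compose(a, b) in ROTATIONS for a in ROTATIONS for b in ROTATIONS)
```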

Read More

Expectation Maximization

In this post we shall explore expectation maximization, a mathematical tool used in many generative AI algorithms, in detail. Expectation maximization is a means of computing the maximum likelihood estimate in the presence of latent variables.

Read More

Information Theory and its Applications to Machine Learning

Information theory is a subfield of mathematics that deals with quantifying the information in events, random variables, and distributions, and with encoding it in a fashion that is robust to errors (channel coding and error correction). The field was proposed and developed by Claude Shannon, while working at Bell Labs, to quantify information for communication. This post provides an introduction to the basic concepts of information theory and their application to machine learning.

Read More

Noise Contrastive Estimation and Negative Sampling

Noise Contrastive Estimation (NCE) is a parameter estimation technique used to model a data distribution by learning a classifier that distinguishes samples from the true data distribution from samples drawn from a noise distribution. It avoids computing the partition function and its derivatives at every training step, which is a very computationally demanding operation in many other techniques used to model data distributions.
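
As a rough sketch of the idea (NumPy; the one-dimensional toy model, the noise distribution and all variable names here are illustrative assumptions, not taken from the post), NCE fits an unnormalized model by logistic-regression-style classification between data samples and noise samples, treating the log normalizer as just another learnable parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: samples from a Gaussian with unknown mean 2.0 and unit variance.
data = rng.normal(loc=2.0, scale=1.0, size=1000)

# Noise distribution: a wide zero-mean Gaussian that covers the data.
NOISE_SCALE = 4.0
def log_p_noise(x):
    return -0.5 * (x / NOISE_SCALE) ** 2 - np.log(NOISE_SCALE * np.sqrt(2 * np.pi))

# Unnormalized model: log p_model(x) = -(x - mu)^2 / 2 + c, where c stands in for
# the log normalizer and is learned like any other parameter.
def log_p_model(x, mu, c):
    return -0.5 * (x - mu) ** 2 + c

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

mu, c = 0.0, 0.0   # parameters to learn
k, lr = 1, 0.05    # noise samples per data sample, learning rate

for _ in range(2000):
    noise = rng.normal(scale=NOISE_SCALE, size=len(data) * k)
    # Classifier logit: log p_model(x) - log(k * p_noise(x)); label 1 = data, 0 = noise.
    s_data = log_p_model(data, mu, c) - (np.log(k) + log_p_noise(data))
    s_noise = log_p_model(noise, mu, c) - (np.log(k) + log_p_noise(noise))
    # Gradient ascent on the NCE log-likelihood w.r.t. mu and c.
    w_data = 1.0 - sigmoid(s_data)   # weights on the data terms
    w_noise = sigmoid(s_noise)       # weights on the noise terms
    mu += lr * (np.mean(w_data * (data - mu)) - k * np.mean(w_noise * (noise - mu)))
    c += lr * (np.mean(w_data) - k * np.mean(w_noise))

print(round(mu, 2))  # moves toward the true mean 2.0, with no partition function ever computed
```

Note that the normalizing constant is never computed explicitly; the parameter `c` simply drifts toward the correct log normalizer as a side effect of the classification objective.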

Read More

Eigenvector Centrality

Eigenvector centrality is one of the most fundamental mechanisms used to assign importance to a node in a graph. While degree centrality assigns importance to a node based only on the number of its neighboring nodes, eigenvector centrality assigns importance based not only on the number of a node's neighbors but also on their importance, i.e. a node's importance is proportional to the sum of the importances of its neighbors.
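
A minimal sketch of this definition (NumPy, on a toy adjacency matrix chosen purely for illustration): power iteration repeatedly replaces each node's score with the sum of its neighbors' scores and renormalizes, converging to the eigenvector of the largest eigenvalue:

```python
import numpy as np

# Adjacency matrix of a small undirected graph (node 0 is connected to everyone).
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# Power iteration: x <- A x / ||A x||. Each update sets a node's score to the
# sum of its neighbors' scores, then rescales, which is exactly the
# "importance proportional to the sum of neighbors' importances" recursion.
x = np.ones(A.shape[0])
for _ in range(100):
    x = A @ x
    x = x / np.linalg.norm(x)

print(np.round(x, 3))  # node 0 gets the highest score, node 3 the lowest
```

The same scores could be read off the leading eigenvector from `numpy.linalg.eig`, but power iteration makes the recursive definition explicit.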

Read More

Introduction to Lagrangian Duality

As we have previously discussed in this article, optimization techniques are used in machine learning to find the decision function in the hypothesis set that is closest to the empirical risk minimizer for the problem. The resulting optimization problem may be constrained or unconstrained. In this article, we will focus on constrained optimization problems and how Lagrangian duality can be used to solve them.

A general constrained optimization problem can be defined as:

\[
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{s.t.} \quad & h_i(x) \leq 0, \quad i = 1, 2, \ldots, m \\
& e_j(x) = 0, \quad j = 1, 2, \ldots, p
\end{aligned}
\]
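
As a preview of where the duality comes from (the standard formulation, stated here in the notation above), the Lagrangian attaches a multiplier to each constraint:
\[
L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i h_i(x) + \sum_{j=1}^{p} \nu_j e_j(x), \qquad \lambda_i \geq 0,
\]
and the dual function \(g(\lambda, \nu) = \inf_x L(x, \lambda, \nu)\) provides a lower bound on the optimal value of the original problem.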

Read More

Introduction to Convex Programming

In the previous article, we discussed that we cannot evaluate all possible decision functions for a machine learning problem due to time and processing constraints. Hence, we restrict the set of candidate functions to a set called the hypothesis space. The hypothesis space typically includes functions that are easy to work with and optimize. One such class of functions is the class of convex functions, and the task of finding the optimal value of such a function is termed a convex programming problem. More formally,
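a function \(f\) is convex if its domain is a convex set and, for all \(x, y\) in the domain and all \(\theta \in [0, 1]\),
\[
f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y),
\]
i.e. the graph of the function never lies above the chord joining any two of its points (the standard textbook definition, stated here for completeness).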

Read More

Introduction to Statistical Learning Theory

Supervised machine learning techniques are used to solve problems that cannot be framed as closed-form equations. These techniques aim to find a decision function that best fits empirically collected data points, called training data, consisting of inputs and their corresponding outputs. The process involves:

  • Choosing a set of candidate functions from which the decision function will be selected.
  • Evaluating the candidate functions to choose the one that best fits the training data and also generalizes well when evaluated against test data drawn i.i.d. from the same distribution as the training data.

Statistical learning theory helps us determine the size of the set of candidate decision functions to be evaluated, and the benchmarks to be used to evaluate them, for a given supervised machine learning problem. It also helps us understand the role played by optimization techniques in this process and how to choose the most appropriate ones for a given problem.

Read More

Forward and Back Propagation in Computational Graphs

A computational graph is defined as a directed graph where the nodes correspond to mathematical operations. It is used as a way of expressing and evaluating a mathematical expression. Each node of a computational graph has one or two incoming edges carrying the inputs and one outgoing edge representing the result of the operation performed by the node on the inputs. There are two basic operations performed on a computational graph:
Forward Propagation: Forward propagation involves computing the values of all nodes in the graph to arrive at the final value of the mathematical expression represented by the graph.
Back Propagation: Back propagation involves calculating the partial derivatives of the graph's objective function w.r.t. the inputs of each node, using the partial derivatives of the objective function w.r.t. that node's output. These partial derivatives explain how the objective function of the graph varies w.r.t. its inputs.
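
A minimal sketch of both operations (plain Python, with a tiny hand-rolled graph written for this summary rather than any particular framework) on the expression \(f = (x + y) \cdot x\):

```python
# Computational graph for f = (x + y) * x, built from two nodes:
#   s = x + y      (addition node)
#   f = s * x      (multiplication node)
x, y = 3.0, 4.0

# Forward propagation: evaluate each node from the inputs to the final output.
s = x + y          # 7.0
f = s * x          # 21.0

# Back propagation: start from df/df = 1 and push partial derivatives
# backwards through each node using the chain rule.
df_df = 1.0
# Multiplication node f = s * x: local derivatives are x (w.r.t. s) and s (w.r.t. x).
df_ds = df_df * x             # 3.0
df_dx_from_mul = df_df * s    # 7.0
# Addition node s = x + y: local derivatives are 1 for both inputs.
df_dx_from_add = df_ds * 1.0  # 3.0
df_dy = df_ds * 1.0           # 3.0
# x feeds two nodes, so the gradients arriving at it are summed.
df_dx = df_dx_from_mul + df_dx_from_add  # 10.0

print(f, df_dx, df_dy)  # 21.0 10.0 3.0, matching df/dx = 2x + y and df/dy = x
```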

Read More

Introduction to Spectral Graph Theory

In mathematics, spectral graph theory is the study of the properties of a graph in relation to the characteristic polynomial, eigenvalues and eigenvectors of matrices associated with the graph, such as its adjacency matrix or Laplacian matrix. The spectrum of a graph is the set of eigenvalues of such a matrix, ordered by magnitude, with much of the structural information carried by the corresponding eigenvectors.
Whether to express a graph as an adjacency matrix or a Laplacian matrix depends on the information that needs to be extracted from the graph. If the information required is the number of walks of a given length \(n\) between one vertex and another, we use the adjacency matrix. If the motivation is to perform graph partitioning, the Laplacian matrix is used. In this article we shall focus on undirected graphs represented as both adjacency and Laplacian matrices for these purposes.
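
A small illustration of both uses (NumPy; the toy graph here is an assumption made for this sketch): powers of the adjacency matrix count walks, while the eigenvector of the second-smallest Laplacian eigenvalue (the Fiedler vector) suggests a two-way partition:

```python
import numpy as np

# Undirected graph with two loosely connected clusters {0,1,2} and {3,4,5}.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# Walk counting: entry (i, j) of A^n is the number of walks of length n from i to j.
walks3 = np.linalg.matrix_power(A, 3)
print(int(walks3[0, 1]), "walks of length 3 from node 0 to node 1")

# Graph partitioning: L = D - A; the eigenvector of the second-smallest
# eigenvalue (the Fiedler vector) splits the graph by the sign of its entries.
D = np.diag(A.sum(axis=1))
L = D - A
eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
fiedler = eigvecs[:, 1]
print(np.sign(fiedler))                # one sign for {0,1,2}, the other for {3,4,5}
```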

Read More