Linear Discriminant Analysis – from Theory to Code
Linear Discriminant Analysis (LDA) is an important tool for both Classification and Dimensionality Reduction. Most textbooks cover the topic only in general terms, so in this Linear Discriminant Analysis - from Theory to Code tutorial we will work through the mathematical derivations and then implement a simple LDA in Python. By the end of the post you should feel confident about LDA.
Prerequisite
You need to understand basic Linear Algebra in order to follow this tutorial. Specifically, you should be familiar with vectors, matrices and eigendecomposition.
Objective
Linear Discriminant Analysis can be used for both Classification and Dimensionality Reduction. The basic idea is to find a vector \(w\) which maximizes the separation between the target classes after projecting the data onto \(w\). Refer to the diagram below for a better idea: the first plot shows a non-optimal projection of the data points, while the second plot shows an optimal projection so that the classes are well separated.
LDA is a supervised algorithm (unlike PCA), hence we need the target class labels.
Credit: The above picture has been taken from “Pattern Recognition and Machine Learning” by Christopher Bishop.
Orthogonal Projection
As per the objective, we need to project the input data onto the vector \(w\). Assume for now that we already know \(w\) (I will show how to find the optimal \(w\) later). Let's derive the equation for orthogonally projecting an input vector \(x\) onto \(w\).
In the diagram below, the vector \(p\) is called the orthogonal projection of \(x\) onto \(w\). The vector \(d\) runs from \(p\) to \(x\) and is perpendicular to \(w\); it is also known as the residual or error vector.
Since \(p\) is parallel to \(w\), we can write it as \(p = cw\) for some scalar \(c\).
We can rewrite \(d\) as \(d = x - p = x - cw\).
We know that if two vectors are orthogonal to each other, then the dot product between them is 0. In this case \(p\) and \(d\) are orthogonal to each other, hence we can write,
\[\begin{align} p^Td & = 0 \\ (cw)^T(x-cw) & = 0 \\ cw^Tx - c^2w^Tw & = 0 \\ cw^Tw & = w^Tx \\ c & = \frac{w^Tx}{w^Tw} \end{align}\]
We can now rewrite the expression for \(p\) as,
\[\begin{align} p & = \left ( \frac{w^Tx}{w^Tw} \right ) w \end{align}\]
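To make the formula concrete, here is a minimal NumPy sketch of the projection, using made-up example vectors (any input vector \(x\) and non-zero direction \(w\) would do):

import numpy as np

# Made-up example vectors: an input point x and a direction w
x = np.array([2.0, 1.0])
w = np.array([1.0, 1.0])

# Scalar coefficient c = (w^T x) / (w^T w)
c = np.dot(w, x) / np.dot(w, w)

# Orthogonal projection p = c * w, and the residual d = x - p
p = c * w
d = x - p

print(p)               # [1.5 1.5]
print(np.dot(p, d))    # 0.0 -> p and d are orthogonal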
Optimal Linear Discriminant
First we will develop the general concept of building an optimal linear discriminant, then turn to the objective function.
Assume \(w\) to be a unit vector, which means \(w^Tw = 1\).
We can write the orthogonal projection \(\hat{x_i}\) of any d-dimensional input vector \(x_i\) onto the vector \(w\) as,
\[\begin{align} \hat{x_i} & = \left ( \frac{w^Tx_i}{w^Tw} \right ) w \\ & = \left ( w^Tx_i \right ) w \\ & = a_i w \end{align}\]
Above, \(a_i\) is the offset/coordinate of \(\hat{x_i}\) along the line \(w\). A set of these scalar values \(\left \{ a_1, a_2 ... a_n \right \}\) represents the mapping \(\mathbf{R}^d \rightarrow \mathbf{R}\). This means that if we know the value of \(w\), we can reduce the dimension of the data from d dimensions to 1 dimension.
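As a quick illustration, here is a small sketch that maps a hypothetical 2-D data matrix to a single coordinate per sample, assuming \(w\) has been normalized to unit length:

import numpy as np

# Hypothetical 2-D data matrix X (one sample per row) and a direction w
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])
w = np.array([1.0, 1.0])
w = w / np.linalg.norm(w)   # normalize so that a_i = w^T x_i

# One scalar coordinate a_i per sample: the data is mapped from R^2 to R
a = X.dot(w)
print(a.shape)              # (3,)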
Dataset Assumption
Just for simplicity, assume our dataset has only 2 target classes and 2 input features. This will help us visualize the results effectively. So we can say,
\[\begin{align} x_i \in \mathbf{R}^d \\ \text{where } d = 2 \end{align}\]
Also, we can define \(D_i\) as follows, \(\begin{align} D_i = \left \{ x_j|y_j = c_i \right \} \end{align}\)
Here \(y_j\) is the target label and \(c_i\) is the target class. \(D_i\) is the subset of the dataset labelled with class \(c_i\).
Mean of the Projected Points
Since we already know the target class labels, we can easily calculate the mean of the projected data points for each class separately as,
\[\begin{align} m_1 & = \frac{1}{n_1} \sum_{x_i \in D_1} a_i \\ & = \frac{1}{n_1} \sum_{x_i \in D_1} w^Tx_i \\ & = w^T \left ( \frac{1}{n_1} \sum_{x_i \in D_1} x_i \right ) \\ & = w^T \mu_1 \end{align}\]
Here, \(\mu_1\) is the mean of all input data points in \(D_1\) and \(n_1\) is the number of points in \(D_1\).
Very similarly, we can define \(m_2\) as,
\[\begin{align} m_2 & = w^T \mu_2 \end{align}\]
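To check this equivalence numerically, here is a small sketch with a made-up two-class dataset (labels 0 and 1) and a unit vector \(w\); projecting the class mean gives the same value as averaging the projected coordinates:

import numpy as np

# Hypothetical 2-class data: X is (n x d), y holds labels 0 and 1, w is a unit vector
X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 7.0], [7.0, 8.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0]) / np.sqrt(2.0)

# Class means in the input space
mu1 = X[y == 0].mean(axis=0)
mu2 = X[y == 1].mean(axis=0)

# m_i = w^T mu_i equals the mean of the projected coordinates of class i
m1 = w.dot(mu1)
m2 = w.dot(mu2)
print(np.isclose(m1, X[y == 0].dot(w).mean()))   # True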
Two Important Ideas
In order to maximize the separation between the classes in the projected space, we can think of maximizing the difference between the projected means, \(|m_1-m_2|\).
However, this condition alone is not sufficient to ensure that the projected data points of each class are well separated. We also need to think about the variance of the projected data within each class: a large variance will lead to overlapping data points between the two classes (we have only 2 classes in our data).
So we need to look at minimizing the within-class variance too. LDA does not use the variance directly, rather it uses the following formulation,
\[\begin{align} s_i^2 & = \sum_{x_j \in D_i} (a_j - m_i)^2 \end{align}\]
\(s_i^2\) is the total squared deviation of the projected points from the projected mean (remember, variance is the average squared deviation). This quantity is also known as the scatter.
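Continuing the sketch above (reusing X, y, w, m1 and m2), the projected scatter of each class is simply the sum of squared deviations of the projected coordinates from the projected class mean:

# Continuing the sketch above: projected coordinates per class
a1 = X[y == 0].dot(w)
a2 = X[y == 1].dot(w)

# s_i^2 = sum over class i of (a_j - m_i)^2, i.e. n_i times the class variance
s1_sq = np.sum((a1 - m1) ** 2)
s2_sq = np.sum((a2 - m2) ** 2)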
Fisher’s LDA
We can incorporate the above two ideas,
- Maximize the distance between projected means
- Minimize the sum of the projected scatter
into one equation, known as Fisher's LDA.
\[\begin{align} \max_w J(w) = \frac{(m_1-m_2)^2}{s_1^2+s_2^2} \end{align}\]
So the goal of LDA is to find the vector \(w\) which maximizes \(J(w)\). This vector \(w\) is also called the Optimal Linear Discriminant.
Rewrite the Equation Using the Input Data
The above objective function is in projected space, so let’s express it using the input data, since \(w\) is still unknown.
\[\begin{align} (m_1-m_2)^2 = & (w^T\mu_1 - w^T\mu_2)^2 \\ = & [w^T(\mu_1-\mu_2)]^2 \\ = & w^T(\mu_1-\mu_2)(\mu_1-\mu_2)^Tw \\ = & w^TBw \end{align}\]
Here \(B\) is called the between-class scatter matrix. It is a \(d \times d\) matrix.
For the denominator, we can expand \(s_1^2\) in the same way,
\[\begin{align} s_1^2 = & \sum_{x_i \in D_1} (a_i - m_1)^2 \\ = & \sum_{x_i \in D_1} (w^Tx_i - w^T\mu_1)^2 \\ = & w^T \left ( \sum_{x_i \in D_1} (x_i-\mu_1)(x_i-\mu_1)^T \right ) w \\ = & w^TS_1w \end{align}\]
Above, \(S_1\) is the scatter matrix for \(D_1\). Similarly, we can define \(S_2\),
\[\begin{align} s_2^2 = & w^TS_2w \end{align}\]
We can combine the above equations,
\[\begin{align} s_1^2+s_2^2 & = w^TS_1w+w^TS_2w\\ & = w^T(S_1+S_2)w \\ & = w^TSw \end{align}\]
Now we can rewrite the LDA objective function as,
\[\begin{align} \max_w J(w) = \frac{w^TBw}{w^TSw} \end{align}\]
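To make the rewritten objective concrete, the sketch below (still reusing the toy X, y, mu1 and mu2 from earlier) builds \(B\) and \(S\) directly from the input data and evaluates \(J(w)\) for a candidate direction. The LDA class implemented later constructs these same matrices:

# Between-class and within-class scatter matrices built from the raw data
diff = (mu1 - mu2).reshape(-1, 1)       # column vector (mu_1 - mu_2)
B = diff.dot(diff.T)                    # B = (mu_1 - mu_2)(mu_1 - mu_2)^T

S = np.zeros((X.shape[1], X.shape[1]))
for cls, mu in [(0, mu1), (1, mu2)]:
    centered = X[y == cls] - mu
    S += centered.T.dot(centered)       # S = S_1 + S_2

def J(w):
    # Fisher criterion expressed with the input-space matrices
    return w.dot(B).dot(w) / w.dot(S).dot(w)

print(J(np.array([1.0, 1.0])))          # J for one candidate direction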
Solving the Objective Function
You might have already guessed that in order to find the best direction \(w\), we need to differentiate the objective function w.r.t. \(w\) and set the derivative to zero.
Recall the quotient rule: if \(f(x)\) and \(g(x)\) are two functions, then the derivative of their ratio w.r.t. \(x\) is,
\[\frac{d}{dx} \bigg( \frac{f(x)}{g(x)} \bigg) = \frac{f^{\prime}(x)g(x) - f(x)g^{\prime}(x)}{g(x)^2}\]
We can use the above formula to differentiate our cost/objective function.
\[\begin{align} \frac{d}{dw} J(w) = & \frac{(2Bw)(w^TSw) - (2Sw)(w^TBw)}{(w^TSw)^2} = 0 \\ Bw(w^TSw)= & Sw(w^TBw) \\ Bw =& Sw \bigg ( \frac{w^TBw}{w^TSw} \bigg) \\ Bw =& J(w) Sw \\ Bw =& \lambda Sw \\ S^{-1}Bw =& \lambda S^{-1}Sw \\ (S^{-1}B)w =& \lambda w \\ \end{align}\]
If \(S\) is not a singular matrix (i.e. its inverse exists), then the above equation is a standard eigenvalue-eigenvector problem: \(\lambda = J(w)\) is an eigenvalue and \(w\) is the corresponding eigenvector of the matrix \(S^{-1}B\).
Now all we have to do is compute the eigendecomposition of \(S^{-1}B\) and take the eigenvector corresponding to the largest eigenvalue. That is our optimal \(w\) vector.
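Here is a minimal sketch of this last step, reusing \(B\) and \(S\) from the previous sketch; the fit() method implemented below does the same thing on the iris data:

# Eigendecomposition of S^-1 B; the eigenvector with the largest eigenvalue
# is the optimal discriminant direction, and that eigenvalue equals J(w)
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S).dot(B))
w_opt = eig_vecs[:, eig_vals.argmax()]
print(w_opt, eig_vals.max())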
LDA for Multi-Class
In case you want to use LDA for more than 2 classes, there is a small change in the way we calculate \(B\). The scatter matrix \(S\) remains the same.
\[\begin{align} B = \sum_{i = 1}^C n_i (\mu_i - \mu)(\mu_i - \mu)^T \end{align}\]
Where,
- \(C\) = number of target classes
- \(n_i\) = number of data points in class \(i\)
- \(\mu_i\) = mean of the data points in class \(i\)
- \(\mu\) = mean of the entire dataset
LDA Python Implementation
First we will create an LDA class and implement the logic there, then run different experiments.
import numpy as np

class LDA:
    def __init__(self):
        pass

    def fit(self, X, y):
        pass
fit()
Inside the fit() function, first get the target classes, then calculate the mean for each class. Store them in the mean_vectors list.
def fit(self, X, y):
    target_classes = np.unique(y)
    mean_vectors = []
    for cls in target_classes:
        mean_vectors.append(np.mean(X[y == cls], axis=0))
If the number of classes is 2, simply calculate the B matrix as per the above equation.
mu1_mu2 = (mean_vectors[0] - mean_vectors[1]).reshape(1, X.shape[1])
B = np.dot(mu1_mu2.T, mu1_mu2)
In case there are more than 2 classes, we first need to calculate the mean of the entire dataset, then use it to calculate the B matrix.
data_mean = np.mean(X, axis=0).reshape(1, X.shape[1])
B = np.zeros((X.shape[1], X.shape[1]))
for i, mean_vec in enumerate(mean_vectors):
    # Assumes the labels are encoded as 0 .. C-1 (LabelEncoder does this)
    n = X[y == i].shape[0]
    mean_vec = mean_vec.reshape(1, X.shape[1])
    mu1_mu2 = mean_vec - data_mean
    B += n * np.dot(mu1_mu2.T, mu1_mu2)
Now it's time to create the S matrix. Define an empty list s_matrix to store the scatter matrix of each class. Then loop through the mean_vectors and the data points of each class to calculate Si, and append each matrix to the list. Again, we are just following the formula above; in case you are confused, please refer to the equations.
s_matrix = []
for cls, mean in enumerate(mean_vectors):
    Si = np.zeros((X.shape[1], X.shape[1]))
    for row in X[y == cls]:
        t = (row - mean).reshape(1, X.shape[1])
        Si += np.dot(t.T, t)
    s_matrix.append(Si)
Create the S matrix by looping through the s_matrix list and adding up all the class scatter matrices.
S = np.zeros((X.shape[1], X.shape[1]))
for s_i in s_matrix:
    S += s_i
Calculate \(S^{-1}\) using np.linalg.inv(), then calculate \(S^{-1}B\). Finally, call np.linalg.eig() to get the eigenvalues and eigenvectors.
S_inv = np.linalg.inv(S)
S_inv_B = S_inv.dot(B)
eig_vals, eig_vecs = np.linalg.eig(S_inv_B)
The np.linalg.eig() function does not sort the eigenvalues, so we need to do that ourselves. Use the argsort() function, then reverse the resulting index array using [::-1]. Use the index to order eig_vals and eig_vecs, and finally return eig_vecs.
idx = eig_vals.argsort()[::-1]
eig_vals = eig_vals[idx] # Not needed
eig_vecs = eig_vecs[:, idx]
return eig_vecs
load_data()
We will be using the iris dataset, hence we need a function to load the data as needed for running the various experiments. I am not explaining load_data() in detail; feel free to post comments if you need any help.
import seaborn as sns
from sklearn import preprocessing

def load_data(cols, load_all=False, head=False):
    iris = sns.load_dataset("iris")
    if not load_all:
        if head:
            iris = iris.head(100)
        else:
            iris = iris.tail(100)
    le = preprocessing.LabelEncoder()
    y = le.fit_transform(iris["species"])
    X = iris.drop(["species"], axis=1)
    if len(cols) > 0:
        X = X[cols]
    return X.values, y
Experiment 1: 2 Classes, 2 Column Dataset
We will use petal_length and petal_width only. Also, we will use only 2 classes from the dataset. Call load_data(); the shape of X will be (100, 2).
cols = ["petal_length", "petal_width"]
X, y = load_data(cols, load_all=False, head=True)
print(X.shape)
Invoke fit() by passing X and y, which will return all the eigenvectors.
lda = LDA()
eig_vecs = lda.fit(X, y)
We can take just the first eigenvector, since we want to reduce the dimension to 1.
W = eig_vecs[:, :1]
We will loop through each data point, plot it using ax.scatter(), then use the orthogonal projection formula to calculate the projected point and plot that too.
import matplotlib.pyplot as plt

colors = ['red', 'green', 'blue']
fig, ax = plt.subplots(figsize=(10, 8))
for point, pred in zip(X, y):
    ax.scatter(point[0], point[1], color=colors[pred], alpha=0.3)
    proj = (np.dot(point, W) * W) / np.dot(W.T, W)
    ax.scatter(proj[0], proj[1], color=colors[pred], alpha=0.3)
plt.show()
Here is the plot. You can see that LDA has found the direction onto which the projected classes are optimally separated.
Experiment 2: 3 Classes, 2 Column Dataset
We will use the full iris dataset with 3 classes. Pass load_all=True to the load_data() function; no other changes are required.
cols = ["petal_length", "petal_width"]
X, y = load_data(cols, load_all=True, head=True)
print(X.shape)
lda = LDA()
eig_vecs = lda.fit(X, y)
W = eig_vecs[:, :1]
colors = ['red', 'green', 'blue']
fig, ax = plt.subplots(figsize=(10, 8))
for point, pred in zip(X, y):
ax.scatter(point[0], point[1], color=colors[pred], alpha=0.3)
proj = (np.dot(point, W) * W) / np.dot(W.T, W)
ax.scatter(proj[0], proj[1], color=colors[pred], alpha=0.3)
plt.show()
Here is the plot. You can see that all three classes are well separated in 1 dimension.
Experiment 3: 3 Classes, 4 Column Dataset
Now, if we want to reduce the dimension of the data from 4 to 2, we can take the 2 eigen vectors and them transform the data using X.dot(W)
.
X, y = load_data([], load_all=True, head=True)
print(X.shape)
lda = LDA()
eig_vecs = lda.fit(X, y)
W = eig_vecs[:, :2] # Take 2 eigen vectors
transformed = X.dot(W)
plt.scatter(transformed[:, 0], transformed[:, 1], c=y, cmap=plt.cm.Set1)
plt.show()
Here is the projected data in 2 dimensions. As you can see, we are able to reduce the dimension from 4 to 2 using LDA without losing much information.
sklearn library
You can also use the built-in LinearDiscriminantAnalysis class from the sklearn library.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
clf.fit(X, y)
transformed = clf.transform(X)
plt.scatter(transformed[:, 0], transformed[:, 1], c=y, cmap=plt.cm.Set1)
plt.show()
The plot looks the same, just flipped (the sign of a discriminant direction is arbitrary).
Conclusion
Linear Discriminant Analysis (LDA) is a simple yet powerful tool. PCA and LDA are often compared; however, LDA is a supervised learning method while PCA is unsupervised. There are other extensions of LDA available, such as Kernel LDA and QDA.
You can find the full code on GitHub.