Vectors

Definition and notation

A vector \(\mathbf{x} \in \mathbb{R}^{n}\) is an ordered list of \(n\) numbers, typically represented as a column:

\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \]

Special Vectors:

  • Zero vector: \(\mathbf{0} = [0, 0, \dots, 0]^\top\)
  • Ones vector: \(\mathbf{1} = [1, 1, \dots, 1]^\top\)
  • Standard basis vector: \(\mathbf{e}_i\) has at position \(i\) and \(0\) elsewhere.

Example:

import numpy as np

a = np.array([0, -1, 9, 0.2])
print("a = ", a)
print("Shape:", a.shape)

Click to see the output
a =  [ 0.  -1.   9.   0.2]
Shape: (4,)

Where do we can find the vectors representitive in machine learning?

Vector is the most common notation used to denote many things in machine learning. For example, in regression, or classification, we usually have the input which represented as a vector, typically called feature vector. Another, appearance of vector object is the process of calculation in the machine learning model when we usually have vector, matrix operators happens, e.g., matrix multiplication. Also, in neural network, at the end of the architecture, we have a linear layer which act as a weighted sum of the previous layer’s output, or an probablistic layer which convert to a confident vector for regression, classification task respectively.

Where do we see the use of vectors in quantum computing?

A quantum state of a qubit, or a set of qubits is usually represeted as a complex vector, \(\ket{\psi}\) in Hilbert space (usually denoted as \(\mathcal{H}\)).

Scaling and addition operations

Scalar Multilication: Multiply each element by a scalar \(c\), i.e.,

\[ \mathbf{y} = \lambda \mathbf{x} = \begin{bmatrix}\lambda x_1 \\ \lambda x_2 \\ \vdots \\ \lambda x_n\end{bmatrix} \]

Vector addtion: element-wise sum, i.e.,

\[ \mathbf{z} = \mathbf{x} + \mathbf{y} = \begin{bmatrix}x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n\end{bmatrix}. \]

Python example

a = np.array([1, -1, 2, 3])
b = np.array([0, 0.223, -12, 0])
print("a + b =", a + b)  # Output: [1.0, -0.777, -10.0, 3.0]
print("7 * b =", 7 * b)  # Output: [0.0, 0.669, -36.0, 0.0]
Click to see the output
a + b = [  1.     -0.777 -10.      3.   ]
7 * b = [  0.      1.561 -84.      0.   ]

Vector transpose

This operation converts a column to a row vector (and vice versa):

\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \mapsto \mathbf{x}^\top = \begin{bmatrix}x_1 & x_2 & \dots, x_n\end{bmatrix}. \]

Property: \(\left(\mathbf{x}^\top\right)^\top = \mathbf{x}\).

Inner product

\[ \begin{aligned} f: \ \mathbb{R}^n\times \mathbb{R}^n &\rightarrow \mathbb{R}\\ (\mathbf{x}, \mathbf{y}) &\mapsto \langle \mathbf{x}, \mathbf{y} \rangle , \end{aligned} \]

where

\[ \langle \mathbf{x}, \mathbf{y} \rangle \triangleq \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n}x_iy_i \]

For example, compute the inner product between \(\mathbf{x} = \begin{bmatrix}1 & -2 & 0.5\end{bmatrix}^\top\) and \(\mathbf{y} = \begin{bmatrix}-2 & 0 & 1\end{bmatrix}^\top\)

\[ \langle\mathbf{x}, \mathbf{y}\rangle = \begin{bmatrix}1 & -2 & 0.5\end{bmatrix}\begin{bmatrix}-2 \\ 0 \\ 1\end{bmatrix} = 1\times (-2) + (-2)\times 0 + 0.5\times 1 = -1.5 \]
x = np.array([1, -2, 0.5])
y = np.array([-2, 0, 1])
print("Inner product x@y = ", x @ y)
Click here to see the output.
Inner product x @ y =  -1.5

Outer product

\[ \begin{align} f: \ \mathbb{R}^m\times \mathbb{R}^n &\rightarrow \mathbb{R}^{m \times n} \\ (\mathbf{x}, \mathbf{y}) &\mapsto \mathbf{x}\otimes \mathbf{y} := \mathbf{x}\mathbf{y}^\top, \end{align} \]

For example, compute the inner product between \(\mathbf{x} = \begin{bmatrix}3 & -2 & 0\end{bmatrix}^\top\) and \(\mathbf{y} = \begin{bmatrix}-2 & -1 & -9\end{bmatrix}^\top\)

\[ \mathbf{x}\otimes \mathbf{y} = \mathbf{x}\mathbf{y}^\top = \begin{bmatrix} 3 \\ -2 \\ 0 \end{bmatrix}\begin{bmatrix}-2 & -1 & -9\end{bmatrix} = \begin{bmatrix} -6 & -3 & -27 \\ 4 & 2 & 18 \\ 0 & 0 & 0 \end{bmatrix} \]
x = np.array([3, -2, 0])
y = np.array([-2, -1, -9])
print("Outer product: \n", np.outer(x, y))
Click here to see the output.
Outer product: [[ -6 -3 -27] [ 4 2 18] [ 0 0 0]]

Matrices

Definition and notation

A matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) is a rectangular array with \(m\) rows and \(n\) columns:

\[ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix} \]

Matrix tranpose applies on a \(m \times n\) matrix resulting an \(n \times m\) matrix where the output matrix is constructed from the input matrix through swapping rows and columns, i.e.,

\[ \mathbf{A}^\top = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix} \]

Special matrices:

  • Square matrix: the number of rows equals to the number of columns, i.e., \(m = n\).
  • Identity matrix, \(\mathbf{I}_D\), contains all zeros, apart from the diagonal with ones only.
  • Symmetric matrix: if \(\mathbf{A}\) is a \(n\times n\) matrix and \(\mathbf{A} = \mathbf{A}^\top\), then \(\mathbf{A}\) is called square.

Example:

A = np.array([
    [1, -2], 
    [0, 3], 
    [9, 2]
])
print("A = \n", A)
print("Shape:", A.shape)
Click to see the output
A = 
[[  1.    -2. ]
 [  0.    -3.]
 [  9.     2.]]
Shape: (3, 2)

Scaling and addition

Scalar multiplication:

\[ \mathbf{B} = c \mathbf{A} = \begin{bmatrix} c a_{11} & c a_{12} & \cdots & c a_{1n} \\ c a_{21} & c a_{22} & \cdots & c a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c a_{m1} & c a_{m2} & \cdots & c a_{mn} \\ \end{bmatrix} \]

Matrix addition: element-wise sum (matrices must have the same shape), i.e.,

\[ \begin{aligned} \mathbf{C} = \mathbf{A} + \mathbf{B} &= \begin{bmatrix} a_{11}+b_{11} & a_{12}+b_{12} & \cdots & a_{1n}+b_{1n} \\ a_{21}+b_{21} & a_{22}+b_{22} & \cdots & a_{2n}+b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}+b_{m1} & a_{m2}+b_{m2} & \cdots & a_{mn}+b_{mn} \\ \end{bmatrix} \end{aligned} \]

Python example

A = np.array([
    [1, -1, 0, 3.4],
    [-2, 3, 0.23, 1.1],
    [0, 0, -100, -10]
])
k = 3.4
B = k*A
C = A + B
AT = A.T
print("A: = {}".format(A))
print("B = {}A = {}".format(k, B))
print("C = A + B = {}".format(C))
print("A.T = \n", AT)
Click to see the output
A: = [[   1.     -1.      0.      3.4 ]
 [  -2.      3.      0.23    1.1 ]
 [   0.      0.   -100.    -10.  ]]
B = 3.4A = [[   3.4     -3.4      0.      11.56 ]
 [  -6.8     10.2      0.782    3.74 ]
 [   0.       0.    -340.     -34.   ]]
C = A + B = [[   4.4     -4.4      0.      14.96 ]
 [  -8.8     13.2      1.012    4.84 ]
 [   0.       0.    -440.     -44.   ]]
 A.T = 
 [[   1.     -2.      0.  ]
 [  -1.      3.      0.  ]
 [   0.      0.23 -100.  ]
 [   3.4     1.1   -10.  ]]

Matrix multiplication

The product \(\mathbf{C} = \mathbf{A}\mathbf{B}\) is defined if \(\mathbf{A} \in \mathbb{R}^{m\times n}\), and \(\mathbf{B} \in \mathbb{R}^{n\times p}\), where

\[ C_{ij} = \sum_{k=1}^n a_{ik}b_{kj}. \]

Key properties:

  • Asscociative: \(\left(\mathbf{A}\mathbf{B}\right)\mathbf{C} = \mathbf{A}\left(\mathbf{B}\mathbf{C}\right)\)
  • Distributive: \(\mathbf{A}\left(\mathbf{B}+\mathbf{C}\right) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}\)
  • In general, with proper dimension, matrix multiplication is not commutative, i.e., \(\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}\)

Example:

\[ \begin{aligned} & \ \begin{bmatrix} 1 & 2 & 1 \\ 3 & 4 & 2 \end{bmatrix}\begin{bmatrix} -2.3 & 2 \\ 3 & 2 \\ -2.3 & 0 \end{bmatrix} \\ &= \begin{bmatrix}1\times (-2.3) + 2\times 3 + 1 \times (-2.3) & 1\times 2 + 2\times 2 + 1 \times 0 \\ 3\times(-2.3) + 4\times 3+ 2\times (-2.3) & 3\times 2 + 4\times 2 + 2\times 0\end{bmatrix} \\ &= \begin{bmatrix}1.4 & 6 \\ 0.5 & 14\end{bmatrix} \end{aligned} \]
A = np.array([
    [1, 2, 1],
    [3, 4, 2]
])
B = np.array([
    [-2.3, 2],
    [3, 2],
    [-2.3, 0]
])
print("AB = \n", A @ B)
Click here to see the output
AB = 
 [[ 1.4  6. ]
 [ 0.5 14. ]]

Matrix-vector product

Transform a vector \(\mathbf{x} \in \mathbb{R}^{n}\) into \(\mathbf{y} \in \mathbb{R}^{m}\). i.e.,

\[ \mathbf{y} = \mathbf{A} \mathbf{x} = \sum_{j=1}^{n}x_j\mathbf{A}_{:, j} \]

Geometric interpretation: linear combination of \(\mathbf{A}\)’s columns weighted by \(\mathbf{x}\).

A = np.array([
    [1, 2, 1],
    [3, 4, 2]
])
x = np.array([1, -3, 0])
print("Ax =", A @ x)
Click here to see the output
Ax = [-5 -9]

Norms of a vector and matrix

A norm of a vector/matrix object is, in plain langage, a measure of the “length” of the object. Precisely, a norm function \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\) has to satisfies 4 properties:

  • \(\forall \mathbf{x} \in \mathcal{D}, \ f(\mathbf{x}) \ge 0\)
  • \(f(\mathbf{x}) = 0 \ \text{iif} \ \mathbf{x} = 0\)
  • \(\forall \ \mathbf{x} \in \mathcal{D}, t \in \mathbb{R}, \ f(t\mathbf{x}) = |t| f(\mathbf{x})\)
  • \(\forall \mathbf{x}, \mathbf{y} \in \mathcal{D}, \ f(\mathbf{x}+\mathbf{y}) \le f(\mathbf{x})+f(\mathbf{y})\)

Vector norms

A \(p\)-norm of a vector \(\mathbf{x}\), denoted as \(||\mathbf{x}||_p\), with \(p \ge 1\), is defined as

\[ ||x||_p = \left(\sum_{i=1}^{n}|x_i|^{p}\right)^{1/p}. \]

\(2\)-norm, or Euclidean norm

\[ ||x||_2 = \sqrt{\sum_{i=1}^{n}|x_i|^{2}}. \]

\(1\)-norm, or Euclidean norm

\[ ||x||_1 = \sum_{i=1}^{n} |x_i|. \]

Max-norm

\[ ||x||_\infty = \lim_{p\rightarrow \infty}\left(\sum_{i=1}^{n}|x_i|^{p}\right)^{1/p} = \max_{i=1, \dots, n} |x_i|. \]

\(0\)-norm, or pseudo norm, which counts the number of non-zero elements in \(\mathbf{x}\), i.e.,

\[ ||x||_0 = \sum_{i=1}^{n} \mathbb{I}(|x_i| > 0). \]

Matrix norms

To approach the norm of the matrix we now think a matrix as a linear transformation. Consider a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\), we define the induced norm of \(\mathbf{A}\) as the maximum amount by which \(f\) can lengthen any unit-norm input:

\[ ||\mathbf{A}||_p = \max_{\mathbf{x} \neq \mathbf{0}} \frac{||\mathbf{A}\mathbf{x}||_p}{||\mathbf{x}||_p} = \max_{||\mathbf{x}||=1}||\mathbf{A}\mathbf{x}||_p \]

With \(p = 2\):

\[ ||\mathbf{A}||_2 = \sqrt{\lambda_{\max}\left(\mathbf{A}^\top \mathbf{A}\right)} = \max_{i} \sigma_i, \]

where \(\lambda_{\max}(\mathbf{A}^\top\mathbf{A})\) is the largest eigenvalue of \(\mathbf{A}^\top \mathbf{A}\), and \(\sigma_i\) is the \(i\)-th singular value of \(\mathbf{A}\)

With \(p = 1\), nulear norm, or trace norm

\[ ||\mathbf{A}||_{1} \equiv ||\mathbf{A}||_{*} = \text{tr}\left(\sqrt{\mathbf{A}^\top \mathbf{A}}\right) = \max_{i} \sigma_i \]

Obviously, since singular values \(\sigma_i \ge 0\), therefore

\[ ||\mathbf{A}||_{*} = ||\boldsymbol{\sigma}||_1. \]

If we think matrix as a vector, we can derive the Frobenius norm which is the sum of square of each elements in the matrix

\[ ||\mathbf{A}||_{F} = \sqrt{\sum_{i=1}^m\sum_{j=1}^n a_{ij}^2} = \sqrt{\text{tr}(\mathbf{A}^\top \mathbf{A})}. \]

In case of evaluating \(\mathbf{A}\) is costly expensive, we use approximate Frobenius norm through the Hutchinson trace estimattion which results:

\[ ||\mathbf{A}||_F^2 = \text{tr}(\mathbf{A}^\top \mathbf{A}) = \mathbb{E}\left[\left(\mathbf{A}\mathbf{v}\right)^\top \mathbf{A}\mathbf{v}\right] = \mathbb{E}\left[||\mathbf{A}\mathbf{v}||_2^2\right], \]

where the probability density function of \(\mathbf{v}\) is zero-mean, identity-covariance Gaussian distribution, i.e., \(f_{pdf}(\mathbf{v}) = \mathcal{N}(\mathbf{0}, \mathbf{I})\).

Special types of matrices

Diagonal matrix

  • Non-zero entries only on the main diagonal
\[ \mathbf{D} = \textrm{diag}(d_1, \dots, d_n) = \begin{bmatrix}d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_n \end{bmatrix} \]

Triangular matrix:

  • Upper: non-zero entries on/above the diagonal.
  • Lower: non-zero entries on/below the diagonal.

Example:

D = np.diag([3, -2, 5])  # Diagonal matrix
U = np.triu([[1,2,3], [4,5,6], [7,8,9]])  # Upper triangular
L = np.tril([[1,2,3], [4,5,6], [7,8,9]])
print("D = \n", D)
print("U = \n", U)
print("L = \n", L)
Click here to see the output.
D = 
 [[ 3  0  0]
 [ 0 -2  0]
 [ 0  0  5]]
U = 
 [[1 2 3]
 [0 5 6]
 [0 0 9]]
 L = 
 [[1 0 0]
 [4 5 0]
 [7 8 9]]

Orthogonal matrices

A matrix \(\mathbf{U} \in \mathbb{R}^{n \times n}\) is orthogonal iff:

\[ \mathbf{U}^\top \mathbf{U} = \mathbf{U}\mathbf{U}^\top = \mathbf{I} \]

Properties:

  • Norm preservation, i.e., \(||\mathbf{U}\mathbf{x}||_2 = ||\mathbf{x}||_2\)
  • Angular preservation, i.e., \(\angle(\mathbf{U}\mathbf{x}, \mathbf{U}\mathbf{y}) = \angle(\mathbf{x}, \mathbf{y})\)
  • Columns form an orthonormal basis.

Note that, the orthogonal matrix term is called unitary operator, or unitary matrix in copmlex vector space, e.g., in quantum computing, the unitary operations always govern the evolution of the quantum system. Some other uses of orthogonal matrix could be found in QR decomposition applications.

Positive definte matrices

A symmetric matrix \(\mathbf{A} \in \mathbb{S}^{n} \), where \(\mathbb{S}^{n} \triangleq \{\mathbf{X} \in \mathbb{R}^{n \times n}: \mathbf{X}^\top= \mathbf{X}\}\), is postive definite, which is denoted as \(\mathbf{A} \succ 0\), iff

\[ \mathbf{x}^\top \mathbf{A}\mathbf{x} > 0, \quad \ \forall \ \mathbf{x} \neq \mathbf{0}. \]

Usage: covairance matrices in statistics, Hessians in optimization.

Matrices properties

Trace of a square matrix

\[ \textrm{tr}(\mathbf{A}) \triangleq \sum_{i=1} a_{ii} \]

Properties of trace operator:

\[ \begin{align} \textrm{tr}(\mathbf{A}) &= \textrm{tr}\left(\mathbf{A}^\top\right) \\ \textrm{tr}(\mathbf{A} + \mathbf{B}) &= \textrm{tr}(\mathbf{A}) + \textrm{tr}(\mathbf{B}) \\ \textrm{tr}(c \mathbf{A}) &= c\textrm{tr}(\mathbf{A}) \\ \textrm{tr}(\mathbf{A}\mathbf{B}) &= \textrm{tr}(\mathbf{B}\mathbf{A}) \\ \textrm{tr}(\mathbf{A}) &= \sum_{i=1}^{n}\lambda_i, \end{align} \]

where \(\lambda_i\) are eigenvalues of \(\mathbf{A}\).

Cyclic permutation property: for \(\mathbf{A} \in \mathbb{R}^{n \times l}, \ \mathbf{B} \in \mathbb{R}^{l\times p}, \ \mathbf{C}\in \mathbb{R}^{p \times n}\) such that \(\mathbf{A}\mathbf{B}\mathbf{C} \) square, we have

\[ \textrm{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \textrm{tr}(\mathbf{C}\mathbf{A}\mathbf{B}) = \textrm{tr}(\mathbf{B}\mathbf{C}\mathbf{A}) \]

Determinant of a square matrix

The determinant of a square matrix, denoted as \(\det(\mathbf{A})\), or \(|\mathbf{A}|\), is a measure of how much it changes a unit volume when viewed as a linear combination. Using Laplace expansion, we have

\[ \det(\mathbf{A}) = \begin{cases} \sum_{j=1}^{n} (-1)^{i+j}a_{ij}\det(\textbf{M}^{(ij)}), \ &\textrm{if} \ n > 2 \\ a_{11}a_{22} - a_{21}a_{12}, \ &\textrm{if} \ n = 2 \\ A_{11} , \ &\textrm{if} \ n = 1 \end{cases} \]

where \(\mathbf{M}^{(ij)}\) is a \((n-1)\times (n-1)\) matrix from \(\mathbf{A}\) that results from \(\mathbf{A}\) by removing the \(i-th\) row, and \(j\)-th column.

Properties of the determinant:

\[ \begin{align} |\mathbf{A}| &= |\mathbf{A}^\top| \\ |c\mathbf{A}| &= c^n |\mathbf{A}| \\ |\mathbf{A}\mathbf{B}| &= |\mathbf{A}||\mathbf{B}| \\ |\mathbf{A}| &= 0 \ \text{iff} \ \mathbf{A} \ \text{is singular} \\ |\mathbf{A}^{-1}| |\mathbf{A}| &= 1 \ \textrm{if} \ \mathbf{A} \ \text{is not singular} \\ |\mathbf{A}| &= \prod_{i=1}^{n} \lambda_i \end{align} \]

where \(\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times n}\).

For a positive definite matrix \(\mathbf{A}\), we can write \(\mathbf{A} = \mathbf{L}\mathbf{L}^\top\), where \(\mathbf{L}\) is the lower triangular Cholesky decomposition. In this case, we can compute the determinant of \(\mathbf{A}\) as

\[ \det(\mathbf{A}) = \det(\mathbf{L}\mathbf{L}^\top) = \det{\mathbf{L}}\det{\mathbf{L}^\top} = \det(\mathbf{L})^2. \]\[ \log{\det(\mathbf{A})} = \log{ \det(\mathbf{L})^2} = 2 \log{\det(\mathbf{L})} = 2\log{\prod_i L_{ii}} = 2\textrm{tr}(\log{\textrm{diag}(\mathbf{L})}) \]
Click here to see the proof.

Proof of \(\log{\prod_i L_{ii}} = \textrm{tr}(\log{\textrm{diag}(\mathbf{L})})\) by using log of product equals to sum of log operation rule:

\[ \log{\prod_i L_{ii}} = \sum_{i=1} \log{L_{ii}} = \textrm{tr}(\log{\textrm{diag}(\mathbf{L}})) \]

Rank of a matrix

The column rank of a matrix \(\mathbf{A}\) is the dimension of the space spanned by its columns

\[ r_c(\mathbf{A}) = \text{dim}(\textrm{col}(\mathbf{A})) = \text{dim}(\textrm{span}(\{\mathbf{A}_{:,j}\}_{j=1}^{n})) \]\[ r_r(\mathbf{A}) = \text{dim}(\textrm{row}(\mathbf{A})) = \text{dim}(\textrm{span}(\{\mathbf{A}_{i,:}\}_{i=1}^{m})) \]

It is proved that the column rank and row rank of a matrix \(\mathbf{A}\) are the same, i.e.,

\[ r(\mathbf{A}) \triangleq r_c(\mathbf{A}) = r_r(\mathbf{A}). \]

Rank’s properties

\[ \mathbf{A} \in \mathbb{R}^{m \times n}, \ r(\mathbf{A}) \le \min{(m, n)}. \]

If the equality holds, \(\mathbf{A}\) is said to be full rank, o.w called rank deficient.

\[ \begin{align} r(\mathbf{A}) &= r(\mathbf{A}^\top) = r(\mathbf{A}^\top \mathbf{A}) = r(\mathbf{A}\mathbf{A}^\top) \\ r(\mathbf{A}\mathbf{B}) &\le \min{(r(\mathbf{A}), r(\mathbf{B}))} \end{align} \]

Given \(\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times}\), we have

\[ r(\mathbf{A} + \mathbf{B}) \le r(\mathbf{A}) + r(\mathbf{B}) \]

The square matrix is invertible iff it is full rank.

A tall rectangular matrix is invertible (left pseudoinverse) iff its rank equals to the number of columns, which is called full column rank.

A wide rectangular matrix is invertible (left pseudoinverse) iff its rank equals to the number of rows, which is called full row rank.

Condition numbers

The condition number of a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) is a measure of how numerically stable any computations involving \(\mathbf{A}\) will be. By general definition, it is

\[ \kappa(\mathbf{A}) \triangleq ||\mathbf{A}|| \le 1. \]

Therefore, the condition number \(\kappa(\cdot)\), pronounced “kappa” depends on which norm used. Typically, the \(l_2\)-norm is calculated unless state otherwise. In case of \(l_2\)-norm, we have

\[ \kappa{(\mathbf{A})} = \frac{\sigma_{\max}}{\sigma_{\min}} = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}} \]

The condition number is a better measure of nearness to the singularity than the size of the determinant. If a matrix \(\mathbf{A}\) whose \(\kappa(\mathbf{A})\) is small (close to \(1\)), it is called well-conditioned. If its condition number is large, it is ill-conditioned.

Vector spaces

Linear independence, spans and basis sets

A set of vectors \(S = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}\) is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. Specially, the following linear system may only have one unique solution \(\boldsymbol{\alpha} = \mathbf{0}\) to suffice the linearly independent property for \(\mathbf{S}\)

\[ \alpha_1\mathbf{x}_1 +\alpha_2 \mathbf{x}_2 +\cdots + \alpha_n\mathbf{x}_n = \mathbf{0}. \]

We can rewrite it in a compact form as

\[ \mathbf{X}\boldsymbol{\alpha} = \mathbf{0}, \]

with \(\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n]\).

The span of a set of vectors \(S = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}\) is the set of all vectors that can be expressed as a superposition of \(\{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n\}\), i.e.,

\[ \textrm{span}(\{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n\}) \triangleq \left\{\mathbf{v}: \ \mathbf{v} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i, \ \ \alpha_i \in \mathbb{R}\right\} \]

A basis \(\mathcal{B}\) is a set of linearly independent vectors that spans the whole space, meaning that its span is \(\mathbb{R}^n\). A obviious basis is the standard basis which use coordiate vectors \(\mathbf{e}_1 = [1, 0, \cdots, 0]^\top\), up to \(\mathbf{e}_{n} = [0, 0, \dots, 1]^\top\).

Range and nullspace of a matrix

Column space of \(\mathbf{A}\), usually called range of \(\mathbf{A}\), is the span of the columns of \(\mathbf{A}\), i.e.,

\[ \text{col}(\mathbf{A}) := \textrm{span}\left(\{\mathbf{A}_{:,j}\}_{j=1}^{n}\right) \triangleq \{\mathbf{v}\in \mathbb{R}^{m}: \mathbf{v} = \mathbf{A}\mathbf{x}, \ \mathbf{x} \in \mathbb{R}^{n}\} \]

Row space of \(\mathbf{A}\), similarly is defined as the span of the rows of \(\mathbf{A}\), i.e.,

\[ \textrm{row}(\mathbf{A}) := \textrm{span}\left(\{\mathbf{A}_{i,:}\}_{i=1}^{m}\right) \triangleq \{\mathbf{v}\in \mathbb{R}^{n}: \mathbf{v} = \mathbf{A}^\top\mathbf{y}, \ \mathbf{y} \in \mathbb{R}^{m}\} \]

Nullspace of a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) is the set of all vectors that get mapped to the null vector when multiplied \(\mathbf{A}\), i.e,

\[ \textrm{nullspace}(\mathbf{A}) \triangleq \{\mathbf{x} \in \mathbb{R}^{n}: \mathbf{A} \mathbf{x} = \mathbf{0}\}. \]

It is always true that, row space and nullspace of a matrix \(\mathbf{A}\) are orthogononal to each other, or we can denote

\[ \begin{aligned} & \textrm{null}(\mathbf{A}) = \textrm{row}(\mathbf{A})^{\perp} \\ \Leftrightarrow \ & \textrm{row}(\mathbf{A}) = \textrm{null}(\mathbf{A})^\top \end{aligned} \]

The rank of the nullspace and the row space of a matrix \(\mathbf{A}\) are summed to \(m\), i.e.,

\[ \textrm{r}(null(\mathbf{A})) + \textrm{r}(row(\mathbf{A})) = m. \]

Linear projection

A projection \(\pi\) is characterized by a span of set of vectors \(\mathcal{A} = \{\mathbf{a}_k \in \mathbb{R}^{m}\}_{k=1}^{n}\), which transform a input vector \(\mathbf{y} \in \mathbb{R}^{m}\) to a vector \(\mathbf{x} \in \textrm{span}\{\mathbf{a}_1, \dots, \mathbf{x}_n\}\) such that the projected \(\mathbf{x}\) is as close as possible to \(\mathbf{y}\). The common measurement is use to quantify the distance is Euclidean norm \(||\mathbf{x} - \mathbf{y}||_2\). The problem is stated precisely as

\[ \pi_{\mathcal{A}}(\mathbf{y}) = \textrm{argmin}_{\mathbf{x} \in \mathcal{R}(\mathcal{A})} ||\mathbf{x} - \mathbf{y}||_2, \]

Let \(\mathbf{A} = [\mathbf{a}_1, \dots, \mathbf{a}_n] \in \mathbb{R}^{m \times n}\), since \(\mathbf{x} \in \mathcal{R}(\mathcal{A})\), we can represent \(\mathbf{x}\) as a superposition in the basis \(\mathcal{A}\), i.e.,

\[ \mathbf{x} = \mathbf{A}\boldsymbol{\alpha}. \]

Therefore, the projected \(\pi_{\mathcal{A}}(\mathbf{y}) \) can be found equivalently as

\[ \pi_{\mathcal{A}}(\mathbf{y}) = \mathbf{A} \left(\textrm{argmin}_{\boldsymbol\alpha \in\mathbb{R}^{n}} ||\mathbf{A}\boldsymbol\alpha - \mathbf{y}||_2^2\right) \]

Obviously, this is an convex optimization wrt \(\boldsymbol\alpha\), therefore, we can use the stationary condition which pull out the gradient of the objective function and set it to \(\mathbf{0}\), i.e.,

\[ \begin{aligned} \nabla_{\boldsymbol\alpha} ||\mathbf{A}\boldsymbol\alpha - \mathbf{y}||_2^2 &= 2\mathbf{A}^\top\mathbf{A}\boldsymbol\alpha - 2\mathbf{A}^\top \mathbf{y} \\ \end{aligned} \]

By setting the express to \(\mathbf{0}\), we have

\[ \mathbf{A}^\top \mathbf{A}\boldsymbol\alpha = \mathbf{A}^\top \mathbf{y} \]

Now, if \(\mathbf{A}^\top\mathbf{A} \in \mathbb{R}^{n \times n}\) is full rank, i.e., \(r(\mathbf{A}^\top\mathbf{A}) = n\), or equivalently \(\mathbf{A}\) is full column rank (which means \(\mathcal{A} \) is linearly independent), we get the optimal weights as

\[ \boldsymbol{\alpha}^{*} = \left(\mathbf{A}^\top\mathbf{A}\right)^{-1} \mathbf{A}^\top \mathbf{y}. \]

Plugging into the projection formula, we have the closed-form solution of the linear projection, i.e.,

\[ \pi_{\mathcal{A}}(\mathbf{y}) = \mathbf{A}\left(\mathbf{A}^\top\mathbf{A}\right)^{-1} \mathbf{A}^\top \mathbf{y}. \]

Useful manipulating data matrices operations

Sum over rows

\[ \begin{aligned} f: \mathbb{R}^{N \times D} &\rightarrow \mathbb{R}^{1\times D} \\ \mathbf{X} &\mapsto \mathbf{1}_N^\top \mathbf{X}, \end{aligned} \]

where \(\mathbf{1}_N^\top \mathbf{X}\) can be expressed explicited as

\[ \mathbf{1}^\top_N\mathbf{X} = \sum_{i=1}^{N}\mathbf{X}_{i,:} \]

Sum over columns

\[ \begin{aligned} f: \mathbb{R}^{N \times D} &\rightarrow \mathbb{R}^{N} \\ \mathbf{X} &\mapsto \mathbf{X}\mathbf{1}_D, \end{aligned} \]

where \(\mathbf{X}\mathbf{1}_{D}\) can be expressed explicited as

\[ \mathbf{X}\mathbf{1}_{D} = \sum_{j=1}^{D}\mathbf{X}_{:, j} \]

Sum all entries

\[ \begin{aligned} f: \mathbb{R}^{N \times D} &\rightarrow \mathbb{R} \\ \mathbf{X} &\mapsto \mathbf{1}_N^\top\mathbf{X}\mathbf{1}_D, \end{aligned} \]

where \(\mathbf{1}_N^\top\mathbf{X}\mathbf{1}_D\) can be expressed explicited as

\[ \mathbf{1}_N^\top\mathbf{X}\mathbf{1}_D = \sum_{i=1}^{N}\sum_{j=1}^{D} x_{ij}. \]

Duplicate a column to a matrix

\[ \begin{aligned} f: \mathbb{R}^{D} &\rightarrow \mathbb{R}^{N \times D} \\ \mathbf{x} &\mapsto \mathbf{x}\mathbf{1}_{N}^\top. \end{aligned} \]

Duplicate a row vector to a matrix

\[ \begin{aligned} f: \mathbb{R}^{N} &\rightarrow \mathbb{R}^{N \times D} \\ \mathbf{x}^\top &\mapsto \mathbf{1}_{D}\mathbf{x}^\top. \end{aligned} \]

Inversion operator

Inverse of a square matrix

The inverse of a square matrix \(\mathbf{A} \in \mathbb{R}^{n\times n}\) is denoted as \(\mathbf{A}^{-1}\), and is defined such that

\[ \mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I} \]

The equiavalent condition for \(\mathbf{A}^{-1}\) exists is \(\det(\mathbf{A})\neq 0\). In case of \(\det(\mathbf{A}) = 0\), we call the matrix \(\mathbf{A}\) is singular.

Inverse operator’s properties (assumedly there exists the inverse of \(\mathbf{A}\) and \(\mathbf{B}\)):

\[ \begin{aligned} \left(\mathbf{A}^{-1} \mathbf{A}\right)^{-1} &= \mathbf{A}\\ (\mathbf{A}\mathbf{B})^{-1} &= \mathbf{B}^{-1}\mathbf{A}^{-1}\\ \left(\mathbf{A}^{-1}\right)^\top &= \left(\mathbf{A}^\top\right)^{-1} \triangleq \mathbf{A}^{-\top} \end{aligned} \]

Schur complements and block matrix inversion

Consider the matrix \(\mathbf{M} \in \mathbb{R}^{n \times n}\) which is represented as partioned blocks below

\[ \mathbf{M} = \begin{bmatrix} \mathbf{E}_{m_1\times n_1} & \mathbf{F}_{m_1\times n_2} \\ \mathbf{G}_{m_2\times n_1} & \mathbf{H}_{m_2 \times n_2} \end{bmatrix}, \]

with \(m_1 + m_2 = n, \ n_1 + n_2 = n\).

Under assumption, there is exixsting the inverse of \(\mathbf{H}\), the Schur complement of \(\mathbf{M}\) wrt \(\mathbf{H}\) denoted \(\mathbf{M}/\mathbf{H}\) which is defined as

\[ \mathbf{M}/\mathbf{H} \triangleq \mathbf{E} - \mathbf{F}\mathbf{H}^{-1}\mathbf{G} \]

Similarly, we define the Schur complement of \(\mathbf{M}\) wrt \(\mathbf{E}\) as

\[ \mathbf{M}/\mathbf{E} \triangleq \mathbf{H} - \mathbf{G}\mathbf{E}^{-1}\mathbf{F} \]

Then the inverse of \(\mathbf{M}\) can be found through the partitioned inverse formula (some literatures called it block matrix inversion) as

\[ \begin{align} \mathbf{M}^{-1} = \begin{bmatrix} \mathbf{E} & \mathbf{F}\\ \mathbf{G} & \mathbf{H} \end{bmatrix}^{-1} = \begin{bmatrix} \left(\mathbf{M}/\mathbf{H}\right)^{-1} & -\left(\mathbf{M}/\mathbf{H}\right)^{-1}\mathbf{F}\mathbf{H}^{-1} \\ -\mathbf{H}^{-1}\mathbf{G}\left(\mathbf{M}/\mathbf{H}\right)^{-1} & \mathbf{H}^{-1} + \mathbf{H}^{-1}\mathbf{G}\left(\mathbf{M}/\mathbf{H}\right)^{-1}\mathbf{F}\mathbf{H}^{-1} \end{bmatrix} \end{align} \]

Notice that, since the blocks of \(\mathbf{M}^{-1}\) require inversion opeartor, e.g., \(\left(\mathbf{M}/\mathbf{H}\right)^{-1}\), one way we can compute the inverse of \(\mathbf{M}\) is defining an recursive procudure with the base case is directly computing the inverses of \(2\times 2\) matrices. This method give us a favor of limited computational power, where we do not have computing resource (e.g., memory, RAM) to compute the inverse of high problem size, \(n \times n\) by solving \(2\times 2\) subproblems, but it is expensive in terms of running time complexity since its recursive nature.

Pseudoinverse of a matrix

It is straighfoward to find the inverse of a square matrix, what if the matrix is non-square? How do we find an inverse of rectangular matrix?

Supposedly, we are given a matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\), in general, \(N, D\) are arbitrary positive integers. Since the one of properties of rank, we have \(r(\mathbf{X}) \le \min\{N, D\}\). If \(r(\mathbf{X}) < \min\{N, D\}\) then there is no existing the inverse, (also pseudoinverse) of \(\mathbf{X}\).

Supposely, we have \(D\le N\) and \(r(\mathbf{X}) = D\), the (left) pseudoinverse of \(\mathbf{X}\) exists, and is defined as

\[ \mathbf{X}^{\dagger} \triangleq (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top. \]

The reason we call it left pseudoinverse is the inverse \((\mathbf{X}^\top \mathbf{X})^{-1}\) is on the left side.

In case of \(N \le D\) and \(r(\mathbf{X}) = N\), then the (right) pseudoinverse of \(\mathbf{X}\) exists, and is defined as

\[ \mathbf{X}^{\dagger} \triangleq \mathbf{X}^\top (\mathbf{X}\mathbf{X}^\top)^{-1}. \]

Pseudoinverse of a vector

One of the forgoten quesion is do we have an inverse of a \(n\)-dimensional vector? Well, it does exist cause vector is a special kind of matrix. Supposedly, \(\mathbf{x} \in \mathbb{R}^{n \times 1}\), we have the left pseudoinverse of \(\mathbf{x}\) is defined as

\[ \begin{aligned} \mathbf{x}^{\dagger} &= \left(\mathbf{x}^\top \mathbf{x}\right)^{-1}\mathbf{x}^\top \\ &= \left(||\mathbf{x}||_2^{2}\right)^{-1}\mathbf{x}^\top \\ &= \frac{\mathbf{x}^\top}{||\mathbf{x}||_2^2}. \end{aligned} \]

Interestingly, the inverse of a (column) vector is its tranpose normalized by its norm square. If \(\mathbf{x}\) is real, unit-magnitude vector, i.e., \(||\mathbf{x}||_2 = 1\), then \(\mathbf{x}^\dagger = \mathbf{x}^\top\) (if \(\mathbf{x}\) is unit-magnitude, but belongs to complex vector space, we use the conjugate tranpose instead of only tranpose, i.e., \(\mathbf{x}^\dagger = \mathbf{x}^{*}\)).

Here we an interesting connect with quantum computing is the relationship between ket and bra. A ket, denoted as \(\ket{\psi}\) represents for a quantum state of a full quantum system, or subsystem, which is an column vector lies in Hilbert space (a type of complex vector space). A bra, denoted as \(\bra{\psi}\) is defined as the (left) pseudoinverse of \(\ket{\psi}\), i.e.,

\[ \bra{\psi} = \ket{\psi}^{\dagger} = \frac{\ket{\psi}^{*}}{\braket{\psi, \psi}}, \]

where \(\mathbf{x}^{*}\) is conjugate transpose of \(\mathbf{x}\), e.g., \(\begin{bmatrix}1 \\ -i\end{bmatrix}^{*} = \begin{bmatrix}1 & i\end{bmatrix}\).

To ensure the quantum property, i.e., \(||\ket{w}||_2^2 = 1\), or \(\braket{\psi, \psi} = 1\), then we have

\[ \bra{\psi} = \ket{\psi}^{*}. \]

It means that computing the \(\bra{\psi}\) needss to do the conjugate transpose operator on \(\ket{\psi}\).

Inverse of a unitary matrix

Supposedly, we are given a unitary matrix \(\mathbf{U} \in \mathbb{C}^{N}\), we would like to find its inverse, which is \(\mathbf{U}^\dagger\). By unitary definition, we have

\[ \mathbf{U}^{*} \mathbf{U} = \mathbf{U}\mathbf{U}^{*} = \mathbf{I}, \]

where \(\mathbf{A}^{*}\) denotes for the conjugate transpose operator.

Obviously, from the definion, the inverse of \(\mathbf{U}\) is its conjugate tranpose, i.e.,

\[ \mathbf{U}^\dagger \triangleq \mathbf{U}^{*}. \]

This is a very nice property of unitary operator, since it is apparently effortless to find its inversion. Comparing to a general matrix size \(n \times n\), supposedly full rank, the cost for finding the inverse using classical approach, e.g., Singular value decomposition (SVD), is \(\mathcal{O}(n^3)\), while for finding the inverse of the unitary matrices is simply taking the tranpose of the input matrix, and potentially the conjugate step, which cost \(\mathcal{O}(n^2)\).