# Matrix multiplication is faster than you expect (part I)

Suppose we’re running a particle physics/neurobiology simulation, or solving some machine learning/data mining problem, and we need to do some linear algebra. Specifically, we want to multiply two real $n \times n$ matrices $A$, $B$. To compute the $i,j$-th entry of the product we go along the $i$th row of $A$, multiplying entrywise with the $j$th column of $B$, so that for each entry of $AB$ we perform $n$ real multiplications (and $n-1$ real additions, but this makes no difference to the “average” number of operations needed, so we’ll be slack and gloss over such details until we formalize things in the next few sections). Since there are $n^2$ entries in the product, this gives a total of $n^3$ operations required to calculate the product, as one would intuitively expect. As a slightly unrealistic example, let us generously assume that our matrices are small, of dimension $n \approx 10^5$ (which incidentally is just outside the largest matrix size allowed by GNU Octave on my computer), then assuming the ability to perform $10^{10}$ operations a second on a “standard” PC processor, one matrix multiplication would take on the order of $10^5$ seconds, or slightly over a day. In real-world applications such as high-performance computing or big data problems the dimension gets much bigger, so naive multiplication is still too slow, even given the extra processing power we gain by switching to supercomputers.

Given that the procedure described above is a direct implementation of the definition of the matrix product, we might expect not to be able to do very much better than an order of $n^3$ operations for matrix multiplication. It is thus perhaps surprising at first that there do exist algorithms that improve on the exponent. In this series of posts we look at some basic computational complexity theory, and build towards discussing the Strassen multiplication algorithm, which requires an order of $n^{\log_2{7}} \approx n^{2.807}$ operations. This seemingly small decrease in the exponent is already enough to reduce the number of seconds taken to compute the product in the example above by an order of magnitude, cutting the theoretical time required to around three hours.

## Reducing the number of multiplications

To gain some intuition for how this is achieved, consider the following toy example. One learns the traditional method of multiplying two complex numbers in high school — if $x = a + bi$ and $y = c + di$, then their product has real part $ac - bd$ and imaginary part $ad + bc$, so that one needs four real multiplications, two for each of the real and imaginary parts.

Gauss observed that the number of multiplications required can be reduced to three. First compute the three real products

\begin{aligned} k_1 &= c (a + b) \\ k_2 &= b (c + d) \\ k_3 &= a (d - c), \end{aligned}

then the real and imaginary parts of the product are $k_1 - k_2$ and $k_1 + k_3$ respectively.

Observe that there is a cost to doing this: we saved one real multiplication by increasing the number of additions performed from two to five, one for each computation of $k_i$ in addition to the actual calculation of the real and imaginary parts. Since for humans and most modern computers the effort required to multiply two real numbers is roughly the same as that required to add them, this method arguably does not save us a great deal compared to the standard complex multiplication algorithm. But it illustrates a useful idea, namely that in situations where the cost of one particular operation is greater than another, there may exist algorithms that trade the more expensive operation for (a few instances of) the cheaper one, improving the cost of the overall computation.

In the case of square matrices it indeed turns out that multiplying is more expensive than adding, since to add matrices $A$ and $B$ we only need to perform addition on each of their $n^2$ entries, while as we have seen taking their product requires $n^3$ real multiplications. This makes the problem of multiplying two matrices amenable to a similar Gauss-like trick as for complex numbers, which is the basis of Strassen’s algorithm.

## Computational time complexity

To rigorously analyze algorithms we represent them using pseudocode in a style reminiscent of Python (so that we use colons and indentation in place of braces or $\texttt{begin}$ and $\texttt{end}$ keywords to delimit block control structures). In our analysis we consider an elementary operation to be any addition, multiplication or assignment to a variable.[1] We formalize the cost of an algorithm in terms of its time complexity: the number of elementary operations required for the algorithm to terminate, returning a correct answer. This is given as a function $T(n)$ of the size of the input $n$, which in the case of matrix multiplication we take to be the dimension of the matrices. So for example, the algorithm for naïve matrix multiplication is as given below, with the number of operations required to execute one iteration of each line given on the right.

$\begin{array}{|c|l|l} \texttt{1} & \texttt{ for i = 1...n: } & \text{ \small 1 operation} \\ \texttt{2} & \texttt{ \ \ for j = 1...n: } & \text{ \small 1 operation} \\ \texttt{3} & \texttt{ \ \ \ \ C[i,j] = 0 } & \text{ \small 1 operation} \\ \texttt{4} & \texttt{ \ \ \ \ for k = 1...n: } & \text{ \small 1 operation} \\ \texttt{5} & \texttt{ \ \ \ \ \ \ C[i,j] = C[i,j] + A[i,k] * B[k,j] } & \text{ \small 3 operations} \end{array}$

The time complexity of the algorithm is then calculated as follows. Beginning with the innermost loop on $k$ (lines 4–5) we see that each iteration takes four operations, one for the assignment to $k$ and three to update the $i,j$-th entry of the product $C$. This is iterated $n$ times, giving a total of $4n$ operations to execute the subloop indexed by $k$. The next outermost loop on $j$ (lines 2–5) requires $2 + 4n$ operations per iteration — one each for lines 2 and 3, and $4n$ for the $k$-subloop which is run for each value of $j$. This is iterated $n$ times, and so requires $n(2 + 4n)$ operations. Finally the outermost loop indexed by $i$ requires $1 + n(2 + 4n)$ ops per iteration, and is also iterated $n$ times. Thus the time complexity of naive matrix multiplication is $T(n) = n (1 + n (2 + 4n)) = n + 2n^2 + 4n^3$.

## Asymptotic upper bounds and Big-O

One might think that an issue arises due to the fact that our definitions of time complexity and “elementary operations” are purely abstract and not tied to any specific machine architecture. This is a fair concern since in reality when dealing with concrete implementations, an assignment to a variable may take some constant time $c_0$, while addition and multiplication of two real numbers may take some other constant times $c_1$ and $c_2$ respectively, and the time complexity should really be the amount of time taken for the algorithm to complete. If we then go through the same analysis as in the previous section, we get that $T(n) = n(K_1 + n(K_2 + K_3n)) = K_1 n + K_2 n^2 + K_3 n^3$ for some real constants $K_1, K_2, K_3$ which are sums of the times $c_i$ of each elementary operation. But while the constants in the expression for $T(n)$ change, its form does not — it is still a cubic with positive coefficients, and hence its behavior as the size of the input grows is qualitatively the same regardless of the values of $K_i$. This is good since when comparing the performance of two algorithms we are usually interested in how much faster one is than the other as the size of the input becomes large, i.e. we want to compare the asymptotic behavior of the complexities $T(n)$ as $n \rightarrow \infty$. Since changing the constants in $T(n)$ does not change its asymptotic behavior, it is fine to count time in terms of number of operations as we have, and carry out our analysis as before.

However it is not very helpful to carry around the clutter of arbitrary constants, given that they do not impact the asymptotic behavior as $n$ tends to infinity. We thus make the following definition.

Definition (Order of a function). Let $g \colon \mathbb{N} \rightarrow \mathbb{R}$. A function $f \colon \mathbb{N} \rightarrow \mathbb{R}$ is said to be of order $g$, or in $O(g)$ if there exists $n_0 \in \mathbb{N}$ and a real constant $K \geq 0$ such that for all $n > n_0$, $f(n) \leq Kg(n)$.

Intuitively this says that some non-negative multiple of $g$ eventually bounds $f$ from above. We call $g$ an asymptotic upper bound for $f$.

Using this language we can easily formalize and prove our intuition that naïve matrix multiplication is cubic:

Proposition. Naïve matrix multiplication has time complexity $T(n) \in O(n^3)$, and this bound is sharp. (That is, there is no constant $a < 3$ such that $T(n) \in O(n^a)$.)

Proof. We have seen that $T(n) = K_1n + K_2n^2 + K_3n^3$ for some positive $K_1, K_2, K_3$. Let $K = K_1 + K_2 + K_3$, then for all $n \in \mathbb{N},\ T(n) \leq K_1n^3 + K_2n^3 + K_3n^3 = Kn^3$, so $T(n)$ is of order $n^3$.

Sharpness of the bound is clear for $a \leq 0$. Suppose then that $0 < a < 3$. Then $3 - a > 0$, so that for any $K \geq 0$ there exists $n_0 \in \mathbb{N}$ such that $n^{3 - a} > K$ (equivalently, $n^3 > Kn^a$) for all $n \geq n_0$. Thus no multiple of $n^a$ eventually bounds $T(n) \geq n^3$ from above. $\square$

Asymptotic analysis is used all throughout complexity theory to quantify the performance of various algorithms, and in the next post in this series we will use this together with recursion relations to analyze the complexity of divide-and-conquer algorithms, of which Strassen’s algorithm is an example.

[1][Back to text]: There is some art involved in determining the time complexity of an algorithm, not least in terms of deciding what operations to count as elementary — to decide the size of the “smallest unit of work”, as it were. As a rule of thumb it is standard to take the previously-listed operations as well as boolean comparisons (i.e. tests for (in)equality, $\texttt{if}$ statements, etc.) which do not arise in the algorithms discussed in this post. A complication arises due to the fact that not all operations take the same time to complete. For instance, a variable assignment that prepends to the beginning of a long linked list is virtually instantaneous compared to one that appends to the end. In cases like this we may either choose to break the costlier operation down into multiple elementary operations, or to consider the entire operation itself to be elementary, and take the cost of faster operations to be negligible. The idea is to focus on the operations that are “important”, but deciding which these are seems to requires some combination of experience and common sense.