Colin's Notebook

Matrix Multiply Performance in Rust

I was investigating some performance issues with my OpenGL rendering pipeline and noticed something strange. I would expect my application’s bottleneck to be the OpenGL driver itself, but it wasn’t.

The perf report shed some light on what was happening.

31.30%  snowmew-cube    0x0000000000499676
13.30%  snowmew-cube  snowmew-cube    vector::Array$Vec4::i::hcaf0628a1ba57867EHaL::v0.1
 4.92%  snowmew-cube  snowmew-cube    vector::Vector::anon::expr_fn::aK
 4.49%  snowmew-cube  snowmew-cube    matrix::Matrix::anon::expr_fn::aU
 4.40%  snowmew-cube  snowmew-cube    matrix::Array$Mat4::i::he1f1d1851d9b988yDaW::v0.1
 4.34%  snowmew-cube  snowmew-cube    cast::transmute::h1ffcc99bbb2bd736aN::v0.1
 4.01%  snowmew-cube  snowmew-cube    vector::Array$Vec4::as_slice::hf9de2b21ec3f764EHaM::v0.1
 3.45%  snowmew-cube  snowmew-cube    f32::Mul$f32::mul::hee1c5634fa6695bpFak::v0.1
 2.35%  snowmew-cube  snowmew-cube    clone::Clone$f32::clone::h90c1bbdb3b318f58v1au::v0.1
 2.19%  snowmew-cube  snowmew-cube    f32::Add$f32::add::h8e8e5f294b1db5c4GYab::v0.1

So the driver is only taking 31% of the total time. I expected this to be closer to 90%, so something was clearly wrong: the application is spending most of its time in the standard library. This turned out to be a silly problem; I had unexpectedly run into a bug in rustpkg.

Moving to a newer version of rust and recompiling, I see results closer to what I expect.

83.64%  snowmew-cube                        [.] 0x0000000001ba12ca
 4.50%  snowmew-cube                        [.] __memmove_ssse3_back
 4.09%  snowmew-cube  [kernel.kallsyms]                   [k] 0xffffffff81044ffa
 1.71%  snowmew-cube  snowmew-cube                        [.] main::anon::anon::expr_fn::aW
 1.27%  snowmew-cube                  [.] 0x0000000000073b75
 0.50%  snowmew-cube                        [.] __memset_sse2

As you can see, rust performs terribly without optimizations enabled, so I’m not surprised by my initial findings. My next question was how well matrix multiplication actually compares to the equivalent in C. I was using cgmath-rs for most of my testing, but I decided to write my own to test how well rust optimizes it.

struct mat4 {
    float dat[4][4];
};

struct mat4
naive_mult_m(const struct mat4 a, const struct mat4 b)
{
    int i, j, k;
    struct mat4 out = {
        dat: { {0., 0., 0., 0.},
               {0., 0., 0., 0.},
               {0., 0., 0., 0.},
               {0., 0., 0., 0.} }
    };

    for (i=0; i<4; i++) {
        for (j=0; j<4; j++) {
            for (k=0; k<4; k++) {
                out.dat[i][j] += a.dat[i][k] * b.dat[k][j];
            }
        }
    }

    return out;
}
The rust version is nearly identical:

struct Mat4 {
    dat: [[f32, ..4], ..4]
}

pub fn mult_m(a: Mat4, b: Mat4) -> Mat4 {
    let mut out = Mat4 {
        dat: [[0., 0., 0., 0.],
              [0., 0., 0., 0.],
              [0., 0., 0., 0.],
              [0., 0., 0., 0.]]
    };

    for i in range(0, 4) {
        for j in range(0, 4) {
            for k in range(0, 4) {
                out.dat[i][j] += a.dat[i][k] * b.dat[k][j];
            }
        }
    }

    out
}

In both of the naive examples we are passing by value rather than by reference. Since this function will ultimately be inlined for this benchmark, it should not matter.

Running the benchmarks returns the following results.

| Implementation | Time (seconds) | Branches per multiply |
|       Rust -O0 |          185.5 |                  3022 |
|          C -O0 |          19.37 |                   231 |
|       Rust -O3 |           6.86 |                   153 |
|          C -O3 |           1.05 |                     5 |

As you can see, the naive implementation in rust is quite slow compared to the C one. One of the reasons the rust implementation is slow without any optimization is the lack of inlining. You can see this in the perf report, or just by looking at the insane number of branches that are needed to do a single multiply. The fact that 30% of the time is spent in Range::next() is crazy.

33.57%  mat4  mat4    Mat4::mult_m::hb15ee85fd5e8382fvgap::v0.0
29.39%  mat4  mat4    iter::Iterator$Range::next::h9d319f95b1298dbboCax::v0.0
15.28%  mat4  mat4    int::generated::Ord$int::lt::h44fcab3da538dd3cdyaA::v0.0
 7.09%  mat4  mat4    int::generated::Add$int::add::h146b13e2df0ee9e7qaE::v0.0
 5.49%  mat4  mat4    option::Option::Some::hee10bda399bd6baF::v0.0
 4.39%  mat4  mat4    clone::Clone$int::clone::h831ca4a91d9d7a2UkaC::v0.0
 2.87%  mat4  mat4    iter::range::hc3c9f0403485c3b7as::v0.0
 1.07%  mat4  mat4    int::generated::One$int::one::h19f69f3e36a23f4eoav::v0.0

So why is rust still seven times slower than C? The number of branches gives a pretty good hint: a 4x4 matrix multiply is made up of 64 multiply-accumulates, so with 153 branches we are seeing more than 2.4 branches per multiply-accumulate. The C implementation, on the other hand, has only 5 branches, and that includes the for loop that I am using to profile this.

Let’s look at the naive implementation’s inner loop.

cmp    $0x3,%r13
ja     401048 <_ZN4main19h8bdb7b192642b8caaG4v0.0E+0x3a8>
lea    0x1(%r13),%r15
xor    %edx,%edx
mov    %rdx,%rax
mov    %r13,%rcx
shl    $0x4,%rcx
lea    -0x70(%rbp,%rcx,1),%rcx
lea    (%rcx,%rax,4),%rbx
lea    0x1(%rax),%rdx
mov    %rdi,%rsi
mov    %r8,%r12
xor    %ecx,%ecx
cmp    $0x4,%rcx
jae    401031 <_ZN4main19h8bdb7b192642b8caaG4v0.0E+0x391>
inc    %rcx
movss  (%rsi),%xmm1
mulss  (%r12,%rax,4),%xmm1
addss  (%rbx),%xmm1
movss  %xmm1,(%rbx)
add    $0x4,%rsi
add    $0x10,%r12
cmp    $0x4,%rcx
jl     400de0 <_ZN4main19h8bdb7b192642b8caaG4v0.0E+0x140>
cmp    $0x3,%rdx
jle    400dc0 <_ZN4main19h8bdb7b192642b8caaG4v0.0E+0x120>
add    $0x10,%rdi
cmp    $0x3,%r15
mov    %r15,%r13
jle    400db0 <_ZN4main19h8bdb7b192642b8caaG4v0.0E+0x110>

You can see it is doing three branches at the bottom, one for each of the loops, so there will be a bare minimum of 63 branches from just the range() operators. There are two other branches in this logic, near the top and the middle; they lead out of the loop to a call to task_fail. That extra logic is for bounds checking, and it sits in the innermost loop. Bounds checking adds an extra 68 branches, bringing our total up to 131. Considering these bounds checks can never fail, they are not helpful in this case.

The bigger problem is much more obvious when you compare this with the C output.

nopl   0x0(%rax)
movups 0xc0(%rsp,%rcx,1),%xmm0
movss  0x80(%rsp,%rcx,1),%xmm5
movss  0x84(%rsp,%rcx,1),%xmm6
pshufd $0x0,%xmm5,%xmm5
mulps  %xmm1,%xmm5
addps  %xmm0,%xmm5
pshufd $0x0,%xmm6,%xmm0
mulps  %xmm2,%xmm0
addps  %xmm5,%xmm0
movss  0x88(%rsp,%rcx,1),%xmm5
pshufd $0x0,%xmm5,%xmm5
mulps  %xmm3,%xmm5
addps  %xmm0,%xmm5
movss  0x8c(%rsp,%rcx,1),%xmm0
pshufd $0x0,%xmm0,%xmm0
mulps  %xmm4,%xmm0
addps  %xmm5,%xmm0
movups %xmm0,0xc0(%rsp,%rcx,1)
add    $0x10,%rcx
cmp    $0x40,%rcx
jne    4008a0 <main+0xc0>

This loops four times to do the matrix multiply. In each loop iteration an entire row of the matrix is calculated, versus a single element in the rust implementation. This is what I was hoping to see from the rust compiler.

I’m pretty sure that someday rust will be able to do this automatically, but I want the performance now, not later. My next thought was to remove the loops altogether. This is pretty common in C, and can also be easily done in rust.

First, the C version:

#define DOT(OUT, A, B, i, j) \
    OUT[i][j] = A[i][0] * B[0][j] + \
                A[i][1] * B[1][j] + \
                A[i][2] * B[2][j] + \
                A[i][3] * B[3][j]
struct mat4
unroll_mult_m(const struct mat4 a, const struct mat4 b)
{
    struct mat4 out;

    DOT(out.dat, a.dat, b.dat, 0, 0);
    DOT(out.dat, a.dat, b.dat, 0, 1);
    DOT(out.dat, a.dat, b.dat, 0, 2);
    DOT(out.dat, a.dat, b.dat, 0, 3);

    DOT(out.dat, a.dat, b.dat, 1, 0);
    DOT(out.dat, a.dat, b.dat, 1, 1);
    DOT(out.dat, a.dat, b.dat, 1, 2);
    DOT(out.dat, a.dat, b.dat, 1, 3);

    DOT(out.dat, a.dat, b.dat, 2, 0);
    DOT(out.dat, a.dat, b.dat, 2, 1);
    DOT(out.dat, a.dat, b.dat, 2, 2);
    DOT(out.dat, a.dat, b.dat, 2, 3);

    DOT(out.dat, a.dat, b.dat, 3, 0);
    DOT(out.dat, a.dat, b.dat, 3, 1);
    DOT(out.dat, a.dat, b.dat, 3, 2);
    DOT(out.dat, a.dat, b.dat, 3, 3);

    return out;
}

Now for the rust version:

macro_rules! dot(
    ($A:expr, $B:expr, $I:expr, $J:expr) => (
        $A.dat[0][$I] * $B.dat[$J][0] +
        $A.dat[1][$I] * $B.dat[$J][1] +
        $A.dat[2][$I] * $B.dat[$J][2] +
        $A.dat[3][$I] * $B.dat[$J][3]
    )
)

pub fn mult_m(a: Mat4, b: &Mat4) -> Mat4 {
    Mat4 {
        dat: [[dot!(a, b, 0, 0), dot!(a, b, 1, 0), dot!(a, b, 2, 0), dot!(a, b, 3, 0)],
              [dot!(a, b, 0, 1), dot!(a, b, 1, 1), dot!(a, b, 2, 1), dot!(a, b, 3, 1)],
              [dot!(a, b, 0, 2), dot!(a, b, 1, 2), dot!(a, b, 2, 2), dot!(a, b, 3, 2)],
              [dot!(a, b, 0, 3), dot!(a, b, 1, 3), dot!(a, b, 2, 3), dot!(a, b, 3, 3)]]
    }
}
Now for the numbers:

|    Implementation | Time (seconds) | Branches per multiply |
|          Rust -O0 |          185.5 |                  3022 |
|             C -O0 |          19.37 |                   231 |
|          Rust -O3 |           6.86 |                   153 |
|             C -O3 |           1.05 |                     5 |
| unrolled Rust -O0 |          22.84 |                   548 |
|    unrolled C -O0 |           2.66 |                    16 |
| unrolled Rust -O3 |           0.86 |                     1 |
|    unrolled C -O3 |           0.74 |                     1 |

Holy! Rust is faster than the naive C implementation, and within 15% of the performance of the unrolled C implementation. Neither the C nor the rust implementation has any branches in the matrix multiply now. Let’s look at the assembly of the rust version to get an idea of what happened.

unpcklps %xmm14,%xmm4
unpcklps %xmm0,%xmm8
unpcklps %xmm4,%xmm8
unpcklps %xmm9,%xmm3
unpcklps %xmm11,%xmm5
unpcklps %xmm3,%xmm5
movaps %xmm8,%xmm0
xorps  %xmm9,%xmm9
mulps  %xmm9,%xmm0
unpcklps -0xe0(%rbp),%xmm13
unpcklps %xmm15,%xmm1
unpcklps %xmm13,%xmm1
movaps %xmm1,%xmm2
mulps  %xmm9,%xmm2
movaps %xmm2,%xmm3
addps  %xmm0,%xmm3
movaps %xmm5,%xmm4
mulps  %xmm9,%xmm4
addps  %xmm8,%xmm2
addps  %xmm0,%xmm1
unpcklps -0xd0(%rbp),%xmm10
movaps %xmm4,%xmm7
addps  %xmm3,%xmm7
unpcklps %xmm12,%xmm6
unpcklps %xmm10,%xmm6
addps  %xmm6,%xmm7
pshufd $0x1,%xmm7,%xmm10
movaps %xmm7,%xmm12
movhlps %xmm12,%xmm12
addps  %xmm5,%xmm3
addps  %xmm4,%xmm1
pshufd $0x3,%xmm7,%xmm0
movdqa %xmm0,-0xd0(%rbp)
addps  %xmm4,%xmm2
mulps  %xmm9,%xmm6
addps  %xmm6,%xmm2
dec    %rax
addps  %xmm6,%xmm1
addps  %xmm3,%xmm6
pshufd $0x1,%xmm2,%xmm4
pshufd $0x3,%xmm2,%xmm14
movaps %xmm2,%xmm0
movhlps %xmm0,%xmm0
movaps %xmm6,%xmm5
pshufd $0x1,%xmm1,%xmm13
pshufd $0x3,%xmm1,%xmm3
movdqa %xmm3,-0xe0(%rbp)
movaps %xmm1,%xmm15
movhlps %xmm15,%xmm15
pshufd $0x1,%xmm6,%xmm3
pshufd $0x3,%xmm6,%xmm9
movhlps %xmm6,%xmm6
movaps %xmm6,%xmm11
movaps %xmm2,%xmm8
movaps %xmm7,%xmm6
jne    400d30 <_ZN4main19h14be583994e2f6f5aN4v0.0E+0x90>

This assembly could have come from C/C++ and I would not have questioned it. As far as I can tell there is no extra overhead in this loop caused by using rust. The 15% performance difference could be from a different version of LLVM. So I’m pretty happy with the results.

For the sake of thoroughness, I thought I would share the comparison with cgmath’s matrix multiply, since it is what kicked off this entire investigation.

|    Implementation | Time (seconds) | Branches per multiply |
|          Rust -O0 |          185.5 |                  3022 |
|          Rust -O3 |           6.86 |                   153 |
| unrolled Rust -O0 |          22.84 |                   548 |
| unrolled Rust -O3 |           0.86 |                     1 |
|   cgmath Rust -O0 |         428.11 |                  5138 |
|   cgmath Rust -O3 |            3.1 |                     1 |

cgmath without optimizations is very slow. With them, it is 3.6 times slower than the unrolled version. My hunch was a lack of auto-vectorization, but the disassembly suggests otherwise. It’s too long to post here, but I’ll link to it.

I’m not sure what is happening with cgmath. The matrix multiply is done nearly identically to what I just showed. The extra layers of indirection may cause llvm to miss optimization opportunities. More investigation is needed.

fn mul_m(&self, other: &Mat4<S>) -> Mat4<S> {
    Mat4::new(self.r(0).dot(other.c(0)), self.r(1).dot(other.c(0)), self.r(2).dot(other.c(0)), self.r(3).dot(other.c(0)),
              self.r(0).dot(other.c(1)), self.r(1).dot(other.c(1)), self.r(2).dot(other.c(1)), self.r(3).dot(other.c(1)),
              self.r(0).dot(other.c(2)), self.r(1).dot(other.c(2)), self.r(2).dot(other.c(2)), self.r(3).dot(other.c(2)),
              self.r(0).dot(other.c(3)), self.r(1).dot(other.c(3)), self.r(2).dot(other.c(3)), self.r(3).dot(other.c(3)))
}

fn r(&self, r: uint) -> V {
    build(|i| self.i(i).i(r).clone())
}

fn c<'a>(&'a self, c: uint) -> &'a V { self.i(c) }

fn i<'a>(&'a self, i: uint) -> &'a $T {
    &'a self.as_slice()[i]
}

fn dot(&self, other: &Self) -> S { self.mul_v(other).comp_add() }

fn mul_v(&self, other: &Self) -> Self { build(|i| self.i(i).mul(other.i(i))) }


Raw data can be found here

Gist of all source

Software versions used:

rustc 0.9-pre (b60b411 2013-11-16 20:26:01 -0500)
Ubuntu clang version 3.4-1~exp1 (trunk) (based on LLVM 3.4)
icc (ICC) 14.0.1 20131008
gcc-4.8 (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1