Achievement unlocked: Beat the staff
The CS 61C Performance Contest is a project where students optimize given C code, this time a convolutional neural network, to run as fast as possible. I worked solo, finally achieved a 29.968x speedup, and earned the extra beat-the-staff credits. It was a wonderful journey and I would like to share some of the techniques I used.

Gradescope result
Disclaimer: I am not a UC Berkeley student and I don’t have access to the hive machines. Although the project states that the code needs to run on a hive machine, there is no major obstacle to debugging and running it on a Unix-like OS such as macOS, as long as your CPU supports the Haswell instruction set. The dataset on the hive machines is also publicly available: I downloaded the CIFAR dataset and ran the code on a MacBook Pro (2016).

Profiling

The project introduces some useful gprof commands for profiling. Unfortunately gprof is not supported on macOS, so I used pprof instead, which can generate a similar gprof-like flat profile.

$ cc -lm -lprofiler -Wall -Wpedantic -Wextra -march=haswell -std=c99 -fopenmp -O3 -o benchmark benchmark.o network.o layers.o volume.o
$ CPUPROFILE=/tmp/prof.out ./benchmark benchmark 1200
$ pprof --top ./benchmark /tmp/prof.out

Here is an (incomplete) example of the output:

 flat  flat%   sum%        cum   cum%
0.03s  0.51% 99.32%      0.03s  0.51%  _posix_madvise
0.01s  0.17% 99.49%      0.20s  3.42%  _load_batch

Interpret the profile to find out which routines occupy the largest share of the running time.

A beard well lathered is half shaved.

The profiling command is used so often that I would recommend writing your own Makefile rule, e.g. make benchmark_profile, from the very beginning so that you can profile your program with minimal cognitive load.

Common Techniques

Hoist memory operations

Hoist memory operations out of nested loops if you know the values do not change inside the loop. For example:

for (int i = 0; i < 1 << 16; i++) {
  baz();
  foo->bar->quz[i] = i;
}

The compiler cannot know whether foo->bar->quz points to the same memory location across loop iterations (baz() might change it). Therefore it generates three mov (lw in RISC-V) instructions inside the loop. If you, Homo sapiens, know in advance that foo->bar->quz is a loop invariant, explicitly convey that fact to the compiler by hoisting the memory operations.

int* quz = foo->bar->quz;
for (int i = 0; i < 1 << 16; i++) {
  baz();
  quz[i] = i;
}

There is a common misunderstanding that you should hoist everything you can. No! Any modern optimizer will hoist simple math operations for you (but not memory operations, since hoisting them is not always semantically equivalent). For example, you should not worry about whether you need to hoist j * j; the optimizer will do this automatically. Hoisting every simple math operation simply wastes your precious time.

int* quz = foo->bar->quz;
int j = 42;
for (int i = 0; i < 1 << 16; i++) {
  baz();
  quz[i] = i + j * j;
}
Eliminate branches

Scrutinize the branch operations in deeply nested loops, especially those whose conditions depend only on the loop counter. You can replace such branches with carefully chosen counter ranges.

for (int i = 0; i < 1 << 16; i++) {
  if (i >= m && i < n) {
    /* do something on i */
  }
}

In the example above, we can get rid of the if branch by having i loop from max(0, m) to min(1 << 16, n).
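
A minimal sketch of the branch-free version (MIN and MAX here are hypothetical helper macros, not part of the project code):

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* The i >= m && i < n check is folded into the loop bounds,
 * so the body no longer branches on every iteration. */
for (int i = MAX(0, m); i < MIN(1 << 16, n); i++) {
  /* do something on i */
}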

Minimize the number of variables involved in deep loops

Use your math skills to minimize the number of variables inside deep loops. For example, let’s say you have an 8-register CPU running the following code:

loop1:
for (int i = 0; i < 1 << 16; i++) {
  /* do something with j, k, l, m, p */
  q = m * k;
loop2:
  for (int n = 0; n < 1 << 16; n++) {
    quz = (n + m) * k;
  }
  /* do something with j, k, l, m, p, q */
}

Before jumping into loop2, which involves four variables, the compiler has to spill at least one register used in loop1 onto the stack and restore it after loop2 finishes. This spilling is part of register allocation.

Apparently quz = (n + m) * k can be simplified to quz += k if one initializes quz = q.

for (int i = 0; i < 1 << 16; i++) {
  /* do something with j, k, l, m, p */
  q = m * k;
loop2:
  for (int n = 0, quz = q; n < 1 << 16; n++, quz += k) { }
  /* do something with j, k, l, m, p, q */
}

In the latter case loop2 still involves four variables (n, quz, k, q), but since q is reused we don’t have to save anything on the stack. If, say, p lives in %rbx, you have saved 65536 push %rbx and 65536 pop %rbx instructions.

As a byproduct you have also reduced the strength of the computation from (n + m) * k to quz + k, but the bottleneck here is the mov, since the latency difference between addition (~0.5 cycles) and multiplication (~1.75 cycles) is smaller than the L1 cache latency (~3 cycles).

Inline small static functions, copy & paste if compiler doesn’t

Functions with many arguments incur register pressure, spills, and reloads, which means many more mov instructions in the prologue and epilogue of the called function. You may know that in C we can declare a function inline so the compiler tries to merge its instructions into wherever it is called. But any modern optimizer will automatically inline small functions even if you don’t mark them inline.

#include "helpers.h"

/* will be automatically inline in most cases */
static int foo(int x) { return x + 1; }

int main() {
foo(0);
/* would not be inline if bar is defined in helpers.h */
bar();
return foo(0);
}

However, there are cases where the compiler cannot inline a function. For example, if a small function is declared in a header but defined in another source file, the compiler will not inline it because its body lives in another object file. But if we know it is small, we can simply copy and paste it into our own file as a static function, so that the critical path executes an inlined version of these small helpers. By doing this we get rid of the function-call overhead: the extra spills, reloads, and stack memory accesses.

#include "helpers.h"

/* copy pasted from helpers.h */
static int bar(int x) { return x + 2; }

/* will be automatically inline in most cases */
static int foo(int x) { return x + 1; }

int main() {
foo(0);
/* would be inline since bar is defined in same object file */
bar();
return foo(0);
}

You may have a sense by now that our performance optimizations actually focus on the number of memory operations. And yes, accessing memory or even cache is far slower than computation. The more mov instructions you eliminate, the more performance you gain.

OpenMP

Although the OpenMP section comes last in the project, I personally recommend applying OpenMP at an earlier stage, as long as you have figured out where to add the pragmas. They should be added to the largest parallelizable computation units with the fewest input variables. Once you have configured OpenMP correctly, you get an independent 5-6x boost on a 4-core, 8-thread machine. Combining OpenMP with the common techniques above gives you about a 12x boost.
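
As an illustration, here is a minimal sketch of the kind of pragma placement I mean (classify_batch, classify_one_image, network_t, and batch_t are hypothetical names for this example, not the actual project code):

#include <omp.h>

/* Each image in the batch is classified independently, so the loop
 * iterations share no mutable state and can run on separate threads. */
void classify_batch(network_t *net, batch_t *batches, int batch_size) {
  #pragma omp parallel for schedule(static)
  for (int img = 0; img < batch_size; img++) {
    classify_one_image(net, batches, img);
  }
}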

If you are running the project on macOS like me, you may be terrified to find that every partest fails on macOS. Note that partest simply tests your program against rand()-indexed samples of the dataset (code), and macOS has a different rand() implementation from Linux. So you have to generate the partest reference output from the reference implementation yourself.

I tried different OpenMP schedulers, but they don’t differ much here. I think it’s better to keep it simple.

Unrolling

A modern optimizer will do loop unrolling for you, so you don’t have to unroll unless you have a good reason, that is, unless you know something the compiler doesn’t. I will come back to unrolling later.

Specialization

Before talking about SIMD instructions, I would like to shed some light on specialization, a way to convey extra information to the compiler. Say the critical part is a vector operation such as a dot product: the compiler cannot know the length of the vectors at compile time, but you can always print it at runtime to see whether it reveals a pattern. After reaching 12x I was stuck for a while. The breakthrough came when I noticed that the vector length only takes a few discrete values. I added a switch to delegate dotProduct to specialized versions: 3x3, 16x16, and 20x20. The specialized dotProducts get rid of the deepest for loop, and we no longer need to allocate a register to its loop counter.
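
A minimal sketch of that dispatch (the generic fallback and the exact signatures are my assumptions for illustration):

/* len only ever takes a few discrete values at runtime, so dispatch to
 * fixed-length versions whose inner loop disappears entirely. */
switch (len) {
  case 3:
    dotProduct3_3(dst, x, y);
    break;
  case 16:
    dotProduct16_16(dst, x, y);
    break;
  case 20:
    dotProduct20_20(dst, x, y);
    break;
  default:
    dotProduct(dst, x, y, len); /* generic fallback, assumed */
    break;
}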

SIMD instructions or Assembly Programmer

Hotpath

The SIMD versions of the 16x16 and 20x20 dot products are straightforward to implement; I left the 3x3 dot product as-is because I did not find any significant speed improvement there. The SIMD 16x16 and 20x20 versions boost performance up to 16x. You could relax from here on; judging by GitHub search results, some students actually stop at this point.
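
For reference, here is a minimal sketch of what a SIMD 16x16 dot product can look like with AVX intrinsics (my own illustration, not the staff solution; I assume dst accumulates the scalar result, matching the signatures used below):

#include <immintrin.h>

/* 16 doubles per operand = 4 YMM loads each; multiply, accumulate,
 * then reduce the four lanes into a scalar. */
static void dotProduct16_16(double *dst, const double *x, const double *y) {
  __m256d acc = _mm256_setzero_pd();
  for (int i = 0; i < 16; i += 4) {
    __m256d vx = _mm256_loadu_pd(x + i);
    __m256d vy = _mm256_loadu_pd(y + i);
    acc = _mm256_add_pd(acc, _mm256_mul_pd(vx, vy));
  }
  double lanes[4];
  _mm256_storeu_pd(lanes, acc);
  *dst += lanes[0] + lanes[1] + lanes[2] + lanes[3];
}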

But if you want to beat the staff like I did, this is only the beginning. It is extremely important to profile how often, or what percentage of time, is spent in dotProduct3_3, dotProduct16_16, and dotProduct20_20. You may find that these helper functions are inlined under -O3, so they don’t show up in the profile, but we can use an attribute to tell the compiler not to inline them while profiling.

__attribute__((noinline))
static void dotProduct3_3(double * dst, const double * dp_x, const double * dp_y)

Now you can see them in the profile:

0.35s  8.86% 64.05%      0.35s  8.86%  _dotProduct16_16
0.25s  6.33% 70.38%      0.25s  6.33%  _dotProduct3_3
0.08s  2.03% 93.42%      0.08s  2.03%  _dotProduct20_20

Apparently, dotProduct16_16 and dotProduct3_3 are the hot path.

Devtools: Make a profiler switch

You will soon find it cumbersome to toggle __attribute__((noinline)) back and forth. Thanks to the C preprocessor, we can define a __MODIFIER__ macro:

#ifdef _PROFILE_
#define __MODIFIER__ __attribute__((noinline))
#else
#define __MODIFIER__
#endif

__MODIFIER__
static void dotProduct3_3(double * dst, const double * dp_x, const double * dp_y)

and create a profile-switch.h which simply defines the _PROFILE_ macro:

#define _PROFILE_

Then you include profile-switch.h only when building for profiling:

CC=/usr/local/opt/llvm/bin/clang
CFLAGS?=-Wall -Wpedantic -Wextra -march=haswell -std=c99 -fopenmp -O3
PROFILE_CFLAGS=-lprofiler -Wall -Wpedantic -Wextra -march=haswell -std=c99 -fopenmp -O3

benchmark_profile : benchmark.o network.o layers_profile.o volume.o
	$(CC) $(PROFILE_CFLAGS) -o benchmark benchmark.o network.o layers_profile.o volume.o -lm
	CPUPROFILE=/tmp/prof.out ./benchmark benchmark
	pprof --top ./benchmark /tmp/prof.out

layers_profile.o : layers.c layers.h volume.h
	$(CC) $(CFLAGS) -include profile-switch.h -c layers.c -o layers_profile.o
Destroy small loops

I mentioned that a SIMD version of dotProduct3_3 does not significantly improve performance. Let’s analyze why. In dotProduct3_3 we would do two vmovupd (_mm256_loadu_pd) instructions, each loading four 64-bit memory cells, interpreted as doubles, into a YMM register. But we are only computing a 3x3 dot product: we waste 1/4 of the memory bandwidth!

If you take a closer look at the dotProduct3_3 call sites, you will find the memory accesses are contiguous, i.e.

dotProduct3_3(dst, &a[0], &b[0]);
dotProduct3_3(dst, &a[3], &b[3]);
dotProduct3_3(dst, &a[6], &b[6]);

which means a loop of five consecutive dotProduct3_3 operations is equivalent to one dotProduct15_15 operation.

By checking the runtime state we find that this loop bound is discrete, too. Profiling then reveals that dotProduct15_15 is the new hot path. We can estimate that a dotProduct15_15 is about 25% faster than five consecutive dotProduct3_3 calls. Why? Because we only need 8 vmovupd for dotProduct15_15, the same bound as dotProduct16_16.

You may expand the details below if you are interested in the tail-case processing.
__m256d x_tail = _mm256_loadu_pd(x + 11); // x[14], x[13], x[12], x[11]

x_tail = _mm256_blend_pd(x_tail, _mm256_setzero_pd(), 0x1); // x[14], x[13], x[12], 0

_mm256_mul_pd(x_tail, _mm256_loadu_pd(y + 11));
// returns x[14] * y[14], x[13] * y[13], x[12] * y[12], 0

In a way this is also a kind of unrolling, over just one loop, but we have done more than plain unrolling: we have reduced the number of vmovupd instructions.

Some extra questions to think about: do we need to rewrite three dotProduct3_3 calls as a dotProduct9_9? Do we need to rewrite three dotProduct16_16 calls? How about dotProduct20_20?

This change brought me to a 22x boost on Aug. 9.

Unrolling, or abusing the preprocessor?

The next thing I tried was unrolling dotProduct16_16. Note that rewriting dotProduct16_16 into a larger dotProduct does not help much, since we cannot decrease the number of vmovupd instructions; it is already optimal in that respect. But we can still get rid of the surrounding loop by unrolling. dotProduct16_16 sits inside loops that run 3, 4, or 5 times. I ended up writing a macro which simply expands to a for loop with a fixed bound. It looks like this:

#define dotProductRoutine16_16(Round) {\
  for (int i = 0; i < Round; i++) {\
    dotProduct16_16(dotProductSum, a[/* index from i */], b[/* index from i */]);\
  }\
}

switch (round) {
  case 5:
    dotProductRoutine16_16(5);
    break;
  case 4:
    dotProductRoutine16_16(4);
    break;
  default:
    dotProductRoutine16_16(3);
    break;
}

The compiler does a good job optimizing those fixed-bound loops, and I got to 25x. But I knew from the comments that the staff had a 27x implementation in hand, so I definitely had to do more to get the extra credit.

Memory access

For the sake of organization, I have placed memory access in its own section. I kept wondering whether there was anything I could improve in the general memory access patterns beyond the convolution step. It turns out there are several aspects you can look into:

  • Try printing the values assigned right after a big malloc; if they are always zero, consider switching to the hardware-friendly calloc (see the sketch after this list). On macOS this ends up in a special __platform_bzero$VARIANT$Haswell call, which is far faster than walking the memory and writing zeros one by one. Yes, we are still decreasing the number of mov instructions.
  • If the assigned value is sometimes zero and sometimes not, consider a memset followed by writes only for the non-zero values.
  • Is there any loop pattern that results in a high cache miss rate? For example, is there a write step where a whole cache line is loaded but only one value is written? Consider reordering the loop levels to better leverage the CPU caches.
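
For the first point, here is a minimal sketch of the idea (alloc_zeroed_slow and alloc_zeroed_fast are illustrative names, not functions from the project):

#include <stdlib.h>

/* Before: allocate, then zero every element with a loop of stores. */
double *alloc_zeroed_slow(size_t n) {
  double *p = malloc(n * sizeof(double));
  for (size_t i = 0; i < n; i++) p[i] = 0.0;
  return p;
}

/* After: calloc hands back zeroed memory; the zeroing is done by an
 * optimized routine (it shows up as __platform_bzero$VARIANT$Haswell on
 * macOS) or avoided entirely when the OS supplies already-zeroed pages. */
double *alloc_zeroed_fast(size_t n) {
  return calloc(n, sizeof(double));
}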

All these efforts paid off: they brought me to 29.968x on Aug. 12.

Performance reports on my local machine

Flat profile of classifying 1200 images

 flat  flat%   sum%        cum   cum%
1.15s 64.97% 64.97%      1.16s 65.54%  _conv_forward
0.22s 12.43% 77.40%      0.22s 12.43%  __platform_bzero$VARIANT$Haswell
0.19s 10.73% 88.14%      0.19s 10.73%  _swtch_pri
0.07s  3.95% 92.09%      1.43s 80.79%  [libomp.dylib]
0.04s  2.26% 94.35%      0.04s  2.26%  _posix_madvise
0.03s  1.69% 96.05%      0.03s  1.69%  _read$NOCANCEL
0.02s  1.13% 97.18%      0.02s  1.13%  _pool_forward
0.01s  0.56% 97.74%         1s 56.50%  _.omp_outlined.
0.01s  0.56% 98.31%      0.01s  0.56%  __kernelrpc_mach_vm_deallocate_trap
0.01s  0.56% 98.87%      0.04s  2.26%  _free_small
0.01s  0.56% 99.44%      0.01s  0.56%  _lseek
0.01s  0.56%   100%      0.01s  0.56%  _relu_forward

The other forward operations have been optimized for memory access. _conv_forward, __platform_bzero$VARIANT$Haswell (zero-filling memory), and _swtch_pri (a macOS libsystem_kernel call) are the top three most compute-intensive calls.

Comparison to baseline: classifying 2400 images

./benchmark benchmark 2400
RUNNING BENCHMARK ON 2400 PICTURES...
Making network...
Loading batches...
Loading input batch 0...
Running classification...
79.083333% accuracy
820224 microseconds
./benchmark_baseline benchmark 2400
RUNNING BENCHMARK ON 2400 PICTURES...
Making network...
Loading batches...
Loading input batch 0...
Running classification...
79.083333% accuracy
36103328 microseconds

When classifying more than 1200 images, the threading overhead is amortized and the speedup improves to over 40x.

Summary

Performance programming is like collaborating with the compiler: you want to feed it more facts so that it can help you get rid of superfluous memory operations. I learned C ten years ago, but only in this project and the other CS 61C projects did I feel a real connection with this so-called stone-age language. Its great flexibility and straightforward memory layout enable many ways to tap the hidden power of our hardware.

A beard well lathered is half shaved. Always spend some time on your development tools: scaffolds, mnemonics, command + R… Switching context to tooling not only gives your brain a break but also keeps the fire burning, so you won’t end up blaming yourself: I once went a whole day without a single commit!

CS 61C is the first course I have put so much effort into since I graduated. I can’t believe I finished all the projects and homework without access to TAs or the hive machines. It has inspired me a lot and proved my learning ability to myself. Thank you, CS 61C staff, for offering such a great course and intriguing projects. Thank you, my friend Wèi Cōngruì, for the discussions on memory layout issues. And thank you, University of Waterloo, for offering a working environment to a new immigrant like me, for free.