GCC Optimization Options
06/09/2013
I’ve been really curious about GCC’s optimization options and how different options affect my code, so I spent the better part of the weekend working with a simple Mandelbrot projection displayer written in C. It draws the classic graphic representation of the set in an X window. It’s a great test case: it uses floating-point math, calculations with complex numbers, and OpenGL, so it burns some pretty heavy-duty CPU cycles to draw the projection. The test case is here if you’re interested in benchmarking your own system similarly. Start it from a shell, and after you close the display window the execution time will be displayed in seconds.
First, as a baseline, I compiled the program without any options; the projection appeared on my desktop in about 15 seconds. Then I added an option I knew would make a difference: the architecture option. Its current form is “-march=”, which tells the compiler to target the specified CPU architecture. I knew I had a Core i5 CPU, but to make sure I ran “lscpu” in a shell and got this:
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
CPU MHz: 800.000
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Ok, good to know, but not what I was looking for; this is all stuff I already knew. Fortunately, in my bag of tricks I have this:
gcc -c -Q -march=native --help=target | grep march. On my machine the grep shows that “native” resolves to “corei7-avx”. Thoughtful of gcc to report exactly what I need to pass back to it. So, plugging that into my compile command:
gcc -Wall -lGL -lglut -lGLU -march=corei7-avx mand.c -o mand, I got the execution time down to 13 seconds. Better, but… can we do better? Absolutely. We haven’t even told the compiler to optimize the code in its own right yet. The “-O” flag is the one I’m talking about. According to the docs it takes either a numeric parameter or one of a few alphabetic ones. The numeric parameter sets the optimization “level”, from 0 (the default, no optimization) to 3 (the most aggressive).
So plugging “-O3” into my command line, the execution time drops to a little over 9 seconds. Pretty good. But is that the best? I see an alphabetic option, “-Ofast”, which turns on “-O3” plus “-ffast-math”. That looks promising. Plugging it in, I get an execution time of over 10 seconds. What? Not even helpful. Ok, I see one option left: “-Os”, or “optimize for size.” Let’s check it out. I plug it in and notice a full 1-second improvement over the command that only employed the arch option. Wow, quite a difference. It seems smaller code fits the instruction cache better, so less time is spent fetching instructions. Makes sense, but I really didn’t think it would make that much of a difference. Wrong!
Ok, I have one last collection of options I’d like to try: everything. Or at least everything suggested by a gcc one-liner I found the other day:
echo "" | gcc -march=native -v -E - 2>&1 | grep cc1. This appears to actually be all the options supported by the compiler on the current platform, I really don’t think its a suggestion of what I should be using as appeared to be worded by the blogger who wrote it. And I’m right, using the string returned by this grep in my compile invocation the program runs as slowly as it did without any optimizations.
I played with many more options and variations on options than I have written about here, including “-mtune”, which made no difference; that stands to reason, since “-march” already implies “-mtune” for the same CPU. Another option that surprisingly made no difference was “-funroll-loops”; surprising to me because there are a number of loops in the program, especially in the init phase. Ultimately you need to use timing in your code and common sense in your brain to get the most out of gcc, like anything else.
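For what it’s worth, the transformation “-funroll-loops” performs can be written by hand. A sketch with a hypothetical helper (not from the post’s source; assumes n is a multiple of 4), showing the trade: bigger code in exchange for fewer branch tests and more independent additions per iteration:

```c
/* Manual 4x unrolling -- mechanically what -funroll-loops does.
   Hypothetical helper; assumes n is a multiple of 4. */
double sum4(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {  /* one loop test per 4 elements */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Whether this wins depends on the loop body; when it doesn’t (as in my timings above), the extra code size can cancel out the saved branches.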
Yeah, that’s right.
PS: I shaved an entire second off the runtime by simply declaring my constants.