Dive into your assembly code!

When debugging or trying to maximize the performance of a program, it is often useful to look at the assembly code generated by the compiler. There are many tutorials on how to generate assembly code with the gcc command, but often you have to work with a large, existing codebase that is composed of many CMake targets and libraries. In this article I will compare different methods of reading assembly code from C/C++ projects. Typical use cases are:

When you are dealing with functions independent from the rest of the codebase, or with only a few files to analyze, the best universal tool is “Compiler Explorer”.
When you are working with a large project that uses CMake, you can use the default targets that CMake creates to see the assembly of each cpp file. You can modify the flags used to generate that assembly; details are in the “GCC” and “CMake” sections.
When you have a compiled object file, you can use the objdump tool to see the assembly code; see the “Assembly from object file” section.

Compiler Explorer

Compiler Explorer is a web-based tool that allows you to see the assembly code generated by the compiler in “real time” (every change to the source code triggers a recompilation). It was created by Matt Godbolt and is available at godbolt.org. It supports many compilers, so you can use it to compare the assembly code they generate. The assembly code is cleaned up and colorized, so it is easier to read. You can also see which lines of the assembly code correspond to which lines of the source code.

Compiler Explorer can also be run locally. It is open source and available on GitHub. To install it, you need to have Node.js installed. Clone the repository and run make in the root directory. This will install all the necessary dependencies, build the project, find the compilers installed on your system and start the http server. More detailed instructions can be found here.

But most of the time you will want to see the assembly of an existing project containing multiple files and libraries. Fortunately, Compiler Explorer supports CMake. When running locally, to load a project you have to:

From the top menu, select Add -> Tree (IDE Mode).
From the Tree menu, select Project -> Choose file and select the zipped project file.
Select the CMake checkbox.
Choose a build type for CMake that contains debug information (e.g. -DCMAKE_BUILD_TYPE=Debug).
Enter the name of the target you would like to compile.
Select Add new -> Compiler and wait for it to compile.

The C++ project I currently work on has 160 files, so Godbolt had a tough challenge, but the website handled compiling my project successfully. It wasn’t the smoothest experience, but that’s understandable. The site was very slow, and the project took 115 seconds to recompile after every change (locally it takes 10 seconds, because I can build it with 12 CPU cores). I had to change the timeout value, which was 10 seconds by default (GitHub issue here). To do that, create a file called etc/config/compiler-explorer.local.properties, relative to the root directory of the project, and add the following line: compileTimeoutMs=100000.

It was clear that the project is mainly meant to run on a server, not locally. I could not see the output CMake generated until the compilation had finished, and the files I uploaded to the website were a separate copy, not the ones I had on my computer.

It’s worth noting that the compiler-explorer project also provides a script called asm-parser, which cleans up assembly code.

Compiling with GCC

Simple answer: use the -S option, which tells the compiler to stop after the compilation stage, without running the assembler. The result is an assembly file with the .s extension.

g++ -S main.cpp -o main.s

Unless you can read AT&T assembly syntax, you will probably want to use the -masm=intel option to get Intel syntax.

There are a few options to make assembly code more readable. The -fverbose-asm option will add commentary from the original source code. It also adds the architecture and system information at the top of the file.

You will probably also want to remove the .cfi directives from your assembly, as they are only used for exception handling and debugging. To disable them, use the -fno-asynchronous-unwind-tables and -fno-exceptions options. The -fno-rtti and -fno-dwarf2-cfi-asm options can also be useful, as explained here.

g++ -S -fverbose-asm -fno-asynchronous-unwind-tables -fno-dwarf2-cfi-asm -masm=intel main.cpp -o main.s

For example the following C++ code:

int test (int n)
{
  int total = 0;

  for (int i = 0; i < n; i++)
    total += i * i;

  return total;
}

Will produce the following assembly code:

_Z4testi:
        endbr64 
        push    rbp     #
        mov     rbp, rsp        #,
        mov     DWORD PTR -20[rbp], edi # n, n
# main.cpp:29:   int total = 0;
        mov     DWORD PTR -8[rbp], 0    # total,
# main.cpp:31:   for (int i = 0; i < n; i++)
        mov     DWORD PTR -4[rbp], 0    # i,
# main.cpp:31:   for (int i = 0; i < n; i++)
        jmp     .L9     #
.L10:
# main.cpp:32:     total += i * i;
        mov     eax, DWORD PTR -4[rbp]  # tmp85, i
        imul    eax, eax        # _1, tmp85
# main.cpp:32:     total += i * i;
        add     DWORD PTR -8[rbp], eax  # total, _1
# main.cpp:31:   for (int i = 0; i < n; i++)
        add     DWORD PTR -4[rbp], 1    # i,
.L9:
# main.cpp:31:   for (int i = 0; i < n; i++)
        mov     eax, DWORD PTR -4[rbp]  # tmp86, i
        cmp     eax, DWORD PTR -20[rbp] # tmp86, n
        jl      .L10    #,
# main.cpp:34:   return total;
        mov     eax, DWORD PTR -8[rbp]  # _7, total
# main.cpp:35: }
        pop     rbp     #
        ret     

But assembly code won’t always be as readable as this. In most useful cases, you will compile your code with optimization enabled, which will make the assembly code harder to read, as the compiler can also change the order of instructions, remove some of them, or inline some functions.

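You can also try a different approach. Instead of creating an assembly file with the C++ lines as comments, you can create a listing of the source code interleaved with the compiled assembly. This can be done by passing listing options to the assembler with the -Wa option. The relevant assembler options are: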

-a[sub-option...]    turn on listings
                     Sub-options [default hls]:
                     c      omit false conditionals
                     d      omit debugging directives
                     g      include general info
                     h      include high-level source
                     l      include assembly
                     m      include macro expansions
                     n      omit forms processing
                     s      include symbols
                     =FILE  list to FILE (must be last sub-option)

We want the assembler sub-options d, h, l and n, so we add -Wa,-adhln. The output is not valid assembly code, but an assembly listing mixed with source code. You also have to pass the -g option to GCC, so that the assembler can match the source code with the assembly code. I also added the options that remove .cfi directives from the previous example.

g++ -g -fno-asynchronous-unwind-tables -fno-dwarf2-cfi-asm -masm=intel -Wa,-adhln  main.cpp > main.lst

  26:main.cpp      ****
  27:main.cpp      **** int test (int n)
  28:main.cpp      **** {
 159                            .loc 1 28 1
 160 0000 F30F1EFA              endbr64
 161 0004 55                    push    rbp
 162                    .LCFI9:
 163 0005 4889E5                mov     rbp, rsp
 164                    .LCFI10:
 165 0008 897DEC                mov     DWORD PTR -20[rbp], edi
  29:main.cpp      ****   int total = 0;
 166                            .loc 1 29 7
 167 000b C745F800              mov     DWORD PTR -8[rbp], 0
 167      000000
 168                    .LBB4:
  30:main.cpp      ****
  31:main.cpp      ****   for (int i = 0; i < n; i++)
 169                            .loc 1 31 12
 170 0012 C745FC00              mov     DWORD PTR -4[rbp], 0
 170      000000
 171                            .loc 1 31 3
 172 0019 EB0D                  jmp     .L9
 173                    .L10:
  32:main.cpp      ****     total += i * i;
 174                            .loc 1 32 16 discriminator 3
 175 001b 8B45FC                mov     eax, DWORD PTR -4[rbp]
 176 001e 0FAFC0                imul    eax, eax
 177                            .loc 1 32 11 discriminator 3
 178 0021 0145F8                add     DWORD PTR -8[rbp], eax
  31:main.cpp      ****     total += i * i;
 179                            .loc 1 31 3 discriminator 3
 180 0024 8345FC01              add     DWORD PTR -4[rbp], 1
 181                    .L9:
  31:main.cpp      ****     total += i * i;
 182                            .loc 1 31 21 discriminator 1
 183 0028 8B45FC                mov     eax, DWORD PTR -4[rbp]
 184 002b 3B45EC                cmp     eax, DWORD PTR -20[rbp]
 185 002e 7CEB                  jl      .L10
 186                    .LBE4:
  33:main.cpp      ****
  34:main.cpp      ****   return total;
 187                            .loc 1 34 10
 188 0030 8B45F8                mov     eax, DWORD PTR -8[rbp]
  35:main.cpp      **** }
 189                            .loc 1 35 1
 190 0033 5D                    pop     rbp
 191                    .LCFI11:
 192 0034 C3                    ret

However, the listing still contains .loc directives, which are used for debugging, so in my view the result of this method looks messier.

Another alternative is the -save-temps option, which saves all intermediate files (.s for the assembly output, .i for the preprocessed input file). It can easily be added to the compiler options in your build tool. The results are located in the build directory, inside the subdirectory for the CMake target.

Assembly from object file

Sometimes you may want to see the assembly code of an object file. You can use the objdump tool for that. For example:

objdump -d -M intel main.o > deas.out

Here the -d (--disassemble) option tells objdump to display the assembler code, and the -M intel option tells it to use Intel syntax.

Result:

00000000000012c9 <_Z4testi>:
    12c9:	f3 0f 1e fa          	endbr64 
    12cd:	55                   	push   rbp
    12ce:	48 89 e5             	mov    rbp,rsp
    12d1:	89 7d ec             	mov    DWORD PTR [rbp-0x14],edi
    12d4:	c7 45 f8 00 00 00 00 	mov    DWORD PTR [rbp-0x8],0x0
    12db:	c7 45 fc 00 00 00 00 	mov    DWORD PTR [rbp-0x4],0x0
    12e2:	eb 0d                	jmp    12f1 <_Z4testi+0x28>
    12e4:	8b 45 fc             	mov    eax,DWORD PTR [rbp-0x4]
    12e7:	0f af c0             	imul   eax,eax
    12ea:	01 45 f8             	add    DWORD PTR [rbp-0x8],eax
    12ed:	83 45 fc 01          	add    DWORD PTR [rbp-0x4],0x1
    12f1:	8b 45 fc             	mov    eax,DWORD PTR [rbp-0x4]
    12f4:	3b 45 ec             	cmp    eax,DWORD PTR [rbp-0x14]
    12f7:	7c eb                	jl     12e4 <_Z4testi+0x1b>
    12f9:	8b 45 f8             	mov    eax,DWORD PTR [rbp-0x8]
    12fc:	5d                   	pop    rbp
    12fd:	c3                   	ret    

We can see (here and in the previous examples) that the names of the functions are mangled and not easily readable. However, objdump has the -C option to demangle the names.

00000000000012c9 <test(int)>:
    12c9:	f3 0f 1e fa          	endbr64 
    12cd:	55                   	push   rbp
    ...

We can also see the assembly code of only one function with the --disassemble=name option.

objdump --disassemble="test(int)" -C -M intel main.o > deassembled_test.out

Note that without the -C option, the name of the function would have to be given in its mangled form.

We can also interleave the source code with the assembly using the -S option, but only when the object file was compiled with the -g option. Other useful options are --no-show-raw-insn, which hides the raw bytes of machine code, -r, which displays relocation entries, and -w, which disables line wrapping of long machine code lines. We can shorten -C -S -r to -CSr.

objdump --disassemble="test(int)" -CSr --no-show-raw-insn -M intel main.o > deassembled_test_dbg.out

0000000000001129 <test(int)>:
int test (int n)
{
    1129:	endbr64 
    112d:	push   rbp
    112e:	mov    rbp,rsp
    1131:	mov    DWORD PTR [rbp-0x14],edi
  int total = 0;
    1134:	mov    DWORD PTR [rbp-0x8],0x0

  for (int i = 0; i < n; i++)
    113b:	mov    DWORD PTR [rbp-0x4],0x0
    1142:	jmp    1151 <test(int)+0x28>
    total += i * i;
    1144:	mov    eax,DWORD PTR [rbp-0x4]
    1147:	imul   eax,eax
    114a:	add    DWORD PTR [rbp-0x8],eax
  for (int i = 0; i < n; i++)
    114d:	add    DWORD PTR [rbp-0x4],0x1
    1151:	mov    eax,DWORD PTR [rbp-0x4]
    1154:	cmp    eax,DWORD PTR [rbp-0x14]
    1157:	jl     1144 <test(int)+0x1b>

  return total;
    1159:	mov    eax,DWORD PTR [rbp-0x8]
}
    115c:	pop    rbp
    115d:	ret    

When using CMake

When using a Makefile generator, CMake creates additional targets for each .cpp file. They are named after the source file, but with a different extension. There are targets for:

assembly code before it is passed to the assembler, with the .s extension,
preprocessed source code, with the .i extension.

It is important to note that when a project uses many CMake subdirectories, the targets are created in the binary_dir of the target’s CMakeLists.txt and not in the base build directory. So if you would like to generate assembly code for the source file src/foo/bar/fun.cpp, you would run make src/foo/bar/fun.s from the appropriate build directory.

The generated assembly code is not very readable, because it has the same problems as the previous examples. Fortunately, we can modify the flags for these targets with the CMAKE_CXX_CREATE_ASSEMBLY_SOURCE variable, which is not documented but works.

set(CMAKE_CXX_CREATE_ASSEMBLY_SOURCE "<CMAKE_CXX_COMPILER> $(CXX_DEFINES) $(CXX_INCLUDES) ${CMAKE_CXX_FLAGS} -S -fverbose-asm -fno-asynchronous-unwind-tables -fno-exceptions -masm=intel <SOURCE> -o <ASSEMBLY_SOURCE>")

The names are still mangled, but we can demangle them with an auxiliary tool. Besides that, I encountered a strange bug: the add_compile_options command did not add its options to the assembly target (maybe because it was redefined later), and as a consequence the assembly targets were compiled with different flags than the release build we wanted to optimize. Also be careful when running these .s and .i targets without defining CMAKE_CXX_CREATE_ASSEMBLY_SOURCE explicitly, as in my case they included both the debug build flags and the release build flags.

Further reading: