Introduction to the Stack
This is part 2 of a multi-part series.
- Stack and Heap for Swift (and other) Programmers
- The Virtual Address Space
- Introduction to the Stack (you are here)
- Limitations of the Stack (coming soon)
- Introduction to the Heap
- Stack and Heap for Swift Programmers
- Stack and Heap for SwiftUI programmers
In the last post, we discussed why programmers would want a runtime memory allocation system. We want
- to be able to store data of different sizes;
- to be able to reuse the memory of data that’s no longer needed;
- data loaded at runtime to not overwrite the instructions or constant data already loaded;
- data loaded at runtime to not overwrite another piece of runtime data;
- reserving and releasing memory locations to not take too long, so it can be used in high-performance code.
Structured programming
One solution to the runtime memory allocation problem was born out of the structured programming movement of the 1960s, which advocated building programs using higher-level structures than those provided by CPU instructions. One such high-level structure is the now ubiquitous subroutine.
Subroutines
Subroutines1 are named sequences of instructions that can be executed from within another sequence of instructions. In various programming languages they are known as procedures, functions, and methods, depending on whether or not they return a value, or operate on an object instance, or simply on the whim of the language designer. I’ll call them subroutines when writing about them in general. When writing in the context of a specific programming language, I’ll use the name most appropriate for that language.
Calling a subroutine requires:
- jumping from one place in a program’s code to the place where the subroutine’s instructions are stored;
- optionally passing arguments to the subroutine;
- executing the subroutine’s instructions;
- returning to the place from which the subroutine was called and continuing execution from there;
- optionally making any return value from the subroutine available for use.
To accomplish that we need the jumping ability, and a place to store the address of the instruction to return to, the arguments (if any) and the return value (if any).
CPUs have instructions to jump to another address and continue execution from there2. CPUs also have registers where data like return addresses, subroutine arguments, and return values can be stored for instructions to operate on, but these are limited in number. ARM643 processors like those in Apple Silicon Macs and iPhones have 31 general purpose registers that each hold 64 bits. If we used only these registers to implement subroutine calls, and every register was used to hold a return address, we could only have a chain of 31 subroutine calls (a subroutine calling another subroutine, which calls another subroutine, etc.) with no return values and no arguments—and none of the other good things registers are used for. Clearly we need more space to hold subroutine call data, which will grow and shrink at runtime as subroutines call subroutines and return. Sounds like we need a runtime memory allocator!
The stack
There is an abstract data type that is perfectly suited to the memory allocation requirements of subroutine calling: the stack. Like a Pez dispenser, a stack supports two key operations:
- data can be pushed on the top of a stack,
- and data can be popped off the top of a stack.
So a stack enforces LIFO, or Last-In-First-Out data access. This is exactly what we want: when a subroutine calls another subroutine, we can push onto a stack—let’s name it the call stack—the data needed by the called subroutine:
- the calling subroutine’s return address;
- the values of registers the called subroutine may overwrite;
- the arguments to pass to the called subroutine;
- the local variables used by the called subroutine;
- and eventually, the called subroutine’s return value.
The stack frame
This set of data is called a stack frame. The call stack contains a stack frame for every subroutine that hasn’t yet returned.
This is a generalized description of a stack frame. Not all of those items always need to be saved to the stack. The specification of the contents of a stack frame, and the responsibilities of the calling and called subroutines for populating it, are established by the calling convention for a particular platform and differ from one platform to another. We’ll examine the specific calling convention for ARM-based Apple platforms shortly.
When a subroutine returns, we can pop the most recent frame off the stack and jump to its return address. If the called subroutine calls another subroutine, we can add a new frame to the stack. The stack will grow as subroutines are called without returning, and it will shrink as subroutines return, just as required.
Managing the stack
The instructions to manage the call stack are inserted by the compiler. It’s a pretty simple data structure: the compiler needs to keep track of only two values:
- the address of the logical top of the stack, which is the end of the most recent data pushed onto the stack, regardless of whether the stack grows toward higher or lower addresses in memory;
- the address of the slot in the current stack frame that points to the previous stack frame.
Typical calling conventions devote one register to contain the stack pointer, and another to contain the frame pointer. On ARM, the stack pointer register is named sp
, and the frame pointer register is x29
, also known as fp
.
When the data for a new frame needs to be added to the stack, the compiler moves the stack pointer away from the bottom of the stack by as many bytes as need to be stored.
Platform calling convention
Away from the bottom? What? We can make this hand-wavy language concrete: for Apple platforms running on ARM, the stack grows downward towards lower memory addresses. So if 16 bytes need to be added to the stack, the address in sp
is decremented by 16.
The platform calling convention specifies the order in which items are added to the stack, and the size of each item is known at compile time, so the compiler always knows where in relation to the stack pointer it can find, for example, the second argument to the Swift function print(_:separator:terminator:)
4.
An ARM64 example
Let’s examine the actual CPU instructions the compiler inserts to manage the stack on an ARM64 platform: macOS running on Apple Silicon. We’ll compile and link a simple C program that defines and calls a function, then disassemble the resulting executable so we can examine the CPU instructions into which the compiler translated our program. That process is called disassembly because it’s the opposite of assembling human-readable assembly language into the binary machine code that lives in the executable program file.
First, let’s discuss CPU instructions and assembly language in more detail.
CPU instructions
Reduced instruction set computers, like those implementing the ARM architecture, have a relatively simple set of general purpose instructions5 they obey. There are instructions to
- load data from memory into CPU registers;
- store the contents of registers in memory;
- perform arithmetic, logical, and bitwise operations on data in registers;
- change the flow of program execution by jumping, conditionally or unconditionally, to an instruction at a particular address.
ARM64 has 31 general purpose registers, named x0
through x30
, which each hold 64 bits or 8 bytes. It also has special purpose registers, some of which we’ll learn about shortly.
Assembly language
Instructions have a human-readable form called assembly language. For example, here is the assembly language for the instruction which subtracts 0x30 (decimal 48) from the value in the stack pointer register sp
, then saves the result back into sp
.
sub sp, sp, #0x30
The destination register is conventionally the leftmost argument, followed by one or more source register or immediate value arguments.
Instruction Encoding
But this human-readable form can’t be executed by the CPU. Every ARM64 instruction—regardless of how many bytes its assembly language representation requires—has a 4-byte binary encoding, which is what the CPU reads from memory and then executes. Some of the bits in those 4 bytes specify which instruction it is, and the rest of the bits specify what arguments it’s operating on. The sub
instruction with the arguments shown above is encoded as the 4-byte hexadecimal number 0xD100C3FF, or the 32-bit binary number 0b11010001000000001100001111111111. The first 9 bits (110100010) identify the instruction as the 64-bit version of sub
, and the remaining bits specify the three arguments.
Disassembly
So a sequence of instructions and their arguments is a sequence of 4-byte numbers, and that’s what the compiler writes into the executable file. But it would take forever to manually decode each 4-byte number we find in the executable into its assembly language form. Instead, we use a program called a disassembler to do it for us. We’ll see this in action shortly.
ARM64 Procedure Call Standard
Now we’re ready to talk about the ARM64 procedure call standard. The official documentation is nearly 60 pages, but here are the essentials.
Register roles
We’ve already talked about sp
, the register dedicated to holding the address of the “top” of the stack. In addition, the standard gives two of the general purpose registers special roles for subroutine calls.
The Link Register
x30
becomes lr
, the Link Register. It holds the address of the instruction in the calling subroutine to return to when the currently-executing subroutine—the subroutine whose frame is at the top of the stack—completes.
The Frame Pointer and Frame Record
x29
becomes fp
, the Frame Pointer. Whereas sp
points to the top of the stack, fp
points to the lowest byte of the frame record in the current stack frame. The frame record is 16 bytes that every stack frame must contain.
The upper 8 bytes of the frame record preserve the contents of lr
for the previous stack frame, i.e., the return address for the previous stack frame. This guarantees that the program will be able to return from every subroutine call on the stack.
The stack is a linked list!
The lower 8 bytes of the frame record contain a pointer to the frame record in the previous stack frame. That means we can think of the stack as a linked list of stack frames! fp
points to the “head” frame of the list, which contains a pointer to the penultimate frame, which contains a pointer to the antepenultimate frame, and so on, until we reach a frame with address 0 stored in the lower 8 bytes of its frame record, which represents the tail of the list—the original function call in the program.
Registers for subroutine arguments
Registers x0
through x7
hold the first eight arguments passed by the calling subroutine, in left-to-right order as they appear in the source code. So the first argument is passed in x0
, the second in x1
, and so on.
After processing the arguments as needed to perform its work, the called subroutine can do whatever it wants with registers x0
through x15
—it doesn’t need to preserve them for the benefit of the calling subroutine. It may, for example, need to populate registers x0
through x7
with arguments to subroutines it calls.
The return value
When the called function has completed its work, it puts its return value—if there is one, and it fits in 64 bits—into x0
. It then loads fp
and lr
with the addresses it saved on the stack for the benefit of the calling function, and pops its frame off the stack by incrementing sp
back to where it was before it was called.
ARM64 Fibonacci
Let’s see all of this in action. Here’s a C program that computes the fourth Fibonacci number recursively. I’m using C because I want the disassembled code to be as simple as possible6.
#include <stdint.h>
#include <stdio.h>
uint64_t fib(const uint64_t n) {
if (n < 2) {
return n;
} else {
return fib(n - 2) + fib(n - 1);
}
}
int main(void) {
const uint64_t result = fib(4);
printf("%llu\n", result);
}
Save this program to a file named fibonacci.c
. Compile it using Clang on macOS7.
clang -o fib -g -O0 -Wall -Wextra fibonacci.c
This command compiles fibonacci.c with the following options:
- all warnings—and even extra warnings beyond that—enabled (
-Wall -Wextra
); - optimization level zero (
-0
), i.e., no optimization; - source-level debugging enabled (
-g
), because we’ll use the debugger later; - giving the name
fib
to the output executable.
We compile with no optimization because, as we’ll see in a later post, optimization may remove some or even all of the stack-management instructions we want to study!
Run the resulting executable to see that it returns the expected fourth Fibonacci number, 3.
./fib
3
Now let’s examine the disassembly. On macOS you can use otool -xV
8 to disassemble an executable program file.
otool -xV fib
And now we can see the assembly language representation of the ARM64 instructions into which Clang translated our C code.
fib:
(__TEXT,__text) section
_fib:
0000000100003ef4 sub sp, sp, #0x30
0000000100003ef8 stp x29, x30, [sp, #0x20]
0000000100003efc add x29, sp, #0x20
0000000100003f00 str x0, [sp, #0x10]
0000000100003f04 ldr x8, [sp, #0x10]
0000000100003f08 subs x8, x8, #0x2
0000000100003f0c cset w8, hs
0000000100003f10 tbnz w8, #0x0, 0x100003f24
0000000100003f14 b 0x100003f18
0000000100003f18 ldr x8, [sp, #0x10]
0000000100003f1c stur x8, [x29, #-0x8]
0000000100003f20 b 0x100003f54
0000000100003f24 ldr x8, [sp, #0x10]
0000000100003f28 subs x0, x8, #0x2
0000000100003f2c bl _fib
0000000100003f30 str x0, [sp, #0x8]
0000000100003f34 ldr x8, [sp, #0x10]
0000000100003f38 subs x0, x8, #0x1
0000000100003f3c bl _fib
0000000100003f40 mov x8, x0
0000000100003f44 ldr x0, [sp, #0x8]
0000000100003f48 add x8, x0, x8
0000000100003f4c stur x8, [x29, #-0x8]
0000000100003f50 b 0x100003f54
0000000100003f54 ldur x0, [x29, #-0x8]
0000000100003f58 ldp x29, x30, [sp, #0x20]
0000000100003f5c add sp, sp, #0x30
0000000100003f60 ret
_main:
0000000100003f64 sub sp, sp, #0x20
0000000100003f68 stp x29, x30, [sp, #0x10]
0000000100003f6c add x29, sp, #0x10
0000000100003f70 mov x0, #0x4
0000000100003f74 bl _fib
0000000100003f78 str x0, [sp, #0x8]
0000000100003f7c ldr x8, [sp, #0x8]
0000000100003f80 mov x9, sp
0000000100003f84 str x8, [x9]
0000000100003f88 adrp x0, 0 ; 0x100003000
0000000100003f8c add x0, x0, #0xfb0 ; literal pool for: "%llu\n"
0000000100003f90 bl 0x100003fa4 ; symbol stub for: _printf
0000000100003f94 mov w0, #0x0
0000000100003f98 ldp x29, x30, [sp, #0x10]
0000000100003f9c add sp, sp, #0x20
0000000100003fa0 ret
Understanding the disassembly
Forty-four ARM64 instructions. We see that they’re divided into two parts, each labeled with somewhat familiar names.
Recall that a subroutine is a named sequence of instructions. Each label of the form _<name>:
, indicates the start of the instruction sequence for the function with that name. The prefixed _
is an ancient C linker convention intended to prevent name collisions with symbols compiled from other languages.
In disassembly we also see the address of each instruction. Deciding where instructions should be stored in memory is the responsibility of the linker and operating system, so if we were writing a program in assembly language from scratch, we wouldn’t include addresses. But since this assembly language has been disassembled from machine code processed by the linker, the addresses are known and shown. Note that these addresses are located just above 0x0000000100000000, consistent with what we learned in the previous post about where instructions and constant data live in the virtual address space.
The function prologue
Decrementing the stack pointer
Focusing on _fib
, we see that its first instruction is one we studied above:
0000000100003ef4 sub sp, sp, #0x30
It subtracts 48 from the stack pointer, the address of the “top” of the stack, which we’ve learned actually grows downward, towards lower memory addresses, on Apple ARM platforms. This effectively adds a new frame of 48 bytes to the stack. We’ll soon see that only 40 of those bytes—five 8-byte values—are needed by fib
, so why spend the extra 8 bytes? The ARM Procedure Call Standard specifies that sp
must always point to an address that is “quad-word aligned”, which is a fancy way of saying it’s evenly divisible by 16.
Populating the frame record
Next is
0000000100003ef8 stp x29, x30, [sp, #0x20]
This is one of the instructions that copies, or stores, data from registers into memory. stp
is “store pair”, which stores the contents of two registers into adjacent memory locations. In this case it’s storing the contents of x29
and x30
starting at the address given by the value in sp
plus 0x20, which is inside the new stack frame just created. Storing 16 bytes starting at 32 bytes above the stack pointer, when the frame is 48 bytes long, means that the top two 8-byte slots in the frame are now occupied. In fact, the frame record has just been populated for this stack frame!
Let’s review the purpose of the frame record.
We know from the ARM procedure call standard that x29
and x30
have special meanings: they are the frame pointer fp
and the link register lr
.
Before executing this instruction, fp
points to the calling function’s frame. fp
needs to be updated to point to this call’s frame, but first its current value needs to be saved to the stack, so it can be restored for the calling function when this call returns.
lr
holds the address to execute when this function call returns. It needs to be saved to the stack because fib()
calls another function—itself!—before it returns, and as we’ll see, calling a function overwrites lr
with the address to return to when that call returns.
Once the calling function’s frame pointer is saved, fp
is updated for this function call with
0000000100003efc add x29, sp, #0x20
This adds 0x20 or decimal 32 to the stack pointer and loads the result into x29
, a.k.a. fp
. The frame pointer register now points to the memory address that holds the calling function’s frame pointer.
Together, the three instructions we’ve just studied
0000000100003ef4 sub sp, sp, #0x30
0000000100003ef8 stp x29, x30, [sp, #0x20]
0000000100003efc add x29, sp, #0x20
are known as the function prologue. They don’t implement any of the logic we wrote in the function body, but rather some of the logic required for the function call. We can think of them as happening just before the function’s opening brace.
The function body
Next we see the first instruction generated for the function body.
0000000100003f00 str x0, [sp, #0x10] ; save first argument (n) to the stack
Recall that arguments are passed in registers x0
-x7
. fib()
has one parameter, n
, so it receives one argument in x0
. But fib()
calls itself recursively, so the value of n
for this call needs to be saved to the stack, so that n
for the next call can be loaded in x0
. str
is “store register”—similar to stp
except it stores only one register to memory.
The 8-byte slot 16 bytes above the stack pointer has now been populated.
Next n
is loaded into x8
from where it was just written to the stack.
0000000100003f04 ldr x8, [sp, #0x10] ; load n from the stack into x8
For the remainder of the function, we’ll see x8
being used to hold values temporarily during computations. And that’s just what happens next.
0000000100003f08 subs x8, x8, #0x2 ; n - 2 -> x8 and set compare flags
0000000100003f0c cset w8, hs ; set w8 to 1 if n is >= 2
0000000100003f10 tbnz w8, #0x0, 0x100003f24 ; test bit 0 of w8 and branch if not 0
0000000100003f14 b 0x100003f18 ; branch to next instruction (why?)
x8
is loaded with the result of subtracting 2 from n
. The subtraction instruction is subs
instead of the sub
we’ve seen before. It means “subtract and set the status register”. Upon completion of this instruction, condition flags have been set in the status register cpsr
indicating the result of comparing its two arguments. The condition flags are used immediately by the next instruction, cset
(“conditional set”). If n
is greater than or equal to 2 (hs
means “higher or same”), w8
is set to 1.
Wait, what is w8
? ARM64’s 64-bit general purpose registers can also be used as 32-bit registers, for example, in cases where we want to perform 32-bit instead of 64-bit arithmetic. To use 32-bit registers with instructions that support them, the register names are specified with w
instead of x
in their names. Not certain why that is done in this case, but since the next instruction, tbnz
, uses only the lowest-order bit of register 8, the highest-order 32 bits are certainly irrelevant.
tbnz
is “test bit and branch if not zero”, which branches to 0x100003f24 if bit 0 of w8
is not zero—which is the case if n >= 2. If not, execution continues at the next instruction, which is an unconditional branch to 0x100003f18. Which is just the next instruction after that, so I don’t know why a branch is used.
The instruction sequence from 0x100003f18 through 0x100003f20 handles the case where n
is less than 2 and so we simply return n
. The sequence starting at 0x100003f24 handles n
greater than or equal to 2, where we want to sum the results of fib(n - 2) and fib(n - 1). So we see that 0x100003f08 through 0x100003f14 implements the if
/ else
in
if (n < 2) {
return n;
} else {
return fib(n - 2) + fib(n - 1);
}
return n
is implemented as
0000000100003f18 ldr x8, [sp, #0x10] ; reload n from the stack into x8
0000000100003f1c stur x8, [x29, #-0x8] ; store n in return value slot of stack
0000000100003f20 b 0x100003f54 ; n is the result, so branch to epilogue
The value in x8
, which is n
, is stored on the stack in the 8 bytes below the frame pointer.
This slot in the stack is where the return value is kept before the function epilogue is executed. The epilogue is the prologue’s mirror image: code the compiler inserts to return from a subroutine. We’ll examine it shortly.
How to call a function
The rest of the instructions in fib()
implement return fib(n - 2) + fib(n - 1)
, our first chance to see how a subroutine is called.
0000000100003f24 ldr x8, [sp, #0x10] ; load n into x8
0000000100003f28 subs x0, x8, #0x2 ; n - 2 -> x0 in order to call fib(n - 2)
0000000100003f2c bl _fib ; call fib
fib(n - 2)
is called first. x8
is reloaded with the value of n
saved on the stack, then used to compute n - 2
. The result of n - 2
is loaded into x0
, the register used for the first argument to a function. fib(n - 2)
is then called using the bl
instruction, “branch with link”.
Like the b
we’ve seen already, bl
jumps to the instruction at the specified address and resumes execution there. But bl
also stores the address of the instruction after bl
into the link register lr
. So now we now how lr
was set before fib()
was called. When the called function returns, execution will resume at 0x100003f30.
As we learned from the ARM Procedure Call Standard, a function loads its return value into x0
before returning. fib(n - 1)
will be called next, which will overwrite x0
with its own return value, so the first order of business after the call to fib(n - 2)
is to store its return value to the stack.
0000000100003f30 str x0, [sp, #0x8] ; store result of fib(n - 2) on stack
We store it in the 8-byte slot above the stack pointer, so now we’ve seen the roles of all six of the 8-byte slots in this stack frame. From the highest address to the lowest:
- <- previous stack pointer
- this function call's return address (upper 8 bytes of frame record)
- pointer to calling function's frame (lower 8 bytes of frame record) <-
fp
- this function call's return value
- this function call's argument
n
- result of
fib(n - 2)
- padding for quad-word (16-byte) alignment <-
sp
Slot 0 is not part of this function call’s stack frame, but it indicates where sp
was pointing before its value was decremented to create this frame.
Next, the temporary value register x8
is reloaded with n
, and the result of n - 1
is loaded into x0
, so fib(n - 1)
can be called.
0000000100003f34 ldr x8, [sp, #0x10] ; load n into x8
0000000100003f38 subs x0, x8, #0x1 ; n - 1 -> x0 in order to call f(n - 1)
0000000100003f3c bl _fib ; call fib(n - 1)
After fib(n - 1)
returns, its result is moved from x0
to x8
. x0
is loaded with the result of fib(n - 2)
that had been saved on the stack, and then the result of adding x8
to x0
is loaded into x8
.
0000000100003f40 mov x8, x0 ; move result of fib(n - 1) into x8
0000000100003f44 ldr x0, [sp, #0x8] ; load result of fib(n - 2) from stack into x0
0000000100003f48 add x8, x0, x8 ; result fib(n - 2) + fib(n - 1) -> x8
Now x8
holds fib(n - 2) + fib(n - 1)
, exactly what we want to return from this branch of the if…else. But first it’s stored back on the stack in the slot used for this function call’s result. I’m not sure why, but I suspect this will disappear if we compile with optimizations.
0000000100003f4c stur x8, [x29, #-0x8] ; store result on stack
0000000100003f50 b 0x100003f54 ; (why?)
The function epilogue
Now we’ve reached the epilogue. Whether we arrived here from the return n
or the return fib(n - 2) + fib(n - 1)
branch, the slot just below the frame pointer holds the result to return. The epilogue loads x0
with the result, since, as we’ve seen, x0
is where the calling function expects to find the return value.
0000000100003f54 ldur x0, [x29, #-0x8] ; epilogue: load x0 with result
The rest of the epilogue reverses the work of the prologue. x29
and x30
, a.k.a. fp
and lr
, are loaded with the values saved on the stack. Then the stack pointer is incremented by the same amount it was decremented to create this call’s stack frame. That effectively pops this frame off the stack, freeing the 48 bytes it occupied for use by some future function call.
0000000100003f58 ldp x29, x30, [sp, #0x20] ; restore previous fp and lr
0000000100003f5c add sp, sp, #0x30 ; pop this function's frame off the stack
Finally, we execute ret
, which jumps back to the instruction lr
points to, which is the instruction after the bl
that led us here.
0000000100003f60 ret ; return to instruction whose address is in lr
This call of fib()
has returned!
Multiple stack frames
We’ve now seen how the stack frame is managed for one call of fib()
. Next, let’s examine the stack when it contains multiple frames.
The implementation of fib()
calls itself recursively, with the termination condition that fib(0) = 0
and fib(1) = 1
by definition. The fib
program’s main()
function simply prints the result of fib(4)
. Here’s the call graph for fib(4)
. Each node also shows the result of the call for that value of n
.
We can see that the maximum number of fib()
calls that will be on the stack is four. A quick way to get to that point in the program is to set a breakpoint in fib()
for the second time n == 0
(we know from the disassembly that fib()
calls fib(n - 2)
first, then fib(n - 1)
). The call stack we’ll examine is highlighted in blue.
Viewing stack memory in the debugger
Load fib
into the debugger.
lldb fib
(lldb) target create "fib"
Current executable set to '/Users/tt/src/fib' (arm64).
Set a breakpoint just before we return from fib()
that will trigger the second time n == 0
.
(lldb) b --line 10 --condition "n == 0" --ignore-count 1
Breakpoint 1: where = fib`fib + 96 at fibonacci.c:10:1, address = 0x0000000100003f54
Line 10 is the closing brace of the C source of fib()
. Note that the address of the instruction the debugger will stop on is 0x0000000100003f54, which we now know is the first instruction of the prologue!
Run the program.
(lldb) run
Process 27184 launched: '/Users/tt/src/mac-os-x-internals/fib' (arm64)
Process 27184 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x0000000100003f54 fib`fib(n=0) at fibonacci.c:10:1
7 } else {
8 return fib(n - 2) + fib(n - 1);
9 }
-> 10 }
11
12 int main(void) {
13 const uint64_t result = fib(4);
Target 0: (fib) stopped.
As requested, the program is stopped just before returning from fib()
. Let’s see if the frames on the stack match what we expect from the call graph:
- fib(0) <-
sp
- fib(2)
- fib(3)
- fib(4)
- main()
We can apply our knowledge of the ARM procedure call standard to the prologue and first few instructions of main()
.
_main:
0000000100003f64 sub sp, sp, #0x20 ; push a frame of 32 bytes onto the stack
0000000100003f68 stp x29, x30, [sp, #0x10] ; create the frame record in the upper 16 bytes
0000000100003f6c add x29, sp, #0x10 ; point sp at this frame's record
0000000100003f70 mov x0, #0x4 ; prepare to call fib(4)
0000000100003f74 bl _fib ; call fib(4)
0000000100003f78 str x0, [sp, #0x8] ; save result of fib(4) to the stack
We see main()
’s stack frame is 32 bytes: the two 8-byte addresses of the frame record, and at least one more 8-byte value for the result of fib(4)
. We don’t yet know what the remaining 8 bytes are used for. We know that fib()
’s stack frames contain six 8-byte values, so to see all five frames we want to examine, we should print 28 8-byte values starting from the address in sp
. fp
will be useful to know as well.
(lldb) reg read sp fp
sp = 0x000000016fdff160
fp = 0x000000016fdff180
(lldb) memory read --size 8 --num-per-line 1 --count 28 --format x 0x000000016fdff160
0x16fdff160: 0x0000000000000000
0x16fdff168: 0x0000000000000000
0x16fdff170: 0x0000000000000000
0x16fdff178: 0x0000000000000000
0x16fdff180: 0x000000016fdff1b0
0x16fdff188: 0x0000000100003f30
0x16fdff190: 0x0000000000000000
0x16fdff198: 0x0000000000000000
0x16fdff1a0: 0x0000000000000002
0x16fdff1a8: 0x0000000000000001
0x16fdff1b0: 0x000000016fdff1e0
0x16fdff1b8: 0x0000000100003f40
0x16fdff1c0: 0x0000000000000000
0x16fdff1c8: 0x0000000000000001
0x16fdff1d0: 0x0000000000000003
0x16fdff1d8: 0x0000000000000001
0x16fdff1e0: 0x000000016fdff210
0x16fdff1e8: 0x0000000100003f40
0x16fdff1f0: 0x0000000000000000
0x16fdff1f8: 0x0000000000000001
0x16fdff200: 0x0000000000000004
0x16fdff208: 0x000000019d42a366
0x16fdff210: 0x000000016fdff230
0x16fdff218: 0x0000000100003f78
0x16fdff220: 0x000000010000c000
0x16fdff228: 0x0000000100003f64
0x16fdff230: 0x000000016fdff4a0
0x16fdff238: 0x000000019d3aff28
Hmmm. lldb
has printed the addresses increasing downward, while we’ve been looking at diagrams with the addresses increasing upward. But this matches how we typically see call stacks in a debugger, with the most recent call on top. This output is like flipping the stack diagrams we’ve been studying upside-down. Here’s the memory dump again with helpful formatting and comments. Keep in mind the program is just about to return from the call to fib(0)
on the right side of the call graph.
------------fib(0)-------------
0x16fdff160: 0x0000000000000000 ; padding for quad-word (16-byte) alignment <- SP
0x16fdff168: 0x0000000000000000 ; result of fib(n - 2), not called when n < 2
0x16fdff170: 0x0000000000000000 ; value of n. This is fib(0)
0x16fdff178: 0x0000000000000000 ; result of this call
0x16fdff180: 0x000000016fdff1b0 ; pointer to frame of caller, fib(2) <- FP
0x16fdff188: 0x0000000100003f30 ; return address, just after call to fib(n - 2) in fib(2)
------------fib(2)-------------
0x16fdff190: 0x0000000000000000 ; padding for quad-word (16-byte) alignment
0x16fdff198: 0x0000000000000000 ; result of fib(n - 2), not set until fib(0) returns
0x16fdff1a0: 0x0000000000000002 ; value of n. This is fib(2)
0x16fdff1a8: 0x0000000000000001 ; result of this call, not yet set
0x16fdff1b0: 0x000000016fdff1e0 ; pointer to frame of caller, fib(3)
0x16fdff1b8: 0x0000000100003f40 ; return address, just after call to fib(n - 1) in fib(3)
------------fib(3)-------------
0x16fdff1c0: 0x0000000000000000 ; padding for quad-word (16-byte) alignment
0x16fdff1c8: 0x0000000000000001 ; result of fib(n - 2) == fib(1)
0x16fdff1d0: 0x0000000000000003 ; value of n. This is fib(3)
0x16fdff1d8: 0x0000000000000001 ; result of this call, not yet set
0x16fdff1e0: 0x000000016fdff210 ; pointer to frame of caller, fib(4)
0x16fdff1e8: 0x0000000100003f40 ; return address, just after call to fib(n - 1) in fib(4)
------------fib(4)-------------
0x16fdff1f0: 0x0000000000000000 ; padding for quad-word (16-byte) alignment
0x16fdff1f8: 0x0000000000000001 ; result of fib(n - 2) == fib(2)
0x16fdff200: 0x0000000000000004 ; value of n. This is fib(4)
0x16fdff208: 0x000000019d42a366 ; result of this call, not yet set
0x16fdff210: 0x000000016fdff230 ; pointer to frame of caller, main()
0x16fdff218: 0x0000000100003f78 ; return address, just after call to fib(4) in main()
------------main()-------------
0x16fdff220: 0x000000010000c000 ; study the rest of main() to find out
0x16fdff228: 0x0000000100003f64 ; result of fib(4), not yet set
0x16fdff230: 0x000000016fdff4a0 ; pointer to frame of unknown caller
0x16fdff238: 0x000000019d3aff28 ; return address in unknown caller
All is as we expect, with a little bit of mystery about who called main! We can find out by asking lldb to print a stack trace.
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
* frame #0: 0x0000000100003f54 fib`fib(n=0) at fibonacci.c:10:1
frame #1: 0x0000000100003f30 fib`fib(n=2) at fibonacci.c:8:10
frame #2: 0x0000000100003f40 fib`fib(n=3) at fibonacci.c:8:23
frame #3: 0x0000000100003f40 fib`fib(n=4) at fibonacci.c:8:23
frame #4: 0x0000000100003f78 fib`main at fibonacci.c:13:26
frame #5: 0x000000019d3aff28 dyld`start + 2236
This is a compact representation of everything we learned from examining the stack in memory, including the value of n
for each call, and the exact character in the source representing each return address. We also see that the mystery caller of main()
is start()
in some binary named dyld
. dyld
’s instructions are loaded into much higher memory addresses than fib
’s instructions.
Location and size of the stack in memory
We see that the stack itself is also located higher in memory than fib
’s instructions. The top of the stack is at 0x16fdff160, according to sp
, and the upper 8 bytes of main()
’s frame record are at 0x16fdff238. According to thread backtrace
there’s only one other frame on the stack, the call to dyld
’s start()
. We don’t know the exact size of start()
’s stack frame, but it’s at least 16 bytes for its frame record. So at this point in program execution the stack consumes at least 30 ✕ 8 or 240 bytes of our program’s memory.
Recall that on Apple’s ARM platforms, the stack grows downward in memory. It can’t grow forever though, because we know from our study of the virtual address space that fib
’s instructions are loaded just above 0x000000100000000 in a read-only segment. How large might the stack get? Does the operating system limit its size? What happens if it gets too big?
Join me in the next post to learn more about the limitations of the stack, and why there is another, entirely different runtime memory allocation system—the heap!
-
The “sub” in the name reflects that a subroutine is always subordinate to its caller—it has only one entrance and it runs to completion before returning to its caller. A subroutine can’t invert its relationship to its caller by calling back into it. Routines that can call back and forth into each other from anywhere in their sequence of instructions are called coroutines.
async
functions in Swift are coroutines! ↩ -
In fact, structured programming was developed to improve the comprehensibility of assembly language programs that used jump instructions in an arbitrary fashion, rather than under the control of structures like subroutines. ↩
-
ARM is rather fussy about terminology, and “ARM64” isn’t a name they use officially. ARM defines an architecture which specifies instruction sets, register sets, exception models, and memory models, among other details, for different usage profiles. The A (Applications) profile, for example, is for running full-featured operating systems like iOS or Linux. ARMv8-A was the first version of the ARM A-profile architecture to introduce 64-bit addressing and registers, a feature set known collectively as AArch64. The instruction set of AArch64 is named A64. I think AArch64 is too obscure, and A64 is too vague, to use in an introductory discussion. “arm64” is widely used in Apple’s toolchain to mean ARM’s 64-bit architecture; I think capitalizing the letters makes that name a bit more formal while retaining its recognizability, so I’ll use “ARM64” to label any and all parts of ARM’s 64-bit architecture in this series of posts. ↩
-
If you’re wondering how the size of the first argument can be known at compile time, stay tuned for a future post in this series! ↩
-
Granted, if you consider special-purpose instructions, like those for SIMD, it’s less simple. ↩
-
The safety and convenience features of a modern compiled language like Swift or Rust are implemented in code generated by the compiler, which is visible in the disassembly of even very simple programs. This code is important and valuable, but could be a distraction when first learning to read assembly language. ↩
-
I used macOS 13.3.1 and Xcode 14.3. ↩
-
-x
tellsotool
to display the assembled contents of every__text
section found in the specified file. Alternatively,-t
displays the contents of only__TEXT.__text
. In either case,-V
, which in our usage is combined with-x
into a single-xV
argument, performs disassembly on the contents read from the file. ↩