Optimization guide

Clang is a modern, optimizing C compiler, and LLVM-MOS's fork is no exception. Such a compiler is a powerful tool, but the 6502 is very much an embedded platform by modern standards, and it takes some strategy to get the most out of a modern compiler on embedded platforms. Some of the standard cc65 guidance can prevent getting the best results of LLVM-MOS, but writing totally naive cross-platform C doesn't necessarily give very good results either.

The purpose of this guide is to help the reader build a simple, but accurate, mental model of how Clang and LLVM-MOS work and how to help them generate better code.

Optimizing Compilers[edit | edit source]

If they had optimization at all, older compilers tended to exclusively use "peephole optimization". They would emit a fixed assembly pattern for each C construct. Afterwards, they would run optimization passes over the results later, looking for known poor assembly patterns and replacing them with better ones.

You can get surprisingly far this way, but fairly quickly you run into a local optimum. Eventually, there won't be a small, simple, local transformation that can improve the assembly, but the result is still much worse than a human would write. You'd need advanced, global knowledge to produce better assembly, and you might have to make it worse before you make it better.

Modern compilers generally don't work like this; instead, they gradually massage the C a bit at a time. LLVM does this in a huge (100+) number of passes. Each pass either analyzes the code or transforms it. Some passes are simple peephole passes. Others gradually make bits of the program less abstract and more concrete. These may decide where to put values, how to lay out the stack, or how to copy values from one register to another. As the passes continue, the in-memory representation of the program gradually comes to resemble 6502 assembly more and more, until it maps to real assembly instructions one to one. At that point, the result is serialized to machine code and sent onward to the linker.

The key thing to take from this section is that LLVM essentially picks your program apart and reassembles it by the time it reaches assembly. In particular, things like variables and temporaries and expressions get almost entirely lost in the soup; eventually, all the compiler cares about is a graph of values, some depending on others for their computation, where they'll be put, what they need, and where they'll end up.

Rules[edit | edit source]

Here's a succinct set of rules to follow to get good code with Clang/LLVM-MOS:

Use -Os and -flto. These are the defaults for all targets, but they aren't automatically applied when compiling code snippets on the command line. LLVM-MOS seems to produce much better code overall when optimizing for size than for speed, and LTO enables a whole host of whole-program optimizations vital to getting good 6502 codegen.
Use -fnonreentrant if you have function pointers. The compiler automatically tries to determine whether or not recursion is occurring, and it does an excellent job, except in the presence of function pointers. This is a limitation we'd like to remove, but in the meantime, you can use the -fnonreentrant compiler flag or the __attribute__((nonreentrant)) function attribute to declare that a function is known never to have more than one instance active at a time. Without this, the compiler conservatively falls back to placing local variables on a soft stack, much like cc65 (although not nearly as poorly).
If not using -fnonreentrant, make sure to annotate your inline assembly and assembly function declarations in C with __attribute__((leaf)). Otherwise, the compiler must pessimize what those calls could in turn call, which tends to also make it fall back to using a soft stack for the affected functions.
Don't use global variables just for performance's sake. These are usually lowered to loads and stores to fixed locations (although sometimes the optimizer can see through them). On the other hand, a local variable or expression may live in completely different locations as the program progresses. For example, a variable may move into A to have some arithmetic done on it, and the result move to X for some indexing. Placing the result in a global could make it visible to another function though, so the compiler has a harder time avoiding the store.
Don't use non-static function local constant arrays. Even if the values of these arrays are constant, the function might be called recursively, and the C standard requires that the arrays in different invocations of the function have different pointers. The compiler guarantees this by making a per-invocation copy of the array's initializer. While LLVM-MOS can currently elide the stack pointer use, it cannot yet elide the copy. It's easy to prevent though: just make the array static or global.
Don't use the full results of wide integer calculations. It's often fine to compute values in wide types, so long as you only use part of the result. The compiler is usually pretty good at trimming down the computation. But, if you store a 64-bit sum somewhere in global memory, well, the compiler will probably give you what you've asked for.
Don't reuse the same temporaries everywhere as an optimization. LLVM's register allocator is pretty darn good at assigning temporary values to memory locations and registers; there's usually no need to manually play register allocator. Pre-determining what values are going to share the same memory locations can also reduce the compiler's ability to do so, so you may get worse code for your effort, too.
Use structs of arrays, not arrays of structs. We'd like to have a transformation from one to the other, but until we do, the usual 6502 C guidance applies here.
Prefer array indices of <256. LLVM-MOS can select the full gamut of addressing modes, but it's not going to completely rearrange your data structures for you. It's up to you to size things such that it's possible to use them.