Benchmarking ticktrace vs pico-sdk
This is how we put numbers on "every cycle matters". The benchmark
suite lives in benchmarks/, builds on both sides with the same shape,
and emits parseable BENCH … lines so a host script can tabulate
results without any per-bench glue.
Layout
benchmarks/
  rp_asm/
    bench_lib.S             shared output / init helpers
    bench_minimum.S         smallest blinker
    bench_gpio_toggle.S     100k× gpio_toggle, DWT cycles
    bench_sha256_64k.S      hash 64 KiB
    bench_dma_memcpy.S      DMA vs CPU 16 KiB copy
    bench_irq_latency.S     TIMER alarm -> ISR delta, 32 samples
  pico_sdk/
    README.md               build recipe with pico-sdk
    bench_*/main.c          paired C source
    bench_*/CMakeLists.txt  minimal pico-sdk build glue
  run.sh                    flash + capture BENCH lines from UART0
  results.md                running scoreboard with observed numbers
Build (ticktrace side)
make bench # build every benchmarks/rp_asm/bench_*.uf2
make bench-sizes # print image-size table (no flash needed)
The bench target follows the same auto-discovery pattern already used
for examples/*.S, and additionally links each bench against
benchmarks/rp_asm/bench_lib.S for the shared bench_emit /
bench_standard_init helpers.
Build (pico-sdk side)
benchmarks/pico_sdk/README.md has the full recipe. TL;DR:
export PICO_SDK_PATH=~/pico-sdk
export PICO_BOARD=pico2
cd benchmarks/pico_sdk/bench_gpio_toggle
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j
Every C bench emits the exact same BENCH name=… metric=… value=0x…
format that the ticktrace side does, so benchmarks/run.sh is
SDK-agnostic.
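Because both sides emit the same line shape, a host-side consumer only needs one parser. The following is a minimal sketch of such a parser in C, not the logic of run.sh itself; the bench_record type and bench_parse_line function are hypothetical names introduced for illustration, and the 31-character limit on names is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record for one parsed BENCH line (not part of the suite). */
typedef struct {
    char name[32];
    char metric[32];
    uint32_t value;
} bench_record;

/* Parse "BENCH name=<bench> metric=<metric> value=0x<hex digits>".
 * Returns 1 on success, 0 if the line does not match the format.
 * Assumption: names and metrics contain no spaces and fit in 31 chars. */
int bench_parse_line(const char *line, bench_record *out)
{
    unsigned int value = 0;
    int fields = sscanf(line, "BENCH name=%31s metric=%31s value=0x%8x",
                        out->name, out->metric, &value);
    if (fields != 3)
        return 0;
    out->value = (uint32_t)value;
    return 1;
}
```

Feeding it the gpio_toggle cycles_total line from a capture yields the name, the metric, and the value as a plain integer, regardless of which side produced the line.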
Run on hardware
benchmarks/run.sh build/bench_gpio_toggle.uf2 /dev/ttyACM0 5
# BENCH name=gpio_toggle metric=cycles_total value=0x000c350f
# BENCH name=gpio_toggle metric=cycles_per_iter_x256 value=0x000007ff
Then flip to the pico-sdk side, repeat, and diff.
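The cycles_per_iter_x256 metric in the sample output is a fixed-point value. Assuming the _x256 suffix means "true ratio scaled by 256" (8 fractional bits), it can be decoded with integer arithmetic only. The helper names below are hypothetical and exist only for this sketch.

```c
#include <stdint.h>

/* Assumption: an "_x256" metric is the true ratio scaled by 256,
 * i.e. unsigned fixed point with 8 fractional bits. */

/* Integer part of the ratio. */
uint32_t x256_whole(uint32_t v) { return v >> 8; }

/* Fractional part, in 1/256 units. */
uint32_t x256_frac(uint32_t v) { return v & 0xffu; }

/* Ratio in thousandths, truncated; 64-bit intermediate avoids overflow. */
uint32_t x256_to_milli(uint32_t v)
{
    return (uint32_t)(((uint64_t)v * 1000u) >> 8);
}
```

Under that assumption, 0x000007ff decodes to 7 whole cycles plus 255/256, or 7.996 cycles per iteration when printed in thousandths.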
Methodology
Output format
One line per measurement:
BENCH name=<bench> metric=<metric> value=0x<8 hex digits>
Hex keeps the print path to ~60 bytes of code on the ticktrace side
(no divide-by-10 decimal formatting routines). The value field is always a
32-bit unsigned cycle count or fixed-point ratio, never a float;
floats would tilt the comparison against pico-sdk for no good
reason.
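To illustrate why hex is cheap, here is a sketch of the line format built with shifts and masks only, with no division and no printf. It is written in C for readability rather than in the suite's assembly, and hex32 and bench_format are hypothetical names, not the actual bench_emit helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Write exactly eight lowercase hex digits of value into dst (no NUL).
 * Each digit is one shift, one mask, and one table lookup. */
void hex32(char *dst, uint32_t value)
{
    static const char digits[] = "0123456789abcdef";
    for (int i = 0; i < 8; i++) {
        dst[i] = digits[(value >> 28) & 0xfu];
        value <<= 4;
    }
}

/* Copy a NUL-terminated string to dst; return the position after it. */
static char *append(char *dst, const char *src)
{
    while (*src)
        *dst++ = *src++;
    return dst;
}

/* Build one BENCH line in buf (caller provides enough space).
 * Returns the length of the line, excluding the terminating NUL. */
size_t bench_format(char *buf, const char *name, const char *metric,
                    uint32_t value)
{
    char *p = buf;
    p = append(p, "BENCH name=");
    p = append(p, name);
    p = append(p, " metric=");
    p = append(p, metric);
    p = append(p, " value=0x");
    hex32(p, value);
    p += 8;
    *p = '\0';
    return (size_t)(p - buf);
}
```

A decimal formatter would need a repeated divide-by-10 loop (or a multiply-and-shift replacement for it) in the same place where hex32 uses a 4-bit shift.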
Fairness rules
| Rule | Why |
|---|---|
| Same hardware: Pico 2 (RP2350-A2) | Eliminates silicon-rev differences |
| Same clock: clk_sys = 150 MHz | Both M2 (pll_sys_150_mhz) and pico-sdk's set_sys_clock_khz(150000, true) |
| pico-sdk release build, -O3 -flto -DNDEBUG | Debug builds add assert() calls, ~3× cycles per public API |
| Direct DWT->CYCCNT read on both sides | Avoids the ~30-cycle time_us_32 overhead |
| Steady-state only for throughput numbers | Init costs amortise to nothing on long runs |
| ISR bodies marked __not_in_flash_func() (C side) | Keeps the C bench out of XIP wait states for parity with the ticktrace side which is SRAM-resident |
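
The direct DWT->CYCCNT rule can be sketched as follows. This is an illustration under stated assumptions, not the suite's code: cycles_elapsed, cyccnt_enable, and cyccnt_read are hypothetical names, and the register addresses are the architectural Cortex-M debug register locations. The on-target part is only compiled for Armv8-M Mainline, so the delta helper can still be checked on a host.

```c
#include <stdint.h>

/* Elapsed cycles between two CYCCNT samples. Unsigned subtraction stays
 * correct across a single 32-bit wrap of the counter. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

#if defined(__ARM_ARCH_8M_MAIN__)
/* On-target only: architectural Cortex-M debug registers. */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

static inline void cyccnt_enable(void)
{
    DEMCR |= 1u << 24;    /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0;       /* start counting from zero */
    DWT_CTRL |= 1u << 0;  /* CYCCNTENA: start the cycle counter */
}

static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}
#endif
```

At 150 MHz a 32-bit cycle counter wraps roughly every 28.6 seconds, so a single measured region should stay well under that for the subtraction to be meaningful.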
What we expect to win, where we expect to lose
| Category | Likely winner | Why |
|---|---|---|
| Minimum image size | ticktrace | No crt0, no stdio, no _init_array machinery |
| Boot to first user code | ticktrace | No .data copy, no BSS zero, no stdio init |
| Per-call leaf cycles | tie or ticktrace | Both end up at the same STR eventually; SDK pays wrapper overhead in non-LTO builds |
| Throughput (SHA, DMA) | tie | Both bottleneck on the peripheral, not the CPU |
| IRQ entry latency | small ticktrace | SDK's __isr wrapper adds a few cycles |
| Toolchain wall-clock | ticktrace by 30–100× | No C compiler, no CMake configure step |
Where we will lose (and shouldn't pretend otherwise)
- Portability. pico-sdk runs on RP2040 and RP2350, and on every Pico variant. We target Pico 2 + RP2350-Arm-Secure specifically.
- Ecosystem. TinyUSB, lwIP, Bluetooth, WiFi, RTOSes all integrate with pico-sdk out of the box. We have UART, USB CDC-ACM echo, and vibes.
- Asserts / debug builds. pico-sdk's debug builds catch a category of bugs (bad pin number, peripheral not in reset, etc.) that we cheerfully don't. That's a deliberate trade-off, not a win.
Caveats
- Image-size numbers will look more flattering to ticktrace than reality because pico-sdk pulls in stdio whether you use it or not. A fairer version of bench_minimum would strip stdio on the SDK side; we don't, because "I want printf" is the default expectation.
- Some metrics (boot time, IRQ entry-to-pin-toggle) require a logic analyser to measure to nanosecond precision. The DWT-based versions are an in-band approximation that compares apples-to-apples if you don't have one.
- Wall-clock build time depends heavily on your machine and pico-sdk caching state. We report cold-cache numbers.
What's in the suite today
| Bench | ticktrace .text (bytes) | What it measures |
|---|---|---|
| bench_minimum | 224 | Image size of "blink GP25, nothing else" |
| bench_gpio_toggle | 1092 | 100k × gpio_toggle(25), DWT cycles |
| bench_dma_memcpy | 1176 | 16 KiB SRAM→SRAM via DMA + CPU loop |
| bench_irq_latency | 1432 | TIMER0 alarm → ISR delta, 32 samples |
| bench_sha256_64k | 66880 (64 KiB payload) | SHA-256 of 64 KiB, cycles + MB/s×100 |

bench_sha256_64k's "text" is dominated by the 64 KiB .rodata
payload; the actual code is ~600 bytes.
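
The MB/s×100 metric for bench_sha256_64k can be derived from a cycle count without floats. The sketch below assumes MB means 10^6 bytes and that the clock is passed in explicitly; the suite's own helper may round or scale differently, and mbps_x100 is a hypothetical name.

```c
#include <stdint.h>

/* Throughput in MB/s scaled by 100, using integer arithmetic only.
 * Assumption: 1 MB = 1,000,000 bytes. Returns 0 if cycles is 0.
 *
 *   seconds  = cycles / clk_hz
 *   MB/s     = bytes / 1e6 / seconds
 *   MB/s×100 = bytes * clk_hz * 100 / (cycles * 1e6)
 */
uint32_t mbps_x100(uint32_t bytes, uint32_t cycles, uint32_t clk_hz)
{
    if (cycles == 0)
        return 0;
    /* 64-bit intermediates: 65536 * 150e6 * 100 does not fit in 32 bits. */
    uint64_t num = (uint64_t)bytes * clk_hz * 100u;
    uint64_t den = (uint64_t)cycles * 1000000u;
    return (uint32_t)(num / den);
}
```

As a worked instance, hashing 65536 bytes in 1,500,000 cycles at 150 MHz takes 10 ms, which is 6.5536 MB/s and would be reported as 655 after truncation.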
Roadmap
- Capture observed numbers on real Pico 2 and fill in benchmarks/results.md.
- Add bench_uart_tx_throughput (push 16 KiB out UART0 at 1 Mbps, report wall-clock cycles).
- Add bench_pwm_update_rate (change PWM duty in a tight loop).
- Add bench_boot_to_main with a GPIO pulse from _reset to the first user instruction for logic-analyser capture.
- make compare target that runs both SDKs back-to-back and emits a markdown diff table.