Turmeric compiles to optimised C99 (release builds use -O2 by default). This
means its performance ceiling is close to hand-written C, but the patterns you
choose matter. This guide covers the major performance dimensions -- numerical
computation, data structures, string processing, concurrency, memory and GC,
recursion, and I/O -- and finishes with a methodology section for writing and
interpreting benchmarks.
Always benchmark release builds:
just release # builds build-rel/tur
The debug build (just build) inserts contract checks and disables
optimisations; its timing numbers are not meaningful for comparison.
Arithmetic on int and float compiles to the corresponding C types.
There is no boxing overhead for scalars declared with concrete types:
(defn square [x] :int (* x x))
(defn hyp [a b] :float
(sqrt (+ (* a a) (* b b))))
Avoid leaving numeric expressions untyped in hot loops -- the elaborator may widen to a tagged value when it cannot infer a concrete numeric type.
Iterative is faster for large N because it avoids stack growth:
; iterative -- O(n) time, O(1) space
(defn fib-iter [n] :int
(let [loop (fn [i a b] :int
(if (= i 0)
a
(loop (- i 1) b (+ a b))))]
(loop n 0 1)))
; recursive -- O(2^n) time, avoid for n > ~30
(defn fib-rec [n] :int
(if (< n 2)
n
(+ (fib-rec (- n 1)) (fib-rec (- n 2)))))
The Sieve of Eratosthenes benefits from a vec (growable array backed by a C
malloced block) rather than a linked list:
(import "stdlib/vec.tur")
(defn sieve [limit] :vec
(let [flags (vec/make limit true)]
(vec/set! flags 0 false)
(vec/set! flags 1 false)
(let [loop (fn [i] :void
(when (<= (* i i) limit)
(when (vec/get flags i)
(let [inner (fn [j] :void
(when (<= j limit)
(vec/set! flags j false)
(inner (+ j i))))]
(inner (* i i))))
(loop (+ i 1))))]
(loop 2))
flags))
Use rand (from stdlib/rand.tur) rather than importing <stdlib.h> via
inline-C so the compiler can see through calls and optimise the loop:
(import "stdlib/rand.tur")
(defn estimate-pi [samples] :float
(let [loop (fn [i inside] :float
(if (= i 0)
(* 4.0 (/ (int->float inside) (int->float samples)))
(let [x (rand/float)
y (rand/float)]
(loop (- i 1)
(if (<= (+ (* x x) (* y y)) 1.0)
(+ inside 1)
inside)))))]
(loop samples 0)))
Use list / cons for functional transformations where sharing is important.
Use vec for index-heavy access and mutation-in-place:
| Operation | list |
vec |
|---|---|---|
| Prepend | O(1) | O(n) amortised |
| Random read | O(n) | O(1) |
| Append (single) | O(n) | O(1) amortised |
| Memory per element | 2 words (pointer + tag) | 1 word |
stdlib/hamt.tur (the persistent hash-array-mapped trie) is the standard map.
For hot paths that need mutable semantics, combine hamt with ref:
(import "stdlib/hamt.tur")
(import "stdlib/ref.tur")
(defn freq-count [words] :hamt
(let [m (ref (hamt/empty))]
(list/for-each words
(fn [w] :void
(ref/update! m
(fn [h] :hamt
(hamt/insert h w (+ 1 (hamt/get-or h w 0)))))))
(ref/get m)))
HAMT operations are O(log₃₂ n) in practice -- fast for lookups but slower
than a mutable hash table for write-heavy workloads. If you need the latter,
reach for an inline-C wrapper around uthash or a fixed-size open-addressing
table.
stdlib/vec.tur exposes vec/sort! (in-place quicksort) and list/sort
(merge sort returning a new list):
(import "stdlib/vec.tur")
; in-place, cache-friendly
(let [v (vec/of 5 3 8 1 9 2)]
(vec/sort! v <)
v)
str/concat allocates a new buffer each call. For building large strings from
many pieces, use str/builder:
(import "stdlib/str.tur")
(defn join [sep parts] :str
(let [b (str/builder)]
(list/for-each-indexed parts
(fn [i s] :void
(when (> i 0)
(str/builder/append! b sep))
(str/builder/append! b s)))
(str/builder/finish b)))
For simple substring search, str/index-of wraps strstr and is O(n·m) in
the worst case. For repeated pattern matching over a corpus, compile the
pattern once and reuse the handle -- the search itself is then O(n):
(import "stdlib/regex.tur")
(defn count-matches [pattern text] :int
(let [re (regex/compile pattern)]
(regex/count re text)))
str/view over copyingstr/view is a non-owning slice into an existing string. Use it when you only
need to inspect a substring -- no allocation, no copy:
(let [s "hello world"]
(str/view s 6 11)) ; => "world", zero-copy
Threads are spawned with spawn (from stdlib/concurrency.tur):
(import "stdlib/concurrency.tur")
(defn parallel-map [f xs] :list
(let [handles (list/map xs (fn [x] :handle (spawn (fn [] :any (f x)))))]
(list/map handles (fn [h] :any (await h)))))
Each spawn creates a POSIX thread. Thread creation overhead is several
microseconds; avoid spawning threads inside tight loops.
Use chan for lock-free message passing between threads:
(import "stdlib/chan.tur")
(defn pipeline [producer-fn consumer-fn n] :void
(let [c (chan/make 64)]
(spawn (fn [] :void (producer-fn c)))
(spawn (fn [] :void (consumer-fn c)))
(chan/close c)))
Channel sends block when the buffer is full; size the buffer to amortise context-switch cost for your throughput target.
defvar / with-var (from stdlib/dynamic-vars.tur) provide thread-local
dynamic bindings. Reads have no locking overhead -- they index directly into a
thread-local slot:
(import "stdlib/dynamic-vars.tur")
(defvar *indent* 0)
(defn with-indent [body] :any
(with-var [*indent* (+ *indent* 2)] (body)))
Turmeric uses a precise, tracing garbage collector. Short-lived objects are cheap to allocate but incur GC work proportional to their number. For allocation-heavy workloads:
int, float, bool) -- they
never hit the GC.vec buffers with vec/clear! instead of allocating a fresh vec each
iteration.ref sparingly; each ref is a heap cell.The simplest way to measure GC impact is to compare a version that allocates aggressively against one that reuses buffers:
; high churn -- allocates a new list each iteration
(defn churn [n] :void
(let [loop (fn [i] :void
(when (> i 0)
(let [_ (list/range 0 1000)]
(loop (- i 1)))))]
(loop n)))
; low churn -- reuses a vec
(defn no-churn [n] :void
(let [v (vec/make 1000 0)]
(let [loop (fn [i] :void
(when (> i 0)
(vec/fill! v 0)
(loop (- i 1))))]
(loop n))))
The compiler performs tail-call optimisation (TCO) on self-tail calls. A
function whose recursive call is in tail position compiles to a C goto loop
and uses O(1) stack:
; tail-recursive -- safe for any n
(defn factorial [n acc] :int
(if (= n 0)
acc
(factorial (- n 1) (* n acc))))
(factorial 1000000 1)
Non-tail recursion is not optimised and will overflow the stack for deep
inputs. When you cannot restructure to tail position, use an explicit stack
(a list or vec as a workaround).
Mutually recursive functions are currently not tail-call optimised across the function boundary. Rewrite mutual recursion as a single function with a discriminant, or use a trampoline:
(import "stdlib/trampoline.tur")
(defn even? [n] :thunk
(if (= n 0)
(done true)
(bounce (fn [] :thunk (odd? (- n 1))))))
(defn odd? [n] :thunk
(if (= n 0)
(done false)
(bounce (fn [] :thunk (even? (- n 1))))))
(trampoline/run (even? 100000)) ; => true, O(1) stack
stdlib/io.tur wraps fread/fwrite. For sequential reads of large files,
use a buffer size that matches your OS page size (typically 4096 or 65536
bytes):
(import "stdlib/io.tur")
(defn read-file [path] :str
(let [f (io/open path "rb")]
(let [s (io/read-all f)]
(io/close f)
s)))
io/read-all reads in 65536-byte chunks internally; do not loop over
io/read-byte for large files -- the per-call overhead dominates.
Writes are buffered by the C FILE* layer. Flush explicitly only when
durability is required:
(defn write-lines [path lines] :void
(let [f (io/open path "wb")]
(list/for-each lines (fn [line] :void
(io/write f line)
(io/write f "\n")))
(io/close f))) ; flush happens here
Calling io/flush inside the loop adds a syscall per line; omit it unless you
need crash-safety.
Float-heavy simulations benefit from struct-of-arrays layout when possible.
Turmeric structs are currently records (array-of-structs), so prefer
separating coordinate vectors into dedicated vecs if profiling reveals cache
pressure:
(defstruct body [x :float y :float z :float
vx :float vy :float vz :float
mass :float])
Ray-box intersection and dot products are the hot paths. Annotate return types
concretely (float) so the elaborator does not insert tag checks in the inner
loop:
(defn dot [ax ay az bx by bz] :float
(+ (* ax bx) (+ (* ay by) (* az bz))))
Each benchmark file should follow this template so that the automated runner
(scripts/run_all.sh) can collect results consistently:
(import "stdlib/time.tur")
(import "stdlib/args.tur")
(defn benchmark [n] :void
; ... work under test ...
)
(defn main [] :void
(let [n (args/parse-int (args/get 1) 1000)
t0 (time/now-ns)]
(benchmark n)
(let [elapsed (- (time/now-ns) t0)]
(println (str/format "elapsed_ns={}" elapsed)))))
Pass problem size via *args* (never hardcode), so the runner can sweep
sizes without rebuilding.
Choose three input sizes:
| Label | Guideline | Rationale |
|---|---|---|
| small | < 1 ms wall time | Exercises startup path |
| medium | 100 ms -- 1 s | Exercises algorithm body |
| large | 5 -- 30 s | Exercises memory and cache behaviour |
Avoid inputs so large that a single run takes minutes; reproducibility suffers.
scripts/analyze_results.py computes speedup ratios relative to a C baseline
and writes results/processed/summary.json. Use it rather than eyeballing raw
numbers:
python3 scripts/analyze_results.py
The output normalises all timings to C = 1.0. A Turmeric ratio of 1.4 means Turmeric took 1.4× as long as C on that benchmark.
Before publishing numbers, run:
bash scripts/check_environment.sh
This prints exact compiler and runtime versions and hardware details. Always include this output alongside benchmark results so readers can reproduce.
Use this list before calling a hot path done:
just release (not just build)int, float, not inferred any)vec instead of list where random access or mutation is
neededstr/concat inside loops -- use str/buildertime and at least 5 iterations; CV < 10%