Benchmarks

How to reproduce internal and external AIXI comparisons.

Benchmark claims should be reproducible before they are persuasive.

External baselines in scope

The current benchmark documentation explicitly tracks two external reference implementations:

  • PyAIXI, used as a Python baseline for MC-AIXI-style experiments [pyaixi_repo].
  • The older MC-AIXI C++ codebase, used as a native AC-CTW reference point [mcaixi_cpp_repo].

These are baselines, not bundled dependencies. The repository keeps them external and drives them through harness scripts, so each comparison stays inspectable rather than relying on silently vendored copies.

1. In-repo AIQI vs MC-AIXI

For quick relative comparisons inside this repository, use:

scripts/bench_aiqi_vs_aixi.sh

This harness is intentionally narrow. It is for comparing the two planner paths in the current repo, not for reproducing the AIQI paper against external implementations.
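Because the claim at the top of this page is that results should be reproducible before they are persuasive, a relative comparison is more convincing when it is backed by several runs rather than one. The wrapper below is a hypothetical sketch (the function name and log layout are not part of the repo); it only assumes the harness script named above:

```shell
# Hypothetical convenience wrapper (not part of the repo): run the in-repo
# harness several times and keep one log per run, so a relative comparison
# is backed by more than a single sample.
repeat_bench() {
  # $1: number of repetitions; $2: directory to collect per-run logs
  mkdir -p "$2"
  i=1
  while [ "$i" -le "$1" ]; do
    # Each run's stdout/stderr is captured; a failed run still leaves a log.
    scripts/bench_aiqi_vs_aixi.sh > "$2/run-$i.log" 2>&1 || true
    i=$((i + 1))
  done
}
```

For example, `repeat_bench 5 /tmp/aiqi-logs` leaves five logs that can be diffed or summarized by hand.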

2. Compare against PyAIXI

If you already have a PyAIXI clone, run:

scripts/bench_aixi_vs_pyaixi.sh --pyaixi-root /path/to/pyaixi

This path gives a direct comparison between the current infotheory MC-AIXI stack and the Python reference implementation without pulling in the full Guix-pinned multi-implementation harness.
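Since the PyAIXI clone is an external prerequisite, automation around this step should skip cleanly when the clone is absent. A minimal guard, as a hypothetical sketch (the wrapper function is illustrative, not part of the repo):

```shell
# Hypothetical guard (illustrative, not part of the repo): only attempt the
# PyAIXI comparison when a local clone actually exists, so automation
# skips cleanly instead of failing mid-run.
run_pyaixi_bench() {
  # $1: path to a PyAIXI clone
  if [ -d "$1" ]; then
    scripts/bench_aixi_vs_pyaixi.sh --pyaixi-root "$1"
  else
    echo "skip: no PyAIXI clone at $1"
  fi
}
```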

3. Reproducible multi-implementation runs

For the heavier, more reproducible comparison across:

  • infotheory Rust CLI,
  • infotheory Python bindings,
  • PyAIXI,
  • and MC-AIXI C++,

use the Guix wrapper:

scripts/bench_aixi_competitors_guix.sh --trials 3 --profile default

The script writes plot-ready TSV files under target/aixi-competitors/<timestamp>/, records the commands used, and pins the external competitor repositories by commit for the run.
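Because the run directories are named by timestamp, the newest run can be located with a plain lexicographic sort. A minimal sketch (the helper name is illustrative; it assumes only the timestamp-named subdirectories described above):

```shell
# Minimal sketch: timestamp-named run directories sort chronologically
# when sorted lexicographically, so the newest run is simply the last name.
latest_run() {
  # $1: root directory holding one subdirectory per run,
  #     e.g. target/aixi-competitors
  ls -1 "$1" | sort | tail -n 1
}
```

From there, the TSV files inside that directory can be fed straight to a plotting tool.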

Comparison notes

The runner distinguishes between the directly comparable AC-CTW path and the repository's FAC-CTW variants. That distinction matters: FAC-CTW is informative and often useful, but it is not a like-for-like drop-in for older AC-CTW reference implementations [veness2011_mcaixi].

The benchmark runner also normalizes reward reporting for Kuhn Poker because some older implementations print encoded reward symbols rather than native-domain rewards. That normalization step is part of the comparison logic rather than an after-the-fact spreadsheet fix.
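To illustrate what such a normalization step looks like, the sketch below undoes a fixed-offset encoding. The +2 offset is an assumption chosen purely for illustration; the actual encoding used by any given implementation has to be read off its source:

```shell
# Illustrative only: map an encoded, non-negative reward symbol back to a
# native-domain Kuhn Poker reward. The +2 offset is a hypothetical example
# of the kind of shift an older implementation might apply to keep reward
# symbols non-negative; it is NOT taken from any specific codebase here.
decode_reward() {
  # $1: encoded reward symbol (non-negative integer)
  echo $(( $1 - 2 ))
}
```

Keeping this decoding inside the runner, per implementation, is what makes the reported rewards directly comparable across columns of the output TSV.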