MediaTek AOT compilation via public API produces unoptimized bytecode (missing 19 MDLA flags vs Google's official litertlm)

Summary

When following the official AOT Compilation Tutorial with the publicly released ai-edge-litert + ai-edge-litert-sdk-mediatek packages, AOT compilation for MediaTek NPUs embeds only a single compilation flag, --relax-fp32. In contrast, Google’s own pre-compiled models (e.g., Qwen3-0.6B.mediatek.mt6993.litertlm) embed 19+ MDLA optimization flags, resulting in ~153x better performance on device.

Environment

  • Host OS: Ubuntu 24.04, x86_64

  • Python: 3.12

  • Packages tested:

    • ai-edge-litert==2.1.3 (stable) + ai-edge-litert-sdk-mediatek==0.2.0 (stable)

    • ai-edge-litert-nightly==2.2.0.dev20260316 + ai-edge-litert-sdk-mediatek-nightly==0.2.0.dev20260309

  • Target device: Vivo V2509A — MT6993 / Dimensity 9500, Android 16

Reproduction Steps

# Minimal repro, following the public AOT Compilation Tutorial
from ai_edge_litert.aot.aot_compile import aot_compile
from ai_edge_litert.aot.vendors.mediatek import target as mtk_target

# MT6993 = Dimensity 9500 (the target device's SoC)
target = mtk_target.Target(mtk_target.SocModel.MT6993)
compiled = aot_compile("model.tflite", target=target, keep_going=True)

Compilation reports success — “1586 / 1586 ops offloaded to 1 partitions” — and produces a ~989MB output file.

The Problem

Inspecting the bytecode with strings reveals a critical difference:

Our AOT output (from public pip packages):

{"CompilerName": "adapter", "CompilerVersion": "9.2.1", "Neuron SHA1": "b3a8289961"}
--relax-fp32

Google’s official AOT (extracted from Qwen3-0.6B.mediatek.mt6993.litertlm):

{"CompilerName": "adapter", "CompilerVersion": "9.2.0", "Neuron SHA1": "6863b3d946"}
--relax-fp32 --show-exec-plan --mdla-mlo-match-info --mdla-flash-attention-mode 0 --opt 3 --opt-footprint --opt-accuracy --gno LTS,Inception --gno-exp --gno-non-4d-tiling --fc-to-conv --mdla-broadcast-act-wgt 1 --broadcast-flow-distance 63 --mdla-set-conv-xy-split-ic-threshold 99999 --mdla-mlo --mdla-conv-exp 1 --mem-opt 3 --stable-linearize --l1-size-kb 7168 --num-mdla 4
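For reference, the strings-based inspection above can be approximated in pure Python. This is an illustrative sketch (the function names and the commented-out file path are hypothetical, not part of any LiteRT API):

```python
import re


def extract_strings(data: bytes, min_len: int = 6) -> list[str]:
    """Return runs of printable ASCII bytes, similar to the Unix `strings` tool."""
    pattern = rb"[ -~]{%d,}" % min_len  # ' ' (0x20) through '~' (0x7e)
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def compiler_flags(data: bytes) -> list[str]:
    """Keep only embedded strings that look like compiler flag lines."""
    return [s for s in extract_strings(data) if s.lstrip().startswith("--")]


# Hypothetical usage against the AOT output file:
# with open("model_mt6993_compiled.bin", "rb") as f:
#     print("\n".join(compiler_flags(f.read())))
```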

The pip-released adapter (v9.2.1, SHA b3a8289961) and Google’s internal adapter (v9.2.0, SHA 6863b3d946) are different binaries that produce different bytecode.
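Parsing the two flag lines (treating each `--`-prefixed token as a flag and everything else as a value argument) confirms the count in the title: exactly 19 flags are present in Google's bytecode but absent from the public output.

```python
# Flag lines copied verbatim from the two bytecode dumps above.
GOOGLE_FLAGS = (
    "--relax-fp32 --show-exec-plan --mdla-mlo-match-info "
    "--mdla-flash-attention-mode 0 --opt 3 --opt-footprint --opt-accuracy "
    "--gno LTS,Inception --gno-exp --gno-non-4d-tiling --fc-to-conv "
    "--mdla-broadcast-act-wgt 1 --broadcast-flow-distance 63 "
    "--mdla-set-conv-xy-split-ic-threshold 99999 --mdla-mlo --mdla-conv-exp 1 "
    "--mem-opt 3 --stable-linearize --l1-size-kb 7168 --num-mdla 4"
)
PUBLIC_FLAGS = "--relax-fp32"


def flag_names(flag_line: str) -> set[str]:
    """Extract the --flag tokens, ignoring their value arguments."""
    return {tok for tok in flag_line.split() if tok.startswith("--")}


missing = flag_names(GOOGLE_FLAGS) - flag_names(PUBLIC_FLAGS)
print(len(missing))  # 19 flags in Google's bytecode that ours lacks
```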

Verified Results

Variant                    | Adapter Version    | Optimization Flags | Output Size | Device Performance
Public stable (2.1.3)      | 9.2.1 (b3a8289961) | --relax-fp32 only  | 989 MB      | ~2900 ms/token
Public nightly (2.2.0.dev) | 9.2.1 (b3a8289961) | --relax-fp32 only  | 989 MB      | ~2900 ms/token (same)
Google official litertlm   | 9.2.0 (6863b3d946) | 19+ MDLA flags     | 993 MB      | ~19 ms/token
  • Stable and nightly produce identical results — same adapter SHA, same flags, same output size.

  • Tested across multiple SoCs (MT6878, MT6985, MT6989, MT6991, MT6993, MT6897, MT6877) — all only have --relax-fp32.

  • On device (MT6993), the host AOT bytecode from the public packages is rejected by the device NeuroPilot runtime (NeuronModel_restoreFromCompiledNetwork - Failed to load compiled network), which then silently falls back to JIT compilation.
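Because the fallback is silent, it is easy to miss. A small helper can scan captured adb logcat output for the restore failure quoted above (the helper name and log-capture flow are illustrative, not part of any official tooling; the message text is the one observed on device):

```python
def detect_jit_fallback(logcat_lines: list[str]) -> bool:
    """Heuristically detect the silent AOT->JIT fallback by looking for the
    NeuroPilot restore failure in device log lines."""
    needle = "NeuronModel_restoreFromCompiledNetwork"
    return any(needle in line and "Failed" in line for line in logcat_lines)


# Hypothetical usage:
#   adb logcat -d > device.log        # capture on the host
#   detect_jit_fallback(open("device.log").read().splitlines())
```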

Additional Observations

  1. The apply_plugin_main binary accepts flags like --mediatek_enable_gemma_compiler_optimizations=true, --mediatek_performance_mode_type=turbo_boost, --mediatek_optimization_hint=low_latency, --mediatek_enable_l1_cache_optimizations=true via command line — but the underlying adapter binary does not apply them to the actual bytecode. The flags are silently ignored.

  2. Even when libneuron_adapter.so is wrapped to intercept NeuronCompilation_createWithOptions and inject the 19 MDLA flags, the flags are merely embedded as metadata strings; the v9.2.1 compiler does not actually apply them during NPU code generation.

  3. The official AOT tutorial notebook only demonstrates small vision models (selfie segmentation) where the performance gap may not be noticeable. There is no documentation for LLM AOT compilation with optimization flags.
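A cheap way to demonstrate observation 1 is to hash the artifacts produced with and without the extra CLI flags: byte-identical digests mean the flags had no effect on code generation. The harness below is a sketch with hypothetical output paths; only the hashing itself is concrete.

```python
import hashlib


def bytecode_digest(path: str) -> str:
    """SHA-256 of a compiled artifact, for byte-identical comparison."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()


# Hypothetical comparison of two apply_plugin_main runs:
# assert bytecode_digest("out_default.bin") == bytecode_digest("out_turbo_boost.bin"), \
#     "the flags changed the bytecode after all"
```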

Questions

  1. Is there a way to pass MDLA optimization flags (e.g., --opt 3 --num-mdla 4 --l1-size-kb 7168) through the public Python API that actually affects bytecode generation?

  2. Will the adapter version used in the official .litertlm bundles (v9.2.0, SHA 6863b3d946) be released in a future pip package?

  3. Is on-device JIT the intended compilation path for external developers, with host AOT reserved for Google’s internal use?