## Summary
Following the official AOT Compilation Tutorial and using the publicly released `ai-edge-litert` + `ai-edge-litert-sdk-mediatek` packages, the AOT-compiled output for MediaTek NPUs contains only `--relax-fp32` as its compilation flag. In contrast, Google's own pre-compiled models (e.g., `Qwen3-0.6B.mediatek.mt6993.litertlm`) embed 19+ MDLA optimization flags, resulting in ~153x better performance on device.
## Environment
- Host OS: Ubuntu 24.04, x86_64
- Python: 3.12
- Packages tested:
  - `ai-edge-litert==2.1.3` (stable) + `ai-edge-litert-sdk-mediatek==0.2.0` (stable)
  - `ai-edge-litert-nightly==2.2.0.dev20260316` + `ai-edge-litert-sdk-mediatek-nightly==0.2.0.dev20260309`
- Target device: Vivo V2509A (MT6993 / Dimensity 9500), Android 16
## Reproduction Steps
```python
from ai_edge_litert.aot.aot_compile import aot_compile
from ai_edge_litert.aot.vendors.mediatek import target as mtk_target

target = mtk_target.Target(mtk_target.SocModel.MT6993)
compiled = aot_compile("model.tflite", target=target, keep_going=True)
```
Compilation reports success ("1586 / 1586 ops offloaded to 1 partitions") and produces a ~989 MB output file.
## The Problem
Inspecting the compiled bytecode with `strings` reveals a critical difference:
Our AOT output (from the public pip packages):

```
{"CompilerName": "adapter", "CompilerVersion": "9.2.1", "Neuron SHA1": "b3a8289961"}
--relax-fp32
```
Google's official AOT (extracted from `Qwen3-0.6B.mediatek.mt6993.litertlm`):

```
{"CompilerName": "adapter", "CompilerVersion": "9.2.0", "Neuron SHA1": "6863b3d946"}
--relax-fp32 --show-exec-plan --mdla-mlo-match-info --mdla-flash-attention-mode 0 --opt 3 --opt-footprint --opt-accuracy --gno LTS,Inception --gno-exp --gno-non-4d-tiling --fc-to-conv --mdla-broadcast-act-wgt 1 --broadcast-flow-distance 63 --mdla-set-conv-xy-split-ic-threshold 99999 --mdla-mlo --mdla-conv-exp 1 --mem-opt 3 --stable-linearize --l1-size-kb 7168 --num-mdla 4
```
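For reference, a minimal Python equivalent of the `strings`-based inspection above (the file name is illustrative):

```python
# Extract printable-ASCII runs from the compiled output and print the compiler
# metadata and flag strings. `strings <file>` gives the same view. Reads the
# whole file into memory, which is fine for a one-off check.
import re

def dump_compiler_strings(path: str) -> None:
    with open(path, "rb") as f:
        data = f.read()
    for run in re.findall(rb"[\x20-\x7e]{4,}", data):
        text = run.decode("ascii")
        if text.startswith("--") or '"CompilerName"' in text:
            print(text)

dump_compiler_strings("model_mt6993_aot.bin")  # illustrative path
```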
The pip-released adapter (v9.2.1, SHA `b3a8289961`) and Google's internal adapter (v9.2.0, SHA `6863b3d946`) are different binaries that produce different bytecode.
## Verified Results
| Variant | Adapter Version | Optimization Flags | Output Size | Device Performance |
|---|---|---|---|---|
| Public stable (2.1.3) | 9.2.1 (`b3a8289961`) | `--relax-fp32` only | 989 MB | ~2900 ms/token |
| Public nightly (2.2.0.dev) | 9.2.1 (`b3a8289961`) | `--relax-fp32` only | 989 MB | same as stable |
| Google official litertlm | 9.2.0 (`6863b3d946`) | 19+ MDLA flags | 993 MB | ~19 ms/token |
- Stable and nightly produce identical results: same adapter SHA, same flags, same output size.
- Tested across multiple SoCs (MT6878, MT6985, MT6989, MT6991, MT6993, MT6897, MT6877) — all emit only `--relax-fp32`.
- On device (MT6993), the host AOT bytecode from the public packages is rejected by the device NeuroPilot runtime (`NeuronModel_restoreFromCompiledNetwork - Failed to load compiled network`), which silently falls back to JIT (see the logcat sketch after this list).
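The fallback is observable in the device logs; a minimal watcher sketch, assuming `adb` is available (matching on a plain substring is an assumption about the log format, not an exact spec):

```python
# Watch device logs for the restore failure that precedes the silent JIT
# fallback. Assumes adb is on PATH and a device is connected; the substring
# match below is an assumption, not the exact NeuroPilot log format.
import subprocess

proc = subprocess.Popen(["adb", "logcat"], stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    if "NeuronModel_restoreFromCompiledNetwork" in line:
        print("AOT bytecode rejected; runtime is falling back to JIT:")
        print("  " + line.strip())
        break
proc.terminate()
```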
## Additional Observations
- The `apply_plugin_main` binary accepts flags such as `--mediatek_enable_gemma_compiler_optimizations=true`, `--mediatek_performance_mode_type=turbo_boost`, `--mediatek_optimization_hint=low_latency`, and `--mediatek_enable_l1_cache_optimizations=true` on the command line, but the underlying adapter binary does not apply them to the actual bytecode; the flags are silently ignored (a byte-level check is sketched after this list).
- Even wrapping `libneuron_adapter.so` to intercept `NeuronCompilation_createWithOptions` and inject the 19 MDLA flags only embeds them as metadata strings; the v9.2.1 compiler does not actually use them during NPU code generation.
- The official AOT tutorial notebook only demonstrates small vision models (selfie segmentation), where the performance gap may not be noticeable. There is no documentation for LLM AOT compilation with optimization flags.
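One simple way to check whether a flag actually influenced code generation is to hash a default build against a flag-annotated build; a minimal sketch, with illustrative file names:

```python
# Compare two compiled outputs byte-for-byte via SHA-256. Identical digests
# mean the flags had no effect at all. Note: the libneuron_adapter.so wrapper
# above embeds flags as metadata strings, so a differing digest alone does not
# prove the flags reached NPU code generation.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

default = sha256_of("out_default.bin")        # illustrative path
flagged = sha256_of("out_with_mdla_flags.bin")  # illustrative path
print("flags ignored" if default == flagged else "outputs differ")
```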
## Questions
- Is there a way to pass MDLA optimization flags (e.g., `--opt 3 --num-mdla 4 --l1-size-kb 7168`) through the public Python API that actually affects bytecode generation?
- Will the adapter version used in the official `.litertlm` bundles (v9.2.0, SHA `6863b3d946`) be released in a future pip package?
- Is on-device JIT the intended compilation path for external developers, with host AOT reserved for Google's internal use?