Hi MediaTek team — sharing a reproducible blocker for shipping Gemma 4 E2B (or any ~2B-class decoder) to D9300 (Dimensity 9300) users via Google AI Edge Gallery, after several days of empirical work. Filing as a fresh thread since #1907 (related Public-API optimization-flag gap) is closed.
Setup
- Retail D9300 firmware (NeuroPilot 8.x runtime)
- `mtk_converter` 8.13.0 (Public) + TFLite Shim API
- Custom `runner` binary built against the public `NeuroPilotTFLiteShim.h` from the Native Sample
- Gemma 4 E2B (text-only fork, 35 transformer layers, hidden=1536, A16W{4,8} PTQ)
Empirical results
Single-shard decoder compile budget:
- 18-layer single-shard A16W8 per-channel → compiles cleanly, runs ~750 ms/invoke at seq_len=16.
- 35-layer single-shard at A16W8 per-channel, A16W4 per-tensor, or A16W8 per-tensor → all fail with `NEURON_UNMAPPABLE @ line 1514` at the final `NeuronCompilation_finish`. Op count is ~4500–5000 in each case. The failure is op-count-driven, not weight-byte-driven: halving weight bytes (W8 -> W4) does not unlock compile.
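For triage, here is a rough sketch of the op-count arithmetic behind that claim. It assumes op count scales linearly with decoder layer count and ignores embedding and head ops; that is my assumption, not something measured. The helper name `estimated_ops` is illustrative only.

```python
# Assumption: op count scales linearly with the number of decoder layers
# (embedding/head ops are ignored). Only the 35-layer range was measured.
REF_LAYERS = 35
REF_OPS_RANGE = (4500, 5000)  # reported op count for the 35-layer graph


def estimated_ops(num_layers: int, ref_ops: int, ref_layers: int = REF_LAYERS) -> float:
    """Scale a measured op count to a different layer count (linear assumption)."""
    return ref_ops * num_layers / ref_layers


for layers in (18, 35):
    low, high = (estimated_ops(layers, ops) for ops in REF_OPS_RANGE)
    print(f"{layers} layers: ~{low:.0f}-{high:.0f} ops")
```

Under that assumption the 18-layer shard that compiles would be roughly 2300–2600 ops, so the ceiling would sit somewhere between that and ~4500.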
Per-channel int4 runtime kernel absent:
- A16W4 per-channel is rejected at any size, including 2-layer probes.
- `nm`/`strings` on `/vendor/lib64/libtflite_mtk.so` shows `EvalQuantizedPerChannel16x8` (A16W8) but no `EvalQuantizedPerChannel*4*` equivalent.
- Binary strings table contains the literal “Currently only hybrid, int8 and int16 quantization are supported.”
- For reference, Qualcomm’s published `gemma-4-E2B-it_qualcomm_sm8750.litertlm` bundle (~3 GB) uses per-channel int4, a path apparently unavailable on the D9300 public stack today.
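The symbol check above can be reproduced with a small filter like the one below. This is a sketch, not the exact script I ran. It assumes the input is demangled text (for example from `nm -DC` or `strings` on a pulled copy of the library), and the function names are my own.

```python
import fnmatch
import re

# Matches demangled kernel names such as EvalQuantizedPerChannel16x8.
KERNEL_RE = re.compile(r"EvalQuantizedPerChannel\w*")


def per_channel_kernels(lines):
    """Collect distinct EvalQuantizedPerChannel* names from nm/strings text."""
    found = set()
    for line in lines:
        found.update(KERNEL_RE.findall(line))
    return found


def has_int4_variant(kernels):
    """True if any name matches the glob EvalQuantizedPerChannel*4*."""
    return any(fnmatch.fnmatchcase(k, "EvalQuantizedPerChannel*4*") for k in kernels)


# Illustrative input only: the two strings quoted in this post.
sample = [
    "EvalQuantizedPerChannel16x8",
    "Currently only hybrid, int8 and int16 quantization are supported.",
]
kernels = per_channel_kernels(sample)
print(sorted(kernels), has_int4_variant(kernels))
```

Feeding it the real `nm`/`strings` output line by line gives the same yes/no answer as the manual inspection.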
Compile-flag injection via setCompileOptionByString does not unlock compile (per #1907’s claim about public API not wiring these through):
| Compile-options string | Result |
|---|---|
| `--opt 3` (default) | `NEURON_UNMAPPABLE` |
| `--opt 3 --mdla-mlo` | `NEURON_UNMAPPABLE` |
| `--opt 3 --mem-opt 3` | `NEURON_UNMAPPABLE` |
| `--opt 3 --stable-linearize` | `NEURON_UNMAPPABLE` |
| `--opt 3 --mdla-mlo --mem-opt 3 --stable-linearize` | `NEURON_UNMAPPABLE` |
| `--opt 3 --mdla-mlo --mdla-conv-exp 1 --mem-opt 3 --stable-linearize` | `NEURON_UNMAPPABLE` |
Multi-shard isn’t a viable workaround for Gallery:
- We can compile + chain two 17/18-layer shards on-device manually (~10 tok/s prefill measured).
- But upstream `google-ai-edge/LiteRT-LM` (`ModelResourcesLitertLm::GetTFLiteModel`) keys `model_map_` by a single `ModelType`, accepting one `PREFILL_DECODE` section per bundle. Google AI Edge Gallery uses LiteRT-LM, so a multi-section `.litertlm` won’t run there without upstream runtime changes.
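For context, the 17/18 split follows from the single-shard budget. Here is a minimal sketch of that layer partitioning, assuming the 18-layer figure is the practical per-shard ceiling. `plan_shards` is a hypothetical helper and not part of any MediaTek or LiteRT tooling.

```python
def plan_shards(num_layers, max_layers_per_shard):
    """Split [0, num_layers) into the fewest near-equal contiguous shards
    that each stay within max_layers_per_shard. Returns (start, end) pairs."""
    if num_layers <= 0 or max_layers_per_shard <= 0:
        raise ValueError("layer counts must be positive")
    num_shards = -(-num_layers // max_layers_per_shard)  # ceiling division
    base, extra = divmod(num_layers, num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional layer.
        size = base + (1 if i < extra else 0)
        shards.append((start, start + size))
        start += size
    return shards


print(plan_shards(35, 18))  # [(0, 18), (18, 35)]
```

Each shard is compiled separately, and the hidden state from one shard feeds the next. That manual chaining is the step that a single-section bundle cannot express.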
Concrete asks
- Is the GAI-Deployment-Toolkit + `compile_generative.sh` (with `BACKEND=mdla5.3,edma3.6`, `NUM_MDLA`, `L1_SIZE_KB`, referenced in #2275 and #1966) accessible to community developers? Public NeuroPilot has no download link.
- Will the adapter v9.2.0 that produces the optimized bytecode visible in Google’s `litert-community/gemma-4-E2B-it-litert-lm` Qualcomm/Intel bundles be released as a pip wheel for MediaTek targets?
- Is `NUM_MDLA > 1` (multi-MDLA-partition compile) intended for fitting >18-layer LLM decoders single-shard on D9300/D9400, or is the recommended path for large models the DLA Muxer + multi-`.dla` workflow (and is DLA Muxer accessible to community devs)?
- Are there plans to add a per-channel A16W4 runtime kernel to the D9300/D9400 NeuroPilot delegate?
Happy to share a minimal repro bundle + reproduction commands for triage. Net effect of these gaps: a community developer cannot today ship a Gemma 4 E2B model that runs on D9300 NPU via Google AI Edge Gallery, even though Qualcomm SoC owners get the equivalent experience from Google’s published litert-community bundles.
Thanks.