D9300 / Gemma 4 E2B — 35-layer single-shard compile fails NEURON_UNMAPPABLE; multi-shard LiteRT-LM not supported; needed compile flags missing from public API

Hi MediaTek team — sharing a reproducible blocker for shipping Gemma 4 E2B (or any ~2B-class decoder) to D9300 (Dimensity 9300) users via Google AI Edge Gallery, after several days of empirical work. Filing as a fresh thread since #1907 (related Public-API optimization-flag gap) is closed.

Setup

  • Retail D9300 firmware (NeuroPilot 8.x runtime)

  • mtk_converter 8.13.0 (Public) + TFLite Shim API

  • Custom runner binary built against the public NeuroPilotTFLiteShim.h from the Native Sample

  • Gemma 4 E2B (text-only fork, 35 transformer layers, hidden=1536, A16W{4,8} PTQ)

Empirical results

Single-shard decoder compile budget:

  • 18-layer single-shard A16W8 per-channel → compiles cleanly, runs ~750 ms/invoke at seq_len=16.

  • 35-layer single-shard at A16W8 per-channel, A16W4 per-tensor, or A16W8 per-tensor → all fail with NEURON_UNMAPPABLE @ line 1514 at the final NeuronCompilation_finish. Op count is ~4500–5000 in each case. The failure appears to be op-count-driven rather than weight-byte-driven: halving weight bytes (W8 → W4) does not unlock compile.
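
To put the two data points on the same scale, here is a rough back-of-envelope sketch in Python. It assumes the op count grows linearly with layer count and ignores the embedding and LM-head ops, which is a simplification on my part and not something I measured; the function names are only illustrative.

```python
# Rough estimate only: assumes ops scale linearly with layer count and
# ignores embedding / LM-head ops (an assumption, not a measurement).

def ops_per_layer(total_ops: int, num_layers: int) -> float:
    """Average op count per transformer layer."""
    return total_ops / num_layers


def estimated_ops(num_layers: int, per_layer: float) -> int:
    """Estimated op count for a shard with `num_layers` layers."""
    return round(num_layers * per_layer)


# Observed range for the failing 35-layer graph.
OBSERVED_35_LAYER_OPS = (4500, 5000)

for total in OBSERVED_35_LAYER_OPS:
    per_layer = ops_per_layer(total, 35)
    shard_ops = estimated_ops(18, per_layer)
    print(f"35 layers = {total} ops -> ~{per_layer:.0f} ops/layer, "
          f"18-layer shard ~{shard_ops} ops")
```

Under that assumption the 18-layer graph that compiles sits at roughly 2300–2600 ops, so whatever limit is being hit would lie somewhere between that and ~4500 ops.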

Per-channel int4 runtime kernel absent:

  • A16W4 per-channel rejected at any size, including 2-layer probes.

  • nm/strings on /vendor/lib64/libtflite_mtk.so shows EvalQuantizedPerChannel16x8 (A16W8) but no EvalQuantizedPerChannel*4* equivalent.

  • Binary strings table contains the literal “Currently only hybrid, int8 and int16 quantization are supported.”

  • For comparison, Qualcomm’s published gemma-4-E2B-it_qualcomm_sm8750.litertlm bundle (~3 GB) uses per-channel int4, a path apparently unavailable on the D9300 public stack today.
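
For anyone who wants to repeat the symbol check without nm/strings, a minimal Python sketch of the same idea follows. It assumes the library has already been copied off the device (for example with adb pull) and simply scans the raw bytes for printable ASCII runs; the helper names are hypothetical, and the demo runs on synthetic bytes, not the real .so.

```python
import re
from typing import List


def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return printable-ASCII runs of at least `min_len` bytes (like `strings`)."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def per_channel_kernels(strings: List[str]) -> List[str]:
    """Strings that mention a per-channel quantized eval kernel."""
    return [s for s in strings if "EvalQuantizedPerChannel" in s]


def int4_per_channel_kernels(strings: List[str]) -> List[str]:
    """Per-channel kernel names containing a '4' after the common prefix."""
    return [s for s in strings if re.search(r"EvalQuantizedPerChannel\w*4", s)]


# Synthetic stand-in for the library contents, NOT the real binary.
demo = (b"\x00\x01EvalQuantizedPerChannel16x8\x00\x02"
        b"Currently only hybrid, int8 and int16 quantization are supported.\x00")
found = extract_strings(demo)
print(per_channel_kernels(found))
print(int4_per_channel_kernels(found))
```

On a real library you would pass the file contents, e.g. `extract_strings(open(path, "rb").read())`, with `path` pointing at your local copy.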

Compile-flag injection via setCompileOptionByString does not unlock compile (consistent with #1907’s claim that the public API does not wire these flags through):

| Compile-options string | Result |
| --- | --- |
| --opt 3 (default) | NEURON_UNMAPPABLE |
| --opt 3 --mdla-mlo | NEURON_UNMAPPABLE |
| --opt 3 --mem-opt 3 | NEURON_UNMAPPABLE |
| --opt 3 --stable-linearize | NEURON_UNMAPPABLE |
| --opt 3 --mdla-mlo --mem-opt 3 --stable-linearize | NEURON_UNMAPPABLE |
| --opt 3 --mdla-mlo --mdla-conv-exp 1 --mem-opt 3 --stable-linearize | NEURON_UNMAPPABLE |
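
In case it helps with triage, this is roughly how the six option strings above can be generated and swept. It is a sketch only: the Python driver and the `compile_fn` callback are hypothetical stand-ins for whatever invokes the runner (in my case a native binary that passes the string to setCompileOptionByString), and the stub below just returns the result I observed.

```python
from typing import Callable, Dict, List

BASE_FLAGS = ["--opt", "3"]

# Extra flags appended to the base, one entry per row of the sweep.
EXTRA_FLAG_SETS: List[List[str]] = [
    [],
    ["--mdla-mlo"],
    ["--mem-opt", "3"],
    ["--stable-linearize"],
    ["--mdla-mlo", "--mem-opt", "3", "--stable-linearize"],
    ["--mdla-mlo", "--mdla-conv-exp", "1", "--mem-opt", "3", "--stable-linearize"],
]


def build_option_strings() -> List[str]:
    """Join base + extra flags into the strings handed to the compiler."""
    return [" ".join(BASE_FLAGS + extra) for extra in EXTRA_FLAG_SETS]


def sweep(compile_fn: Callable[[str], str]) -> Dict[str, str]:
    """Run `compile_fn` (hypothetical hook) once per option string."""
    return {opts: compile_fn(opts) for opts in build_option_strings()}


def observed_stub(_opts: str) -> str:
    """Stub standing in for the real runner; mirrors what I saw on-device."""
    return "NEURON_UNMAPPABLE"


for opts, result in sweep(observed_stub).items():
    print(f"{opts!r}: {result}")
```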

Multi-shard isn’t a viable workaround for Gallery:

  • We can compile + chain two 17/18-layer shards on-device manually (~10 tok/s prefill measured).

  • But upstream google-ai-edge/LiteRT-LM (ModelResourcesLitertLm::GetTFLiteModel) keys model_map_ by a single ModelType, accepting one PREFILL_DECODE section per bundle. Google AI Edge Gallery uses LiteRT-LM, so a multi-section .litertlm won’t run there without upstream runtime changes.
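
The manual two-shard workaround is conceptually simple. A minimal Python sketch of the split-and-chain logic is below; the per-shard limit of 18 comes from the single-shard result above, while the callables standing in for compiled shards are placeholders (the real ones are separate compiled graphs invoked from the native runner, and KV-cache plumbing is omitted here).

```python
import math
from typing import Callable, List, Sequence, Tuple


def split_layers(num_layers: int, max_per_shard: int) -> List[Tuple[int, int]]:
    """Split layers into the fewest contiguous, near-equal shards.

    Returns half-open (start, end) layer ranges.
    """
    num_shards = math.ceil(num_layers / max_per_shard)
    base, rem = divmod(num_layers, num_shards)
    bounds = []
    start = 0
    for i in range(num_shards):
        size = base + (1 if i < rem else 0)  # first `rem` shards get one extra
        bounds.append((start, start + size))
        start += size
    return bounds


def run_chained(shards: Sequence[Callable], hidden):
    """Feed each shard's output hidden state into the next shard."""
    for shard in shards:
        hidden = shard(hidden)
    return hidden


bounds = split_layers(35, 18)
print(bounds)

# Placeholder shards: each just counts how many layers it "ran".
fake_shards = [lambda h, n=end - start: h + n for start, end in bounds]
print(run_chained(fake_shards, 0))
```

The point is that the chaining itself is trivial; the blocker is that the bundle format and runtime used by Gallery expect a single section.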

Concrete asks

  1. Is the GAI-Deployment-Toolkit + compile_generative.sh (with BACKEND=mdla5.3,edma3.6, NUM_MDLA, L1_SIZE_KB, referenced in #2275 and #1966) accessible to community developers? Public NeuroPilot has no download link for it.

  2. Will the adapter v9.2.0 that produces the optimized bytecode visible in Google’s litert-community/gemma-4-E2B-it-litert-lm Qualcomm/Intel bundles be released as a pip wheel for MediaTek targets?

  3. Is NUM_MDLA > 1 (multi-MDLA-partition compile) intended for fitting >18-layer LLM decoders single-shard on D9300/D9400, or is the recommended path for large models the DLA Muxer + multi-.dla workflow (and is DLA Muxer accessible to community devs)?

  4. Are there plans to add a per-channel A16W4 runtime kernel to the D9300/D9400 NeuroPilot delegate?

Happy to share a minimal repro bundle + reproduction commands for triage. Net effect of these gaps: a community developer cannot today ship a Gemma 4 E2B model that runs on D9300 NPU via Google AI Edge Gallery, even though Qualcomm SoC owners get the equivalent experience from Google’s published litert-community bundles.

Thanks.