I am working on deploying the Whisper-small model using the MediaTek Neuropilot Basic SDK. I previously shared a summary of my issue, and I would like to provide more detailed information about the methods I used and the challenges I am facing.
Setup Overview
The Whisper model is split into encoder (NPU) and decoder (PyTorch CPU).
Encoder is converted: PyTorch → TFLite → DLA using the Neuropilot SDK.
Decoder remains in PyTorch (FP32), running on CPU.
The encoder output (output.bin) is passed to the decoder for transcription.
Detailed Methods Tried
Method 1 (Quantized Encoder – Not Working)
Converted whisper encoder.pt to TFLite and INT8-quantized it using calibration data.
Converted the quantized TFLite model to DLA using the --opt-aggression flag.
Ran the DLA encoder on the NPU → generated output.bin.
Dequantized the encoder output and fed it to the FP32 PyTorch decoder.
Result:
Transcription is completely incorrect (garbage text).
The encoder output does not match the FP32 reference output.
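For clarity, a minimal sketch of the dequantization step I apply in Method 1, assuming standard affine INT8 quantization (the scale/zero-point values and the output shape below are placeholders; the real ones come from the quantized TFLite model's output tensor details):

```python
import numpy as np

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Affine dequantization: real = scale * (q - zero_point)."""
    return (scale * (q.astype(np.int32) - zero_point)).astype(np.float32)

# In the real pipeline the INT8 buffer comes from the NPU, e.g.:
#   q = np.fromfile("output.bin", dtype=np.int8).reshape(1, 1500, 768)
# and scale/zero_point are read from the quantized TFLite model's
# output tensor metadata (interpreter.get_output_details()[0]["quantization"]).
q = np.array([[-5, 0, 122]], dtype=np.int8)   # stand-in for NPU output
x = dequantize(q, scale=0.05, zero_point=-5)  # FP32 decoder input
```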
Method 2 (FP32 Encoder – Working)
Converted whisper encoder.pt to TFLite without quantization (kept FP32).
Converted this FP32 TFLite model to DLA using --relax-fp32.
Ran the DLA encoder on the NPU → generated output.bin.
Result:
The output works correctly with the PyTorch decoder.
Transcription is accurate.
However, since the decoder runs on CPU, CPU usage is higher than expected.
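For completeness, this is roughly how the FP32 output.bin is consumed, assuming Whisper-small's encoder output shape of (1, 1500, 768); the decoder call at the end is hypothetical and depends on your decoder wrapper:

```python
import numpy as np

# Assumed Whisper-small encoder output shape: (batch, frames, d_model).
ENC_SHAPE = (1, 1500, 768)

def load_encoder_states(path: str) -> np.ndarray:
    """Read the DLA encoder's raw FP32 output and restore its shape."""
    x = np.fromfile(path, dtype=np.float32)
    assert x.size == np.prod(ENC_SHAPE), "output.bin size mismatch"
    return x.reshape(ENC_SHAPE)

# The PyTorch decoder then uses these states for cross-attention, e.g.
# (hypothetical wrapper call):
#   states = torch.from_numpy(load_encoder_states("output.bin"))
#   tokens = decoder.generate(encoder_hidden_states=states)
```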
Key Observations
Feeding INT8 encoder output (even after dequantization) leads to wrong transcription.
FP32 encoder → accurate output → suggests the NPU execution pipeline itself is correct, and the failure is specific to INT8 quantization.
Questions / Clarifications Needed
A. Regarding Quantized Encoder Failure
Why does the INT8 quantized encoder output not match the FP32/TFLite output?
Is Whisper’s encoder not INT8-friendly due to GELU, attention layers, or activation ranges?
Is post-training quantization insufficient for Whisper models?
Does the NPU require channel-wise quantization, dynamic ranges, or specific layout constraints for attention-based models?
Are there known limitations in the Neuropilot SDK for transformer models with multi-head attention when quantized?
Is the --opt-aggression mode safe for large sequence models like Whisper?
Could it be altering precision too aggressively?
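As a self-check on the activation-range question above, one can simulate per-tensor symmetric INT8 quantization on dumped intermediate activations and look at the relative reconstruction error; a few outliers (common after attention or GELU) widen the scale for the whole tensor. This is only a diagnostic sketch, not the SDK's actual quantization scheme:

```python
import numpy as np

def int8_quant_error(x: np.ndarray) -> float:
    """Simulate symmetric per-tensor INT8 quantization and return the
    relative reconstruction error; large values hint the tensor is
    INT8-unfriendly under per-tensor quantization."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return float(np.linalg.norm(x - q * scale) / (np.linalg.norm(x) + 1e-12))

# A single outlier widens the quantization range and degrades everything else:
well_behaved = np.linspace(-1.0, 1.0, 1000)
with_outlier = np.append(well_behaved, 100.0)  # e.g. an activation spike
print(int8_quant_error(well_behaved), int8_quant_error(with_outlier))
```

If the error jumps only on specific tensors, per-channel quantization or keeping those layers in FP16/FP32 may be the answer, which is what I am hoping MediaTek can confirm.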
B. Regarding FP32 Pipeline & CPU Usage
Is there any recommended method to reduce CPU usage for the PyTorch decoder?
Suggested threading settings?
Inference optimizations?
ONNX/TFLite/MDLA decoder support planned?
Does MediaTek provide any support for running decoder on NPU (even partially)?
For example:
Attention layers
Linear layers
KV caching acceleration
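In the meantime, a minimal sketch of the thread-capping approach I would try first (the thread counts are placeholders to tune per device; the torch calls are shown as comments since they must run inside the actual decoder process):

```python
import os

# Cap the OpenMP/MKL thread pools *before* torch is imported --
# these variables are only read once, at library load time.
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["MKL_NUM_THREADS"] = "2"

# Inside the decoder process, the equivalent explicit caps are:
#   import torch
#   torch.set_num_threads(2)          # intra-op parallelism
#   torch.set_num_interop_threads(1)  # inter-op parallelism
# and wrapping generation in torch.inference_mode() removes
# autograd bookkeeping overhead on every decode step.
```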
C. Pipeline Correctness & Best Practices
Is mixing a DLA encoder with a PyTorch decoder officially supported/recommended?
Should both encoder and decoder be converted with the SDK to ensure cross-compatibility?
How do we correctly validate that DLA encoder output matches TFLite FP32 output?
Tools?
Debug flags?
Supported tolerance ranges?
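Absent official tooling, a hedged sketch of how such a comparison could be done with NumPy (the 1e-3 tolerance is my own assumption, which is exactly why I am asking about supported ranges):

```python
import numpy as np

def compare_outputs(ref: np.ndarray, test: np.ndarray, atol: float = 1e-3) -> dict:
    """Compare a DLA output buffer against the TFLite FP32 reference."""
    r = ref.ravel().astype(np.float64)
    t = test.ravel().astype(np.float64)
    cos = float(r @ t / (np.linalg.norm(r) * np.linalg.norm(t) + 1e-12))
    return {
        "max_abs_diff": float(np.max(np.abs(r - t))),
        "cosine_similarity": cos,
        "within_atol": bool(np.allclose(r, t, atol=atol)),
    }

# e.g.:
#   ref  = FP32 TFLite interpreter output for the same input mel features
#   test = np.fromfile("output.bin", dtype=np.float32).reshape(ref.shape)
```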
Are there any sample Whisper-like pipelines available for reference?
Summary
FP32 encoder (--relax-fp32) works correctly.
INT8 encoder (--opt-aggression) fails → leads to invalid decoder input.
CPU decoder leads to high CPU usage.
I would appreciate guidance on:
Why the quantized encoder fails,
How to reduce CPU usage for the decoder,
Recommended best practices for Whisper-style encoder–decoder pipelines.
I’m also trying to accelerate Whisper on Genio. With Yocto v25.1 including ONNX Runtime + NeuronExecutionProvider, is the recommended path to export Whisper (encoder-only or full pipeline) to ONNX and run it via ONNX Runtime/Neuron EP, rather than using the offline TFLite → DLA flow?
Yes, I’m also exploring different acceleration paths for Whisper on Genio. The ONNX Runtime + NeuronExecutionProvider route available from Yocto v25.1 sounds very promising.
If possible, could you please share a bit more detail on:
Whether you’re exporting only the encoder or the full Whisper pipeline to ONNX
Which opset version you’re using
Any model modifications needed before export
Whether you observed better stability compared to the offline TFLite → DLA flow
I’d really appreciate any insights on the conversion and runtime setup steps you’re following.