Hi MediaTek Community,
I am working on deploying the Whisper (small) model using the MediaTek Neuropilot Basic SDK. Here is my setup and the issue I am facing:
Setup:
- Separated the Whisper model into encoder and decoder.
- Converted the encoder → TFLite → DLA using the Neuropilot SDK.
- Kept the decoder in PyTorch, running on CPU.
- Ran a test audio file through the encoder (DLA) to generate output.bin.
- Fed output.bin to the PyTorch decoder for transcription.
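For context, this is roughly how I hand the encoder dump to the decoder. A sketch only: the raw-float32 dtype and the Whisper-small output shape (1 × 1500 × 768) are assumptions about my export, not values read from the model.

```python
import numpy as np

# Assumed Whisper-small encoder output shape; adjust to your export.
SEQ_LEN, D_MODEL = 1500, 768

def load_encoder_states(path):
    """Read a headerless little-endian float32 dump in (1, seq, dim) row-major order."""
    buf = np.fromfile(path, dtype="<f4")
    if buf.size != SEQ_LEN * D_MODEL:
        raise ValueError(f"unexpected element count: {buf.size}")
    return buf.reshape(1, SEQ_LEN, D_MODEL)

# The array is then handed to the FP32 PyTorch decoder, e.g.:
# enc = torch.from_numpy(load_encoder_states("output.bin"))
# tokens = decode(decoder, enc)  # decode() stands in for my decoding loop
```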
Observations / Issues:
- The transcription produced is garbage/incorrect.
- Comparing outputs for the same audio file:
  - The contents of the two outputs differ significantly, even after dequantization attempts.
  - Feeding either of these output.bin files to the decoder results in random/unintelligible transcription.
Questions / Help Needed:
- Could the difference in output.bin size/content be due to quantization, layout, or NPU-specific optimizations in the SDK?
- What is the recommended way to use the DLA encoder output with a PyTorch decoder to obtain correct transcription?
- How can we verify that the DLA encoder output matches the CPU TFLite output for the same audio input?
- Are there any known issues or best practices for combining a DLA encoder with a PyTorch decoder in Whisper pipelines?
- Do we need to convert the decoder to TFLite using the SDK as well, or can we use the original PyTorch decoder with the SDK-converted encoder?
- Could a mismatch between the SDK-converted encoder and the original Whisper decoder be the reason for the incorrect transcription?
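On the verification question above, the low-tech check I have been attempting is a direct element-wise diff of the two dumps. This assumes both files are raw float32 in the same layout (the file names are placeholders):

```python
import numpy as np

def compare_dumps(path_a, path_b):
    """Compare two raw float32 dumps element-wise; returns (max_abs_diff, cosine)."""
    a = np.fromfile(path_a, dtype="<f4")
    b = np.fromfile(path_b, dtype="<f4")
    if a.size != b.size:
        # A size mismatch already hints at layout/padding/quantization differences.
        raise ValueError(f"size mismatch: {a.size} vs {b.size}")
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return float(np.abs(a - b).max()), cos

# e.g. compare_dumps("output_dla.bin", "output_tflite.bin")
```

A cosine close to 1.0 together with a small max-diff would suggest the two paths agree; is that the comparison methodology you would recommend?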
Any guidance, examples, or recommended workflow would be highly appreciated.
Thank you!
Hi MediaTek Community,
I am working on deploying the Whisper-small model using the MediaTek Neuropilot Basic SDK. I previously shared a summary of my issue, and I would like to provide more detailed information about the methods I used and the challenges I am facing.
Setup Overview
- The Whisper model is split into an encoder (NPU) and a decoder (PyTorch, CPU).
- Encoder conversion path: PyTorch → TFLite → DLA using the Neuropilot SDK.
- Decoder remains in PyTorch (FP32), running on CPU.
- The encoder output (output.bin) is passed to the decoder for transcription.
Detailed Methods Tried
Method 1 (Quantized Encoder – Not Working)
- Converted the Whisper encoder.pt to TFLite and INT8-quantized it using calibration data.
- Converted the quantized TFLite model to DLA using the --opt-aggression flag.
- Ran the DLA encoder on the NPU → generated output.bin.
- Dequantized the encoder output and fed it to the FP32 PyTorch decoder.
Result: the decoder produces garbage/unintelligible transcription.
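For completeness, my dequantization step follows the standard affine scheme, real = scale × (q − zero_point), with scale/zero-point taken from the quantized TFLite model's output tensor. Whether the DLA dump really is raw int8 in that tensor's layout is an assumption I have not been able to confirm:

```python
import numpy as np

def dequantize(q, scale, zero_point):
    """Affine dequantization: real = scale * (q - zero_point).
    scale/zero_point can be read from the quantized TFLite model, e.g.
    tf.lite.Interpreter(...).get_output_details()[0]['quantization']."""
    return scale * (q.astype(np.float32) - zero_point)

# q = np.fromfile("output.bin", dtype=np.int8)  # assumes a raw int8 dump
# enc = dequantize(q, scale, zero_point)
```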
Method 2 (FP32 Encoder – Working)
- Converted the Whisper encoder.pt to TFLite without quantization (kept FP32).
- Converted this FP32 TFLite model to DLA using --relax-fp32.
- Ran the DLA encoder on the NPU → generated output.bin.
Result:
- The output works correctly with the PyTorch decoder.
- Transcription is accurate.
- However, since the decoder runs on CPU, CPU usage is higher than expected.
Key Observations
- FP32 encoder output (relax-fp32) decodes correctly; INT8 encoder output (opt-aggression) does not.
- Running the decoder on CPU causes high CPU usage.
Questions / Clarifications Needed
A. Regarding Quantized Encoder Failure
- Why does the INT8-quantized encoder output not match the FP32/TFLite output?
- Is Whisper’s encoder not INT8-friendly due to GELU, attention layers, or activation ranges?
- Is post-training quantization insufficient for Whisper models?
- Does the NPU require channel-wise quantization, dynamic ranges, or specific layout constraints for attention-based models?
- Are there known limitations in the Neuropilot SDK for transformer models with multi-head attention when quantized?
- Is the --opt-aggression mode safe for large-sequence models like Whisper? Could it be altering precision too aggressively?
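In case it helps narrow down section A: this is the standard TensorFlow Lite full-integer PTQ recipe I based my quantization on (per-channel weight quantization is the TFLite default here). The saved-model path is a placeholder, the random calibration data stands in for real log-mel features, and the input shape (1 × 80 × 3000 for 30 s of audio) is an assumption about my export:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Placeholder calibration loop: real log-mel features from representative
    # audio clips should be yielded here, one sample per step.
    for _ in range(100):
        yield [np.random.rand(1, 80, 3000).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("whisper_encoder_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer quantization of the graph.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_quant_model = converter.convert()
```

Is this recipe expected to be sufficient for Whisper's attention layers, or does the SDK need additional hints?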
B. Regarding FP32 Pipeline & CPU Usage
- Is there any recommended method to reduce CPU usage for the PyTorch decoder?
- Does MediaTek provide any support for running the decoder on the NPU (even partially)? For example:
  - Attention layers
  - Linear layers
  - KV-caching acceleration
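While waiting on NPU offload options, I am aware of two generic PyTorch levers for a CPU-bound decoder. A sketch only: `decoder` is a placeholder for my model, and the thread count is something to tune per device:

```python
import torch

# Cap intra-op CPU threads; oversubscription often raises CPU usage
# without improving latency for small autoregressive decoders.
torch.set_num_threads(2)

# Run the decoding loop without autograd bookkeeping:
# decoder.eval()
# with torch.inference_mode():
#     tokens = decoder(...)  # greedy/beam decoding loop goes here
```

Is there anything more MediaTek-specific beyond these?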
C. Pipeline Correctness & Best Practices
- Is mixing a DLA encoder with a PyTorch decoder officially supported/recommended?
- Should both encoder and decoder be converted with the SDK to ensure cross-compatibility?
- How do we correctly validate that the DLA encoder output matches the TFLite FP32 output?
- Are there any sample Whisper-like pipelines available for reference?
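On the validation point: besides diffing against a TFLite CPU run, a quick statistical sanity check on the raw dump has helped me spot obviously broken buffers early. This assumes a raw float32 dump; the interpretation of "off-scale" is a judgment call:

```python
import numpy as np

def sanity_stats(path):
    """Summary statistics of a raw float32 dump; NaNs/Infs or wildly
    off-scale magnitudes usually indicate a dtype/layout mismatch upstream."""
    x = np.fromfile(path, dtype="<f4")
    return {
        "count": int(x.size),
        "nan": int(np.isnan(x).sum()),
        "inf": int(np.isinf(x).sum()),
        "mean": float(np.nanmean(x)),
        "std": float(np.nanstd(x)),
        "abs_max": float(np.nanmax(np.abs(x))),
    }
```

Would MediaTek consider this kind of check meaningful for DLA outputs, or is there a recommended validation tool in the SDK?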
Summary
- FP32 encoder (--relax-fp32) works correctly.
- INT8 encoder (--opt-aggression) fails → leads to invalid decoder input.
- Running the decoder on CPU leads to high CPU usage.
I would appreciate guidance on:
- why the quantized encoder fails,
- how to reduce CPU usage for the decoder,
- recommended best practices for Whisper-style encoder–decoder pipelines.
Thank you!