Garbage transcription when using a Whisper encoder converted via the NeuroPilot SDK (DLA) with a PyTorch decoder

Hi MediaTek Community,

I am working on deploying the Whisper-small model using the MediaTek NeuroPilot Basic SDK. Here is my setup and the issue I am facing:

Setup:

  1. Separated the Whisper model into encoder and decoder.

  2. Converted the encoder → TFLite → DLA using the NeuroPilot SDK.

  3. Kept the decoder in PyTorch running on CPU.

  4. Ran a test audio file through the encoder (DLA) to generate output.bin.

  5. Fed output.bin to the PyTorch decoder for transcription.
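For concreteness, steps 4–5 reduce to something like the sketch below. The (1, 1500, 768) shape is my assumption for whisper-small's encoder output, the FP32 dtype is assumed, and the actual decoder call is elided:

```python
import numpy as np

# Assumed shape for whisper-small: 1500 audio frames x 768 hidden dims.
N_FRAMES, D_MODEL = 1500, 768

def load_encoder_output(path):
    """Read the raw DLA encoder dump and restore the (batch, T, D) layout
    the PyTorch decoder expects (torch.from_numpy(...) in the real pipeline)."""
    flat = np.fromfile(path, dtype=np.float32)  # FP32 dump assumed
    assert flat.size == N_FRAMES * D_MODEL, f"unexpected dump size: {flat.size}"
    return flat.reshape(1, N_FRAMES, D_MODEL)

# Self-contained demo: write a dummy dump in place of the real NPU output
# from step 4, then load it back.
rng = np.random.default_rng(0)
rng.standard_normal((N_FRAMES, D_MODEL)).astype(np.float32).tofile("output.bin")
print(load_encoder_output("output.bin").shape)  # (1, 1500, 768)
```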

Observations / Issues:

  • The transcription produced is garbage / incorrect.

  • Comparing outputs for the same audio file:

    • Encoder TFLite (CPU) output.bin → size ~4.6 MB

    • Encoder DLA (NPU) output.bin → size ~1.2 MB

  • The contents of the two outputs differ significantly, even after dequantization attempts.

  • Feeding either of these output.bin files to the decoder results in random/unintelligible transcription.
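One arithmetic check on the size mismatch (assuming whisper-small's 1500 × 768 encoder output, which I have not confirmed against the SDK): the ~4× ratio between the two dumps is exactly what an FP32-vs-INT8 element width would produce:

```python
# Back-of-envelope check of the two dump sizes, assuming the whisper-small
# encoder output shape (1500 frames x 768 hidden dims) and that the DLA
# dump is INT8 while the TFLite CPU dump is FP32.
N_FRAMES, D_MODEL = 1500, 768
elems = N_FRAMES * D_MODEL

fp32_bytes = elems * 4   # 4,608,000 bytes ≈ the reported ~4.6 MB (CPU dump)
int8_bytes = elems * 1   # 1,152,000 bytes ≈ the reported ~1.2 MB (DLA dump)
print(fp32_bytes, int8_bytes)  # 4608000 1152000
```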

Questions / Help Needed:

  1. Could the difference in output.bin size/content be due to quantization, layout, or NPU-specific optimizations in the SDK?

  2. What is the recommended way to use DLA encoder output with a PyTorch decoder to obtain correct transcription?

  3. How can we verify that the DLA encoder output matches the CPU TFLite output for the same audio input?

  4. Are there any known issues or best practices for combining a DLA encoder with a PyTorch decoder in Whisper pipelines?

  5. Do we need to convert the decoder to TFLite using the SDK as well, or can we use the original PyTorch decoder with the SDK-converted encoder?

  6. Could the mismatch between the SDK-converted encoder and the original Whisper decoder be the reason for incorrect transcription?
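For question 3, a minimal comparison sketch (plain NumPy; it assumes both dumps already hold FP32 values in the same layout, and the demo data below is synthetic):

```python
import numpy as np

def compare_dumps(ref_path, test_path, n_elems):
    """Compare a CPU-TFLite FP32 dump against a (dequantized) DLA dump.
    Both files are assumed to hold FP32 values in the same element order."""
    ref = np.fromfile(ref_path, dtype=np.float32)[:n_elems]
    test = np.fromfile(test_path, dtype=np.float32)[:n_elems]
    max_abs = float(np.abs(ref - test).max())
    cos = float(np.dot(ref, test) / (np.linalg.norm(ref) * np.linalg.norm(test)))
    return max_abs, cos

# Self-contained demo with synthetic data; real use would pass the two
# encoder dumps produced from the same audio clip.
rng = np.random.default_rng(1)
a = rng.standard_normal(1000).astype(np.float32)
a.tofile("ref.bin")
(a + np.float32(1e-3)).tofile("test.bin")
max_abs, cos = compare_dumps("ref.bin", "test.bin", 1000)
print(f"max|diff|={max_abs:.4g}  cosine={cos:.6f}")
```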

Any guidance, examples, or recommended workflow would be highly appreciated.

Thank you!

Hi MediaTek Community,

I am working on deploying the Whisper-small model using the MediaTek NeuroPilot Basic SDK. I previously shared a summary of my issue, and I would like to provide more detailed information about the methods I used and the challenges I am facing.


Setup Overview

  • The Whisper model is split into encoder (NPU) and decoder (PyTorch CPU).

  • Encoder is converted: PyTorch → TFLite → DLA using the NeuroPilot SDK.

  • Decoder remains in PyTorch (FP32), running on CPU.

  • The encoder output (output.bin) is passed to the decoder for transcription.


Detailed Methods Tried

Method 1 (Quantized Encoder – Not Working)

  1. Converted whisper encoder.pt to TFLite and INT8-quantized it using calibration data.

  2. Converted the quantized TFLite model to DLA using the --opt-aggression flag.

  3. Ran the DLA encoder on the NPU → generated output.bin.

  4. Dequantized the encoder output and fed it to the FP32 PyTorch decoder.

Result:

  • Transcription is completely incorrect (garbage text).

  • The encoder output does not match the FP32 reference output.
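For reference, the dequantization in step 4 is a plain per-tensor affine transform. The scale/zero_point values below are placeholders for illustration; in a real run they must come from the quantized model's output tensor details, not be guessed:

```python
import numpy as np

def dequantize(q, scale, zero_point):
    """Per-tensor affine dequantization: real = scale * (q - zero_point).
    scale/zero_point must be read from the quantized model's *output* tensor
    (the 'quantization' field of interpreter.get_output_details()).
    If the model uses per-channel params, scale/zero_point are vectors and
    must be broadcast along the quantized axis instead."""
    return scale * (q.astype(np.float32) - np.float32(zero_point))

# Demo with made-up params; round-trips a few known values.
scale, zp = 0.05, -3
q = np.array([-3, 17, 117], dtype=np.int8)
print(dequantize(q, scale, zp))  # [0. 1. 6.]
```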


Method 2 (FP32 Encoder – Working)

  1. Converted whisper encoder.pt to TFLite without quantization (kept FP32).

  2. Converted this FP32 TFLite model to DLA using --relax-fp32.

  3. Ran the DLA encoder on the NPU → generated output.bin.

Result:

  • The output works correctly with the PyTorch decoder.

  • Transcription is accurate.

  • However, since the decoder runs on CPU, CPU usage is higher than expected.


Key Observations

  • Feeding INT8 encoder output (even after dequantization) leads to wrong transcription.

  • FP32 encoder → accurate output → suggests the NPU conversion pipeline itself is structurally correct, and the failure is specific to quantization.


Questions / Clarifications Needed

A. Regarding Quantized Encoder Failure

  1. Why does the INT8 quantized encoder output not match the FP32/TFLite output?

    • Is Whisper’s encoder not INT8-friendly due to GELU, attention layers, or activation ranges?

    • Is post-training quantization insufficient for Whisper models?

  2. Does the NPU require channel-wise quantization, dynamic ranges, or specific layout constraints for attention-based models?

  3. Are there known limitations in the NeuroPilot SDK for transformer models with multi-head attention when quantized?

  4. Is the opt-aggression mode safe for large sequence models like Whisper?
    Could it be altering precision too aggressively?
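One offline probe that might localize A.1, independent of the SDK: fake-quantize the FP32 reference encoder output in NumPy and measure the signal-to-quantization-noise ratio (SQNR). If even this simulation destroys the signal, the activation ranges (outliers) are the problem rather than the conversion itself. A sketch, with synthetic data standing in for real activations:

```python
import numpy as np

def fake_quant_per_tensor(x):
    """Simulate asymmetric INT8 per-tensor quantization of `x` and return
    (reconstruction, SQNR in dB). High SQNR (roughly > 30 dB) means the
    tensor quantizes cleanly; a few large outliers will crater it."""
    scale = (x.max() - x.min()) / 255.0
    zp = np.round(-x.min() / scale) - 128
    q = np.clip(np.round(x / scale + zp), -128, 127)
    recon = scale * (q - zp)
    noise = x - recon
    sqnr = 10 * np.log10(np.sum(x**2) / np.sum(noise**2))
    return recon, sqnr

rng = np.random.default_rng(0)
smooth = rng.standard_normal(10_000).astype(np.float32)
spiky = smooth.copy()
spiky[0] = 500.0   # a single attention-style outlier blows up the scale
print(f"{fake_quant_per_tensor(smooth)[1]:.1f} dB vs "
      f"{fake_quant_per_tensor(spiky)[1]:.1f} dB")
```

The outlier case shows why per-tensor post-training quantization can fail for transformer activations even when the converter is bug-free.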


B. Regarding FP32 Pipeline & CPU Usage

  1. Is there any recommended method to reduce CPU usage for the PyTorch decoder?

    • Suggested threading settings?

    • Inference optimizations?

    • ONNX/TFLite/MDLA decoder support planned?

  2. Does MediaTek provide any support for running the decoder on the NPU (even partially)?
    For example:

    • Attention layers

    • Linear layers

    • KV caching acceleration
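On B.1, pending an official answer, standard PyTorch CPU knobs (none of them NeuroPilot-specific) may already help. A sketch, with a stand-in `torch.nn.Linear` in place of the real decoder:

```python
import torch

# Pin thread pools so the decoder does not oversubscribe the SoC's big
# cores; tune the counts per device. Inter-op threads must be set before
# any parallel work starts.
torch.set_num_interop_threads(1)
torch.set_num_threads(4)

# Run generation under inference_mode to skip autograd bookkeeping.
model = torch.nn.Linear(768, 51865)   # stand-in for the real decoder
with torch.inference_mode():
    logits = model(torch.zeros(1, 768))
print(torch.get_num_threads(), logits.shape)

# Also worth trying: dynamic INT8 quantization of the decoder's Linear
# layers (weight-only, so usually accuracy-safe, unlike the encoder PTQ):
#   qmodel = torch.ao.quantization.quantize_dynamic(
#       model, {torch.nn.Linear}, dtype=torch.qint8)
```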


C. Pipeline Correctness & Best Practices

  1. Is mixing a DLA encoder with a PyTorch decoder officially supported/recommended?

  2. Should both encoder and decoder be converted with the SDK to ensure cross-compatibility?

  3. How do we correctly validate that DLA encoder output matches TFLite FP32 output?

    • Tools?

    • Debug flags?

    • Supported tolerance ranges?

  4. Are there any sample Whisper-like pipelines available for reference?
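On C.3, absent published tolerances, a home-grown validator could look like the sketch below; the rtol/atol defaults are common FP32-vs-accelerator conventions, not official NeuroPilot numbers:

```python
import numpy as np

def validate(ref, test, rtol=1e-2, atol=1e-3):
    """Report how closely `test` tracks `ref`, element-wise, and flag the
    worst offenders. Tolerances are conventions, not SDK-specified."""
    err = np.abs(ref - test)
    bad = err > atol + rtol * np.abs(ref)      # np.allclose-style criterion
    worst = np.argsort(err)[-3:][::-1]         # indices of 3 largest errors
    return {
        "pass_rate": 1.0 - bad.mean(),
        "max_err": float(err.max()),
        "worst_idx": worst.tolist(),
    }

# Synthetic demo; real use would pass the FP32 TFLite reference output and
# the dequantized DLA output for the same audio clip.
ref = np.linspace(-1, 1, 1000, dtype=np.float32)
test = ref + np.float32(5e-4)                  # uniformly small perturbation
report = validate(ref, test)
print(report["pass_rate"], report["max_err"])
```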


Summary

  • FP32 encoder (relax-fp32) works correctly.

  • INT8 encoder (opt-aggression) fails → leads to invalid decoder input.

  • CPU decoder leads to high CPU usage.

I would appreciate guidance on:

  • Why the quantized encoder fails,

  • How to reduce CPU usage for the decoder,

  • Recommended best practices for Whisper-style encoder–decoder pipelines.

Thank you!