NPU Deployment Issue — Whisper Model (Genio 510)

Hi,
I’m trying to deploy the Whisper model on the MediaTek Genio 510 NPU (MDLA), and I’m stuck at the final step. Below is the exact workflow I followed and where I’m blocked.

Steps Completed So Far

  1. Downloaded the Whisper PyTorch model (.pt) and converted the encoder to TFLite using mtk_pytorch_converter.

  2. I could not convert the encoder and decoder together into a single TFLite file (expected, since the Whisper decoder is not supported for full TFLite conversion).

  3. Quantized the TFLite encoder with calibration data, again using mtk_pytorch_converter.

  4. Generated the .dla model using:
    ncc-tflite --arch=mdla3.0
    (the full commands for steps 1-4 are sketched after this list)

  5. Now I have a Whisper encoder in DLA format and I want to execute it on the Genio 510 board’s NPU.
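
For completeness, here is roughly what steps 1-4 looked like as commands. The file names, input shape, and calibration settings are placeholders from my setup, and I am quoting the converter/compiler flags from the NeuroPilot docs as best I can, so they may need adjusting for your SDK version:

    # Steps 1-3: convert the encoder from PyTorch to a quantized TFLite file.
    # (flags as in the NeuroPilot converter docs; the 1,80,3000 shape is the
    # Whisper 30 s mel-spectrogram input -- adjust for your export)
    mtk_pytorch_converter \
        --input_script_module_file=whisper_encoder.pt \
        --output_file=whisper_encoder_quant.tflite \
        --input_shapes=1,80,3000 \
        --quantize=True \
        --calibration_data_dir=calib_data/ \
        --calibration_data_regexp='batch_.*\.npy'

    # Step 4: compile the quantized TFLite file to a DLA for the MDLA 3.0 NPU.
    ncc-tflite --arch=mdla3.0 whisper_encoder_quant.tflite -o whisper_encoder.dla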

Where I’m Stuck — Step 5

I don’t know how to actually run the .dla file on the Genio 510 board.

My Questions

1) To run a .dla model on the MDLA NPU, do I need to use the C++ Runtime API from the SDK?

The only API I can find in the SDK is the NeuroPilot Runtime API, which appears to be C++-oriented.

2) Can I run the DLA model using Python?

Or is Python completely unsupported for MDLA inference?
If Python is not possible, do I need to move my entire encoder-inference logic to C++ using the runtime SDK?

Summary of What I Want

I want to load and run my Whisper encoder DLA model on the Genio 510 NPU.
I need guidance on how to execute a .dla file, whether from Python or C++, and on the recommended method.

If you have any suggestions or a proper workflow, I would be truly grateful.

Thanks for the detailed description of your workflow and setup.

Let me address your questions first:

  1. Do I need to use the C++ Runtime API from the SDK to run a .dla model on the MDLA NPU?
    Yes. For application-level integration on MDLA today, the supported path is the NeuroPilot Runtime API (C/C++); a minimal sketch follows this list.

  2. Can I run the DLA model using Python?
    Not at the moment. MDLA inference is not exposed through a Python API, so you cannot directly load and run the .dla model from Python at this time.
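
To make the C++ path concrete, below is a minimal sketch of loading and running a .dla file with the Neuron Runtime API. It mirrors the sample flow in the NeuroPilot documentation, but treat every detail (header path, struct fields, error constants) as version-dependent and verify against the RuntimeAPI.h shipped with your SDK; the file name and the single I/O handle are placeholders for your encoder:

    // Minimal sketch only: follows the Neuron Runtime API flow from the
    // NeuroPilot docs, but exact struct fields, constants, and header paths
    // can differ between SDK versions -- check RuntimeAPI.h on your BSP.
    // The .dla path and the I/O handle (0) are placeholders for the encoder.
    #include <cstdint>
    #include <cstdio>
    #include <vector>
    #include "RuntimeAPI.h"  // Neuron Runtime API header from the NeuroPilot SDK

    int main() {
        // Create a runtime targeting real hardware (MDLA), not the simulator.
        EnvOptions options = {};
        options.deviceKind = kEnvOptHardware;
        void* runtime = nullptr;
        if (NeuronRuntime_create(&options, &runtime) != NEURONRUNTIME_NO_ERROR) {
            fprintf(stderr, "failed to create Neuron runtime\n");
            return 1;
        }

        // Load the compiled Whisper encoder.
        if (NeuronRuntime_loadNetworkFromFile(runtime, "whisper_encoder.dla")
                != NEURONRUNTIME_NO_ERROR) {
            fprintf(stderr, "failed to load whisper_encoder.dla\n");
            NeuronRuntime_release(runtime);
            return 1;
        }

        // Query buffer sizes for I/O handle 0 and bind plain user buffers.
        size_t inSize = 0, outSize = 0;
        NeuronRuntime_getInputSize(runtime, 0, &inSize);
        NeuronRuntime_getOutputSize(runtime, 0, &outSize);
        std::vector<uint8_t> input(inSize);    // fill with preprocessed mel features
        std::vector<uint8_t> output(outSize);  // receives the encoder hidden states

        BufferAttribute attr = {-1};  // ionFd = -1: ordinary (non-ION) memory
        NeuronRuntime_setInput(runtime, 0, input.data(), inSize, attr);
        NeuronRuntime_setOutput(runtime, 0, output.data(), outSize, attr);

        // Run one inference on the MDLA, then release the runtime.
        if (NeuronRuntime_inference(runtime) != NEURONRUNTIME_NO_ERROR) {
            fprintf(stderr, "inference failed\n");
        }
        NeuronRuntime_release(runtime);
        return 0;
    }

You would cross-compile this for the board and link it against the Neuron runtime library shipped in the Genio image (the exact library name varies by BSP release).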


Recommended approach right now:

  • For analyzing model behavior and performance on hardware, you can use the neuronrt tool (example invocations after this list) to:
    • Inspect how the model is mapped to the NPU,
    • Check performance characteristics and utilization on the Genio 510.
  • For development and integration, please refer to the NeuroPilot Runtime API (C/C++), as sketched above. That is the recommended way to:
    • Load your Whisper encoder .dla file,
    • Configure inputs/outputs,
    • Run inference on the MDLA NPU.
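
For example (I am recalling the option spellings from the Genio documentation, so please double-check them against neuronrt's built-in help on your board before relying on them):

    # Dump how the network was compiled, and its input/output handles and sizes.
    neuronrt -a whisper_encoder.dla -d

    # Run a single inference on hardware using raw binary input/output buffers.
    neuronrt -m hw -a whisper_encoder.dla -i input.bin -o output.bin -c 1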

We plan to provide a Stable Delegate with Python API support in a future release. Once it is available, you will be able to perform NPU inference (including on MDLA) directly from Python, which should make integration with Python-based workflows much easier.

Future updates will be published on the bulletin.