Hi, I want to deploy TFLite on the Genio-520 and perform real-time inference via an Android app or Python. Which tool would you recommend for converting to TFLite? Should I use the official open-source TensorFlow converter, or the MTK Converter?
If I use the MTK converter for conversion, the resulting TFLite will contain MTK custom ops. Can these be used for online compilation? Or is offline compilation the only option?
I have reviewed the discussions in "How to Generate a Platform-Compatible TFLite (.tflite) Model" and "What is the difference between Online Compilation and Offline Compilation", and I would like to confirm which approach you prefer.
Thanks!
Thank you for this question.
Genio-520 belongs to the NP8 generation and supports both online inference (on-device compilation) and offline inference (host-side compilation using ncc-tflite). The choice depends on your runtime pattern and performance requirements.
1. Online vs. Offline Inference on Genio-520
If your application runs real-time inference but each model is executed only a few times (for example, short-lived tools, occasional inference, or low-duty-cycle tasks):
- We recommend offline inference.
- Reason: The first-time model compilation cost in online inference can be significant and is model-dependent. Paying that cost for a small number of inferences is often inefficient.
- Please note that offline inference has no CPU fallback: if you want to deploy the model on the NPU, every op in the model must be supported by the NPU.
If your application is long-running and performs many inferences after startup (for example, an always-on service or camera pipeline):
- Online inference is also a valid option.
- The model is compiled once at initialization; subsequent inferences reuse the compiled graph and do not pay the compilation cost again.
- For Android apps, this is often the natural integration path (through Neuron Delegate / Shim API).
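One way to reason about this trade-off is to look at how the one-time online compilation cost amortizes over the number of inferences. A minimal sketch (the millisecond figures in the comments are hypothetical, not measured on Genio-520):

```python
def compile_overhead_per_inference(compile_ms: float, n_inferences: int) -> float:
    """Amortized share of the one-time on-device compilation cost per inference."""
    return compile_ms / n_inferences

# Hypothetical 3-second on-device compile:
#   over 10 inferences     -> 300 ms of extra cost per run (significant: prefer offline)
#   over 100_000 inferences -> 0.03 ms of extra cost per run (negligible: online is fine)
print(compile_overhead_per_inference(3000, 10))
print(compile_overhead_per_inference(3000, 100_000))
```

The break-even point depends entirely on your model and workload, so measure the actual first-inference latency on your device before committing to one approach.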
2. Using MTK Converter vs. Official TensorFlow Converter
A practical workflow for Genio-520 is:
Step 1: Start with the official TensorFlow converter
- Use the official open-source TensorFlow TFLite converter to:
  - Confirm that your model is compatible with standard TFLite.
  - Ensure all operators are supported by the TFLite runtime.
- This step gives you a “baseline” TFLite model that should run on CPU (and possibly GPU) with the standard TFLite interpreter.
Step 2: Apply the MTK Converter for NPU optimization
- Once you have confirmed that the model works in standard TFLite, use the MTK Converter (NP Converter) to generate a TFLite model enhanced with MediaTek custom operations.
- These MTK custom ops are designed to better match the NPU (MDLA) capabilities and can provide:
  - Better NPU coverage.
  - Higher performance and more efficient acceleration.
This two-step approach ensures:
- Functional correctness is first validated with the official converter.
- Then NPU-optimized variants are produced via MTK Converter for deployment on Genio-520.
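Step 1 above (the baseline conversion) might look like the following sketch. It builds a tiny stand-in Keras model so the snippet is self-contained; in practice you would load your own model. The API calls are the standard `tf.lite.TFLiteConverter` ones:

```python
import tensorflow as tf

# Tiny stand-in model so the sketch is self-contained; substitute your own
# SavedModel or Keras model here.
inputs = tf.keras.Input(shape=(4,))
outputs = tf.keras.layers.Dense(2, activation="softmax")(inputs)
model = tf.keras.Model(inputs, outputs)

# Step 1: baseline conversion with the official open-source converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Restrict to built-in TFLite ops so any unsupported operator fails loudly
# here, rather than at runtime on the device.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
tflite_model = converter.convert()  # bytes of the .tflite flatbuffer

# Sanity-check the baseline model on the standard CPU interpreter.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
```

Only after this baseline model loads and runs correctly on the standard interpreter would you feed it (or the original model) to the MTK Converter for the NPU-optimized variant.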
3. Can MTK Custom-Op TFLite Models Be Used for Online Compilation?
Yes, on NP8 platforms, TFLite models with MTK custom ops can be used with online inference. In both OS environments (Yocto and Android), the expected op placement is:
- MTK custom ops → NPU (MDLA).
- Ops not supported by the NPU (but supported by TFLite) → CPU fallback.
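This fallback behavior can be pictured as a simple partitioning step: ops the delegate claims run on the NPU, everything else stays on the TFLite CPU runtime. A toy sketch (the supported-op set below is made up for illustration; real coverage is decided by the Neuron Delegate, not a hard-coded list):

```python
# Hypothetical set of ops the NPU delegate claims (illustrative only).
NPU_SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV_2D", "MTK_CUSTOM_OP"}

def partition(ops: list[str]) -> dict[str, list[str]]:
    """Split a model's op list into NPU-delegated ops and CPU fallbacks."""
    placement = {"NPU": [], "CPU": []}
    for op in ops:
        placement["NPU" if op in NPU_SUPPORTED else "CPU"].append(op)
    return placement

model_ops = ["CONV_2D", "MTK_CUSTOM_OP", "TOPK_V2"]
print(partition(model_ops))
# {'NPU': ['CONV_2D', 'MTK_CUSTOM_OP'], 'CPU': ['TOPK_V2']}
```

Note that this split only exists for online inference; in the offline flow the "CPU" bucket must be empty, since there is no fallback.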
4. Recommended Approach Summary
For conversion:
- Use the official TensorFlow converter first:
- Validate TFLite compatibility and basic correctness.
- Then use the MTK Converter (NP Converter):
- Generate NPU-friendly TFLite with MTK custom ops for Genio-520.
- This can improve coverage and performance on the NPU.
For runtime execution:
- Short-lived / low-count inference workloads:
  - Prefer offline inference (host-side compilation → DLA → neuronrt on device).
- Long-running / high-throughput applications:
- Online inference is acceptable:
  - Yocto: stable_delegate via C/C++ TFLite APIs (NPU + CPU fallback).
- Android: Shim API + Neuron Delegate in the Android app.
Important constraints:
- On Yocto, Python-based workflows can currently use only CPU/GPU (no stable delegate for NPU from Python).
- For NPU acceleration and MTK custom ops, use C/C++ on Yocto or Shim API / Neuron Delegate on Android.
- Offline inference does not have CPU fallback; every op must be supported on the NPU.
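To make the offline pipeline concrete (ncc-tflite on the host → DLA artifact → neuronrt on the device), here is a sketch that assembles the two command lines. The flag spellings and the architecture string are assumptions for illustration; check the NeuroPilot documentation for the Genio-520's actual options:

```python
def offline_pipeline_cmds(tflite_path: str, dla_path: str,
                          arch: str = "mdla5.0") -> list[list[str]]:
    """Build the host-side compile and device-side run commands.

    Flag names and the `arch` value are illustrative assumptions;
    consult the NeuroPilot docs for the real Genio-520 values.
    """
    compile_cmd = ["ncc-tflite", f"--arch={arch}", tflite_path, "-o", dla_path]
    run_cmd = ["neuronrt", "-m", dla_path]  # executed on the device
    return [compile_cmd, run_cmd]

compile_cmd, run_cmd = offline_pipeline_cmds("model.tflite", "model.dla")
```

The key point is the split: compilation happens once on the host, and the device only ever loads the pre-built DLA, which is why no CPU fallback exists in this flow.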