Which method should be chosen to convert platform-compatible TFLite (.tflite) models?

Hi, I want to deploy TFLite models on the Genio-520 and perform real-time inference on it via an Android app or Python. Which framework would you recommend for converting to TFLite? Should I use the official open-source TensorFlow converter, or the MTK Converter?

If I use the MTK converter for conversion, the resulting TFLite will contain MTK custom ops. Can these be used for online compilation? Or is offline compilation the only option?

I have reviewed the discussions in "How to Generate a Platform-Compatible TFLite (.tflite) Model" and "What is the difference between Online Compilation and Offline Compilation". I would like to confirm which approach you recommend.

Thanks!

Thank you for this question.

The Genio-520 belongs to the NP8 generation and supports both online inference (on-device compilation) and offline inference (host-side compilation using ncc-tflite). The choice depends on your runtime pattern and performance requirements.

1. Online vs. Offline Inference on Genio-520

If your application runs real-time inference but each model is executed only a few times (for example, short-lived tools, occasional inference, or low-duty-cycle tasks):

  • We recommend offline inference here.
  • Reason: the first-time model compilation cost of online inference can be significant and is model-dependent. Paying that cost for only a handful of inferences is often inefficient.
  • Please note that offline inference has no CPU fallback: if you want to deploy the model on the NPU, every operator must be supported by the NPU.

If your application is long-running and performs many inferences after startup (for example, an always-on service or camera pipeline):

  • Online inference is also a valid option.
  • The model is compiled once at initialization; subsequent inferences reuse the compiled graph and do not pay the compilation cost again.
  • For Android apps, this is often the natural integration path (through Neuron Delegate / Shim API).
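The break-even intuition behind this choice can be sketched with simple arithmetic. All numbers below are hypothetical placeholders, not measured Genio-520 figures; substitute your own profiling results:

```python
# Hypothetical timings, in milliseconds -- substitute your own measurements.
online_compile_ms = 3000.0   # one-time on-device compilation cost (online path)
per_inference_ms = 10.0      # steady-state inference latency (same on both paths)

def total_online_ms(n_inferences: int) -> float:
    """Total wall-clock cost of the online path: compile once, then infer n times."""
    return online_compile_ms + n_inferences * per_inference_ms

def total_offline_ms(n_inferences: int) -> float:
    """Offline path: compilation was already paid on the host, so only inference remains."""
    return n_inferences * per_inference_ms

# For a handful of inferences the one-time compile cost dominates...
print(total_online_ms(5), total_offline_ms(5))            # 3050.0 vs 50.0
# ...while a long-running service amortizes it to near zero per inference.
print(total_online_ms(100_000), total_offline_ms(100_000))
```

The crossover point depends entirely on your model's compile time and duty cycle, which is why both paths remain valid options.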

2. Using MTK Converter vs. Official TensorFlow Converter

A practical workflow for Genio-520 is:

  1. Start with the official TensorFlow converter

    • Use the official open-source TensorFlow TFLite converter to:
      • Confirm that your model is compatible with standard TFLite.
      • Ensure all operators are supported by the TFLite runtime.
    • This step gives you a “baseline” TFLite model that should run on CPU (and possibly GPU) with the standard TFLite interpreter.
  2. Then apply the MTK Converter for NPU optimization

    • Once you have confirmed that the model works in standard TFLite, you can use the MTK Converter (NP Converter) to generate a TFLite model enhanced with MediaTek custom operations.
    • These MTK custom ops are designed to better match the NPU (MDLA) capabilities and can provide:
      • Better NPU coverage.
      • Higher performance and more efficient acceleration.

This two-step approach ensures:

  • Functional correctness is first validated with the official converter.
  • Then NPU-optimized variants are produced via MTK Converter for deployment on Genio-520.
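As a concrete sketch of step 1 (the MTK Converter invocation in step 2 is tool-specific and not shown), the official converter path looks roughly like this. The tiny model below is a stand-in for your own network:

```python
import tensorflow as tf

# Stand-in model purely for illustration; substitute your own
# SavedModel or Keras model here.
class TinyModel(tf.Module):
    def __init__(self):
        self.w = tf.Variable(tf.ones([8, 4]))

    @tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
    def __call__(self, x):
        return tf.matmul(x, self.w)

model = TinyModel()
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [model.__call__.get_concrete_function()], model)
tflite_bytes = converter.convert()

# Baseline sanity check: the model should load and run on the standard
# CPU interpreter before any MTK-specific optimization is attempted.
interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
```

Only after this baseline model runs cleanly on the standard interpreter would you feed it to the MTK Converter to produce the NPU-optimized variant.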

3. Can MTK Custom-Op TFLite Models Be Used for Online Compilation?

Yes, on NP8 platforms, TFLite models with MTK custom ops can be used with online inference, with the following behavior:

  • Yocto (Genio-520, NP8):

    • stable_delegate can execute MTK custom ops on the NPU (MDLA) during online inference.
    • Operators that are not supported by the NPU but are implemented by the official TFLite CPU kernels will fall back to CPU.
    • Language support:
      • stable_delegate currently provides C/C++ integration (aligned with TensorFlow’s official C/C++ TFLite APIs).
      • The Python TFLite APIs on Yocto currently support CPU/GPU only; they do not expose stable_delegate for NPU acceleration.
  • Android (Genio-520, NP8):

    • TFLite models containing MTK custom ops can be deployed in Android apps via TFLite Shim API + Neuron Delegate.
    • MTK custom ops are preferentially scheduled to the MDLA.
    • Operators unsupported by NPU but supported by TFLite CPU kernels will fall back to CPU.
    • This path is suitable for online compilation within Android apps.

In both OS environments, the goal is:

  • MTK custom ops → NPU (MDLA).
  • Non-supported ops (but supported by TFLite) → CPU fallback.
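Since the Python TFLite APIs on Yocto are currently CPU/GPU only, a Python pipeline there uses the standard interpreter with your baseline model (the variant without MTK custom ops, which the standard interpreter cannot resolve). A minimal sketch, converting a trivial stand-in model in memory so the snippet is self-contained:

```python
import numpy as np
import tensorflow as tf

# In-memory stand-in for your baseline model (the variant WITHOUT MTK
# custom ops); in practice you would pass model_path="model.tflite".
@tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
def tiny(x):
    return 2.0 * x

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [tiny.get_concrete_function()])
tflite_bytes = converter.convert()

# Standard CPU inference loop, as available from Python on Yocto today:
interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.ones(inp["shape"], dtype=np.float32)
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
y = interpreter.get_tensor(out["index"])  # expected: all elements equal 2.0
```

For NPU acceleration on Yocto you would instead integrate stable_delegate through the C/C++ TFLite APIs, as described above.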

4. Recommended Approach Summary

For conversion:

  • Use the official TensorFlow converter first:
    • Validate TFLite compatibility and basic correctness.
  • Then use the MTK Converter (NP Converter):
    • Generate NPU-friendly TFLite with MTK custom ops for Genio-520.
    • This can improve coverage and performance on the NPU.

For runtime execution:

  • Short-lived / low-count inference workloads:
    • Prefer offline inference (host-side compilation → DLA → neuronrt on device).
  • Long-running / high-throughput applications:
    • Online inference is acceptable:
      • Yocto: stable_delegate via C/C++ TFLite APIs (NPU + CPU fallback).
      • Android: Shim API + Neuron Delegate in the Android app.

Important constraints:

  • On Yocto, Python-based workflows can currently use only CPU/GPU (no stable delegate for NPU from Python).
  • For NPU acceleration and MTK custom ops, use C/C++ on Yocto or Shim API / Neuron Delegate on Android.
  • Offline inference does not have CPU fallback.