Accelerating AI on Genio with the ONNX Runtime NeuronExecutionProvider

Starting with PR4, our Genio Yocto builds include built-in support for ONNX Runtime (ORT), making it easier than ever to deploy high-performance AI models.

We currently support ORT version 1.20.2 with the following execution providers (EPs):

  • CPUExecutionProvider

  • XnnpackExecutionProvider

  • NeuronExecutionProvider

Introducing the NeuronExecutionProvider

The NeuronExecutionProvider is the official ONNX Runtime execution provider for leveraging the power of the MediaTek NPU. When you use this EP, it intelligently deploys your model to the NPU, and any unsupported operations automatically fall back to the CPU for seamless execution. It supports both the Python and C++ APIs.

Platform Availability

  • Genio 720 / Genio 520: The NeuronExecutionProvider is available by default.

  • Genio 700 / Genio 510: To enable the NeuronExecutionProvider, please contact your MediaTek representative for more information.

  • All other Genio products: The CPUExecutionProvider and XnnpackExecutionProvider are available by default.

Getting Started: Python API Example

Here is a simple example of how to initialize an ONNX Runtime session with the NeuronExecutionProvider on Genio 720/520:

# Dummy Python example script for running inference on an FP32 ONNX model

import onnxruntime as ort
import numpy as np

# Path to your ONNX model
model_path = "your_model.onnx"

# Neuron provider options (set as needed)
neuron_provider_options = {
    "NEURON_FLAG_USE_FP16": "1",           # Allow FP32 to execute as FP16
    "NEURON_FLAG_MIN_GROUP_SIZE": "1"      # Set minimum subgraph size
}

# Create session with NeuronExecutionProvider and options
providers = [("NeuronExecutionProvider", neuron_provider_options), "XnnpackExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession(model_path, providers=providers)

# Generate dummy input data based on model input shape and type
def generate_dummy_inputs(session):
    inputs = {}
    for input_meta in session.get_inputs():
        shape = [dim if isinstance(dim, int) else 1 for dim in input_meta.shape]
        data = np.random.randn(*shape).astype(np.float32)
        inputs[input_meta.name] = data
    return inputs

# Prepare inputs and run inference
dummy_inputs = generate_dummy_inputs(session)
output_names = [output.name for output in session.get_outputs()]
outputs = session.run(output_names, dummy_inputs)

print("Inference outputs:", outputs)


The NeuronExecutionProvider has two important provider options:

  • NEURON_FLAG_USE_FP16: Executes FP32 models in FP16 precision. Note: This flag is mandatory, as the NPU does not support FP32 execution. Your model will fail to run without it.

  • NEURON_FLAG_MIN_GROUP_SIZE: This flag sets the minimum number of nodes for a subgraph to be offloaded to the NPU. For larger models, setting this to a value greater than 1 may improve performance.

The NPU does not support dynamic tensor shapes, so it is imperative to make any dynamic shapes static before deployment. You can find more information on making dynamic shapes static here


Hi Narain,
I would like to run an FP16 ONNX model on the NPU, such as yolov8s-fp16.onnx.
The yolov8s-fp16.onnx model was converted using the following command:

$ yolo export model=yolov8s.pt format=onnx dynamic=False opset=12

However, when using
providers = [("NeuronExecutionProvider", neuron_provider_options), "XnnpackExecutionProvider", "CPUExecutionProvider"],
the following error occurs.

Traceback (most recent call last):
  File "/root/ryan/testOnnx.py", line 18, in <module>
    session = ort.InferenceSession(model_path, providers=providers)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 465, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/lib/python3.12/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 537, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /usr/src/debug/onnxruntime/1.20.2/onnxruntime/core/optimizer/transformer_memcpy.cc:263 void onnxruntime::TransformerMemcpyImpl::ProcessDefs(onnxruntime::Node&, const onnxruntime::KernelRegistryManager&, onnxruntime::InitializedTensorSet&) Execution type 'XnnpackExecutionProvider' doesn't support memcpy

When changing to providers = [("NeuronExecutionProvider", neuron_provider_options)], the model can run, but it seems to be falling back to the CPU.
In my testing, the FP32 inference time is 61.90 ms (16.16 FPS), while the FP16 inference time is 692.48 ms (1.44 FPS).

Is it possible to execute an FP16 ONNX model directly on the Genio 720/520 NPU using Python?
Thanks, and look forward to reading your response.

Hi @YuChin_Lin ,

FP32 models on the NeuronExecutionProvider are already executed at FP16 precision.
Direct loading of FP16 models is currently under development, but aside from the smaller file size, an FP16 model offers the same accuracy and precision.
I suggest continuing to deploy FP32 models on this EP, as they will be executed at FP16 precision.
For faster inference, you can quantize the model to INT8; the Neuron EP supports QDQ ops.