Accelerating AI on Genio with the ONNX Runtime NeuronExecutionProvider

Starting with PR4, our Genio Yocto builds include built-in support for ONNX Runtime (ORT), making it easier than ever to deploy high-performance AI models.

We currently support ORT version 1.20.2 with the following execution providers (EPs):

  • CPUExecutionProvider

  • XnnpackExecutionProvider

  • NeuronExecutionProvider

Introducing the NeuronExecutionProvider

The NeuronExecutionProvider is the official ONNX Runtime execution provider for the MediaTek NPU. When you use this EP, supported operations in your model are offloaded to the NPU, and any unsupported operations automatically fall back to the CPU for seamless execution. It supports both the Python and C++ APIs.

Platform Availability

  • Genio 720 / Genio 520: The NeuronExecutionProvider is available by default.

  • Genio 700 / Genio 510: To enable the NeuronExecutionProvider, please contact your MediaTek representative for more information.

  • All other Genio products: Ship with default access to the CPUExecutionProvider and XnnpackExecutionProvider. You can verify which providers your build exposes as shown in the snippet below.
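
Before targeting the NPU, it can help to confirm which execution providers your ONNX Runtime build actually exposes. The short sketch below is a minimal check using the standard onnxruntime Python API; the provider names are the ones listed above.

import onnxruntime as ort

# List the execution providers compiled into this ONNX Runtime build
available = ort.get_available_providers()
print("Available execution providers:", available)

# On Genio 720/520 (or a Genio 700/510 with the Neuron EP enabled),
# "NeuronExecutionProvider" should appear in this list.
if "NeuronExecutionProvider" not in available:
    print("Neuron EP not available on this build; only CPU/XNNPACK will be used.")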

Getting Started: Python API Example

Here is a simple example of how to initialize an ONNX Runtime session with the NeuronExecutionProvider on Genio 720/520:

# Dummy Python example script for running inference on an FP32 ONNX model

import onnxruntime as ort
import numpy as np

# Path to your ONNX model
model_path = "your_model.onnx"

# Neuron provider options (set as needed)
neuron_provider_options = {
    "NEURON_FLAG_USE_FP16": "1",           # Allow FP32 to execute as FP16
    "NEURON_FLAG_MIN_GROUP_SIZE": "1"      # Set minimum subgraph size
}

# Create the session; ORT tries the providers in the order listed and falls back as needed
providers = [("NeuronExecutionProvider", neuron_provider_options),
             "XnnpackExecutionProvider",
             "CPUExecutionProvider"]
session = ort.InferenceSession(model_path, providers=providers)

# Generate random dummy inputs from the model's input metadata (assumes FP32 inputs)
def generate_dummy_inputs(session):
    inputs = {}
    for input_meta in session.get_inputs():
        # Replace any symbolic/dynamic dimensions with a fixed size of 1
        shape = [dim if isinstance(dim, int) else 1 for dim in input_meta.shape]
        data = np.random.randn(*shape).astype(np.float32)
        inputs[input_meta.name] = data
    return inputs

# Prepare inputs and run inference
dummy_inputs = generate_dummy_inputs(session)
output_names = [output.name for output in session.get_outputs()]
outputs = session.run(output_names, dummy_inputs)

print("Inference outputs:", outputs)


The NeuronExecutionProvider has two important provider options:

  • NEURON_FLAG_USE_FP16: Executes FP32 models in FP16 precision. Note: This flag is mandatory, as the NPU does not support FP32 execution. Your model will fail to run without it.

  • NEURON_FLAG_MIN_GROUP_SIZE: This flag sets the minimum number of nodes for a subgraph to be offloaded to the NPU. For larger models, setting this to a value greater than 1 may improve performance.
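
As a rough illustration of tuning the second flag, the sketch below raises the minimum subgraph size while keeping FP16 enabled; the value of 10 is only an assumed starting point, not a recommended setting, and the session creation mirrors the example above.

# Hypothetical tuning: only offload subgraphs with at least 10 nodes to the NPU
neuron_provider_options = {
    "NEURON_FLAG_USE_FP16": "1",         # Still required: the NPU does not execute FP32
    "NEURON_FLAG_MIN_GROUP_SIZE": "10"   # Illustrative value; profile your model to choose
}
session = ort.InferenceSession(model_path, providers=[
    ("NeuronExecutionProvider", neuron_provider_options),
    "XnnpackExecutionProvider",
    "CPUExecutionProvider",
])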

The NPU does not support dynamic operator shapes, so it is essential to make any dynamic input shapes static before running your model. The ONNX Runtime documentation covers how to make dynamic input shapes fixed.
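
As one rough way to do this in Python, the sketch below pins every symbolic input dimension of a model to 1 using the onnx package; the file names and the fixed size of 1 are assumptions, and the utilities described in the ONNX Runtime documentation (such as onnxruntime.tools.make_dynamic_shape_fixed) are the better-supported route.

import onnx

# Load the model and pin every symbolic (dynamic) input dimension to a fixed value
model = onnx.load("your_model.onnx")             # assumed input file name

for graph_input in model.graph.input:
    for dim in graph_input.type.tensor_type.shape.dim:
        if dim.dim_param:                        # a named dynamic dimension such as "batch"
            dim.dim_value = 1                    # assumed fixed size; pick what your use case needs

onnx.save(model, "your_model_static.onnx")       # assumed output file name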
