During MediaTek DLA model conversion with `ncc-tflite`, enabling the `--opt-accuracy` flag produces results that closely match CPU execution, while omitting it can lead to significant accuracy deviations.

- What changes does `--opt-accuracy` introduce in the conversion process?
- For which models is it necessary?
- What is the impact on inference performance?
- The `--opt-accuracy` flag in `ncc-tflite` instructs Neuron to apply additional optimization methods that improve inference accuracy, aligning results more closely with CPU execution.
- Without `--opt-accuracy`, the tool prioritizes performance and may apply quantization or operation simplifications whose outputs differ significantly from the FP32 reference, especially when using `--relax-fp32` (which forces FP16 execution on the MDLA, since FP32 is generally unsupported there).
- With `--opt-accuracy`, the compiler compensates for known accuracy issues:
  - Converts `int16` data to `fp16`.
  - Sets `conv2d` layer biases to zero and introduces a channel-wise addition for improved precision.
  - Increases the average-pooling (`avgpool2d`) cascade depth.
  - Other operations may gain extra steps to improve output fidelity.
- These changes may slightly increase model complexity and inference latency, but often yield near-reference accuracy.
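The `conv2d` bias rewrite above can be illustrated with a small numpy sketch. This is a hypothetical demonstration of the math, not Neuron's actual implementation: a convolution with its bias fused is numerically equivalent to the same convolution with a zero bias followed by a channel-wise addition, which is what lets the compiler perform the addition at higher precision than the convolution's accumulator.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, 8, 3)).astype(np.float32)   # NHWC input
w = rng.standard_normal((3, 3, 3, 16)).astype(np.float32)  # HWIO kernel
b = rng.standard_normal(16).astype(np.float32)             # per-channel bias

def conv2d(x, w, bias):
    """Naive valid-padding conv2d, for demonstration only."""
    n, h, wid, cin = x.shape
    kh, kw, _, cout = w.shape
    out = np.zeros((n, h - kh + 1, wid - kw + 1, cout), dtype=np.float32)
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw, :]  # (n, kh, kw, cin)
            out[:, i, j, :] = np.tensordot(patch, w, axes=([1, 2, 3], [0, 1, 2]))
    return out + bias

fused = conv2d(x, w, b)                       # bias fused into the conv
split = conv2d(x, w, np.zeros_like(b)) + b    # zero bias + channel-wise add

print(np.allclose(fused, split))  # the rewrite is numerically equivalent
```

On real hardware the two paths are *not* bit-identical, which is exactly the point: moving the bias out of the quantized accumulator into a separate channel-wise add is what recovers precision.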
- When to use `--opt-accuracy`:
  - Strongly recommended whenever output accuracy is critical to your application, or when initial conversion or testing results differ from the expected CPU outputs.
  - Particularly important for models containing sensitive operations (`conv2d`, `avgpool2d`, or custom ops sensitive to quantization).
- Performance impact:
  - In practical tests (e.g., YOLOv3-tiny on MDLA 3.0), the average per-inference latency difference was minimal (9.19 ms vs. 9.23 ms per run), indicating negligible performance loss for most real-world use cases.
  - Model size or runtime may increase slightly due to the extra steps introduced for accuracy.
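To put the reported YOLOv3-tiny numbers in perspective, the relative overhead works out as follows (a trivial back-of-envelope calculation using only the figures quoted above):

```python
# Reported per-inference latency, with and without --opt-accuracy:
without_flag_ms = 9.19
with_flag_ms = 9.23

overhead = (with_flag_ms - without_flag_ms) / without_flag_ms
print(f"{overhead:.2%}")  # well under 1% added latency
```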
Recommendation:
Always validate results both with and without `--opt-accuracy`. When in doubt, prefer `--opt-accuracy` to ensure alignment with CPU inference behavior, especially for production or customer-facing AI pipelines.
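A minimal sketch of that validation step, assuming you can obtain output tensors from both the CPU (TFLite) reference and the DLA build; the synthetic arrays below stand in for real model outputs, and the tolerance value is an illustrative choice, not a MediaTek-specified threshold:

```python
import numpy as np

def compare_outputs(cpu_out: np.ndarray, dla_out: np.ndarray,
                    atol: float = 1e-2) -> dict:
    """Summarize the deviation between reference and accelerator outputs."""
    diff = np.abs(cpu_out.astype(np.float32) - dla_out.astype(np.float32))
    return {
        "max_abs_err": float(diff.max()),
        "mean_abs_err": float(diff.mean()),
        "within_tol": bool(diff.max() <= atol),
    }

# Synthetic stand-ins for the CPU reference and the DLA result:
rng = np.random.default_rng(1)
cpu = rng.standard_normal((1, 1000)).astype(np.float32)
dla = cpu + rng.normal(scale=1e-3, size=cpu.shape).astype(np.float32)

report = compare_outputs(cpu, dla)
print(report["within_tol"])  # small perturbation stays within tolerance
```

Running this once per build (with and without `--opt-accuracy`) on a few representative inputs gives a quick, quantitative basis for the comparison the recommendation calls for.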