TFLite model gives apusys memImport error in online compilation but works in offline pathway

Hi,

I have a TFLite model with input shape 518x518. When I execute the model with the stable_delegate, I get the apusys memImport error below:

[apusys][info]construct: Cmd v5(0xaaaae002c9c0): total_vlm_size(516096)
INFO: Explicitly applied STABLE_DELEGATE delegate, and the model graph will be partially executed by the delegate w/ 25 delegate kernels.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: The input model file size (MB): 99.1302
INFO: Initialized session in 4630.01ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
[apusys][error]memImport: import mem(89) fail(Cannot allocate memory)
[apusys][error]memImport: import mem(0x0/45045600) handle(89) flags(0) fail(Cannot allocate memory)
[apusys][error]memImport: import memory(89/45045600) fail
ERROR: APUSysEngine::MemImportV1() failed to import handle 89
ERROR: APUSys imports ION buffer fd = 89 failed
ERROR: HintFrontendBuffer() failed on input #0, buffer addr = 0xfffe4fd76000 on DMM = neuron::platforms::apusys::V2_0::APUSysEngine
ERROR: Neuron returned error NEURON_BAD_DATA at line 1506 while associating NNAPI execution input with a memory object.

ERROR: Node number 827 (DELEGATE) failed to invoke.
INFO: count=1 curr=2151260

ERROR: Benchmarking failed.

I created a quick script to figure out the ops producing the largest tensors in my model; the output is below, followed by a rough sketch of this kind of check:

— Top 10 Largest Tensors in model.tflite —
Tensor Index: 230 Size: 42.96 MB Shape: [6, 1370, 1370] Produced by: Op 32 (BATCH_MATMUL)
Tensor Index: 231 Size: 42.96 MB Shape: [6, 1370, 1370] Produced by: Op 33 (SOFTMAX)
Tensor Index: 289 Size: 42.96 MB Shape: [6, 1370, 1370] Produced by: Op 91 (BATCH_MATMUL)
Tensor Index: 290 Size: 42.96 MB Shape: [6, 1370, 1370] Produced by: Op 92 (SOFTMAX)
Tensor Index: 348 Size: 42.96 MB Shape: [6, 1370, 1370] Produced by: Op 150 (BATCH_MATMUL)
Tensor Index: 349 Size: 42.96 MB Shape: [6, 1370, 1370] Produced by: Op 151 (SOFTMAX)
Tensor Index: 407 Size: 42.96 MB Shape: [6, 1370, 1370] Produced by: Op 209 (BATCH_MATMUL)
Tensor Index: 408 Size: 42.96 MB Shape: [6, 1370, 1370] Produced by: Op 210 (SOFTMAX)
Tensor Index: 466 Size: 42.96 MB Shape: [6, 1370, 1370] Produced by: Op 268 (BATCH_MATMUL)
Tensor Index: 467 Size: 42.96 MB Shape: [6, 1370, 1370] Produced by: Op 269 (SOFTMAX)
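
For reference, a similar check can be put together with the Python tf.lite.Interpreter along the lines below (a rough sketch only; mapping each tensor to the op that produces it needs the flatbuffer schema and is omitted here):

import numpy as np
import tensorflow as tf

# List the largest tensors declared in a .tflite file by shape and dtype.
# Sketch only: the producing-op lookup (BATCH_MATMUL, SOFTMAX, ...) requires
# walking the flatbuffer and is not shown.
interpreter = tf.lite.Interpreter(model_path="model.tflite")

sizes = []
for t in interpreter.get_tensor_details():
    n_bytes = int(np.prod(t["shape"])) * np.dtype(t["dtype"]).itemsize
    sizes.append((n_bytes, t["index"], list(t["shape"])))

print("Top 10 Largest Tensors in model.tflite")
for n_bytes, idx, shape in sorted(sizes, reverse=True)[:10]:
    print(f"Tensor Index: {idx}  Size: {n_bytes / 2**20:.2f} MB  Shape: {shape}")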

However, if I use the offline pathway instead, the model compiles to a DLA with ncc-tflite without any issue. With the --show-memory-summary flag, I get the following output for the model:

DRAM Usage:
      Target       Input  Output  Temp  Static  Code  Total
[ 0]: eDMA 3.6     3.1M   0       0     0       64    3.1M
[ 1]: MDLA 5.3     0      1.0M    0     47M     171K  48M

Total Memory:
DRAM: 51M + 50M (Shared) = 102M
L1:   896K
TargetExecutionOrder:
 APUSYS_2_0

Here, the static memory usage is again 47 MB, and the model can be compiled to a DLA easily.
So why do we end up with the memImport error when using online compilation? Does this error occur at the compilation stage or during execution, and what are the NPU memory parameter limits for the G720?

Hi Suyash,

Thank you for this detailed report. Here are some pointers to help investigate the issue.

Root Cause

  • The error memImport: Cannot allocate memory occurs when the APUSys engine attempts to import a large buffer from user space into the NPU domain during execution.
  • In the online pathway, the stable delegate maps selected tensors to APUSys. Very large, contiguous buffers—like the ~43 MB attention-like outputs you observed—can exceed the available contiguous importable memory at that moment, leading to memImport failure.
  • This is a run-time error, not an offline compile-time problem. Offline compilation happens on your host PC and does not exercise device-side buffer import or allocation pressure.

Online vs. Offline Pathway Behavior

  • Online compilation (TFLite + stable delegate):
    • Performs graph partitioning and operator delegation on-device.
    • Must import and manage large runtime buffers on the device; memory fragmentation and contiguous allocation constraints can trigger failures.
  • Offline compilation (ncc-tflite → DLA):
    • Compiles the model on host PC; device-side compile memory is not involved.
    • At run-time, the DLA binary may use different buffer layouts and scheduling, often reducing peak import pressure compared with online delegation.
    • Note: Genio-720 (MDLA5.3) requires compiling for EDMA, not EDPA. When using the offline pathway, ensure --edma=3.6 is specified.

Is it compilation or execution failure?

  • It is an execution-time buffer import failure in the online pathway. The delegate has already created a plan (“25 delegate kernels” in your log); the failure occurs when importing a large input/output/intermediate buffer into APUSys just before or during execution.

Practical Mitigations

  1. Prefer the offline pathway for deployment

    • Compile on host using:
      • ncc-tflite --arch=mdla5.3 --edma=3.6 <model.tflite> -o <model.dla> --relax-fp32 --opt=3
    • Run on device with neuronrt and validate performance and stability.
  2. Reduce peak tensor sizes

    • Lower the input resolution (e.g., from 518x518 down to 512x512 or 448x448) to shrink the attention-like intermediates (see the sizing sketch after this list).
  3. Quantize the model

    • Use PTQ or QAT to convert heavy FP32 intermediates to INT8 wherever supported; the smaller element size reduces the buffer footprint (a PTQ sketch follows this list).
  4. Eliminate or tile large BATCH_MATMUL-like ops

    • Where feasible, restructure or tile large matrix multiplications to reduce contiguous buffer requirements. This often requires model-level changes.
  5. Validate with Android Neuron Delegate

    • For NP8 devices, test the same model using the Neuron Delegate on Genio-720 Android images to check if the runtime handles memory differently from the Yocto stable delegate.
  6. Baseline the model without NPU delegation

    • Temporarily disable the stable delegate to confirm CPU-only (XNNPACK) execution is stable. This isolates NPU memory-import issues from general model correctness (see the CPU baseline sketch below).
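
Regarding mitigation 2, a back-of-the-envelope sizing sketch of why lowering the resolution shrinks the ~43 MB buffers. It assumes a ViT-style backbone with 14-pixel patches and 6 attention heads, inferred from the [6, 1370, 1370] tensor shapes and the 45045600-byte buffer in the memImport error; treat both as assumptions:

# Per-layer attention map: heads x tokens x tokens x 4 bytes (FP32),
# where tokens = (resolution // patch)**2 + 1 (patch grid plus one class token).
def attn_bytes(resolution, patch=14, heads=6, bytes_per_elem=4):
    tokens = (resolution // patch) ** 2 + 1
    return heads * tokens * tokens * bytes_per_elem

for res in (518, 512, 448):
    print(res, round(attn_bytes(res) / 2**20, 1), "MB")
# 518 -> 43.0 MB (exactly the 45045600 bytes the import fails on)
# 512 -> 38.5 MB, 448 -> 24.0 MB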
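
For mitigation 3, a minimal post-training quantization sketch using the standard TFLiteConverter API. The SavedModel directory, the (1, 518, 518, 3) input layout, and the random calibration generator are placeholders; calibration should use real preprocessed samples:

import numpy as np
import tensorflow as tf

def representative_dataset():
    # Placeholder calibration data; replace with ~100 real preprocessed inputs.
    for _ in range(100):
        yield [np.random.rand(1, 518, 518, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())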
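
For mitigation 6, a minimal CPU-only baseline with the Python tf.lite.Interpreter; since no external delegate is passed, TFLite falls back to its default CPU path (XNNPACK in recent builds). The model path and the random input are placeholders:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite", num_threads=4)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
# Random data only confirms the graph invokes without the NPU; use real inputs for accuracy checks.
interpreter.set_tensor(inp["index"], np.random.rand(*inp["shape"]).astype(inp["dtype"]))
interpreter.invoke()

out = interpreter.get_output_details()[0]
print("output shape:", interpreter.get_tensor(out["index"]).shape)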

Notes on “NPU memory limits” on Genio-720

  • Effective device-side memory for APUSys imports depends on the image build, reserved pools, and current fragmentation, not just total DRAM size.
  • Large contiguous imports can fail even when total free memory is sufficient. This is expected behavior under heavy fragmentation.