How to resolve DLA permission issues when packaging Qwen2.5 LLM into a standalone APK

Dear All:

I have quantized and compiled Qwen2.5-3B, which took around 60 hours. The quantization configuration is sym8W_sym16A, combined with Hessian weight optimization and cumulative layer error. The target shapes are 128t2048c for prompts and 1t2048c for generation. The DLA was generated successfully and verified to run properly under ADB.

The current bottleneck is permissions. At present, I rely on adb root and setenforce 0 in the ADB shell to get around the permission restrictions. I have two questions:

If I package the project into a standalone APK, how can I fundamentally fix these permission issues?

Furthermore, for a general-purpose APK integrated with this model, what solutions can meet app store listing requirements and run reliably across all MTK ecosystem devices? Are any alternative technical workarounds required?

Fellow developers are welcome to join the discussion if you have any thoughts or questions.

Hi Jiahao_Zhao,

Thanks for reaching out!

For the permission issue, you can use the U-SDK (Neuron Adapter API) to load and execute the DLA file. This is the standard approach for loading DLA in a regular Android App environment — no root permission required.

Here is a breakdown of the APIs you can directly adopt in your project:

1. Load the precompiled DLA (Extension mechanism)

  • NeuronModel_create
  • NeuronModel_getExtensionOperandType
  • NeuronModel_getExtensionOperationType
  • NeuronModel_addOperand / NeuronModel_setOperandValue
  • NeuronModel_addOperation
  • NeuronModel_identifyInputsAndOutputs / NeuronModel_finish

2. Compilation & execution instance

  • NeuronCompilation_create / NeuronCompilation_finish
  • NeuronExecution_create

3. Zero-copy I/O (AHardwareBuffer)

  • AHardwareBuffer_allocate / AHardwareBuffer_lock / AHardwareBuffer_unlock
  • NeuronMemory_createFromAHardwareBuffer
  • NeuronExecution_setInputFromMemory / NeuronExecution_setOutputFromMemory

4. Inference & resource release

  • NeuronExecution_compute
  • NeuronExecution_free / NeuronMemory_free / NeuronCompilation_free / NeuronModel_free
  • AHardwareBuffer_release

Best,
Jun

Hi, Jun:

Could you please provide any Android app development examples or reference documents for DLA usage?

Thank you very much!

Regards

Hi Jiahao_Zhao,

Below is a step-by-step guide on how to load a DLA and perform inference using the Neuron Adapter API. Each step is accompanied by the corresponding code snippet for your reference.

Note: The code snippets below are simplified examples for clarity, focusing on the core flow of loading a DLA and running inference. Please adapt the code to your project’s needs.


Parameter Definitions (recommended to place at the top of your file)

#define DLA_INPUT_BIN_SIZE                   376832
#define DLA_OUTPUT_BIN_SIZE                  1536000
#define RESTORE_DLA_EXTENSION_OPERAND_TYPE   0x0100
#define RESTORE_DLA_EXTENSION_OPERATION_TYPE 0x0000
#define RESTORE_DLA_EXTENSION_NAME           "com.mediatek.compiled_network"

| Macro | Description |
|---|---|
| DLA_INPUT_BIN_SIZE | Byte size of the model input tensor; please adjust according to your DLA spec |
| DLA_OUTPUT_BIN_SIZE | Byte size of the model output tensor; please adjust according to your DLA spec |
| RESTORE_DLA_EXTENSION_OPERAND_TYPE | Operand type ID of the MTK extension “DLA Raw Data” (fixed value) |
| RESTORE_DLA_EXTENSION_OPERATION_TYPE | Operation type ID of the MTK extension “Load DLA” (fixed value) |
| RESTORE_DLA_EXTENSION_NAME | Extension name string (fixed value) |
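
The two size macros above are byte counts, so they have to be recomputed for every DLA. As a rough illustration of that arithmetic, the sketch below multiplies the dimensions of a dense tensor by its element width. The helper name `tensorByteSize` is hypothetical and not part of the Neuron Adapter API, and the shape in the usage note is only an assumed example; the real I/O shapes and element types must come from your own DLA's spec.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>

// Hypothetical helper (not part of the Neuron Adapter API): number of bytes
// needed for a dense tensor with the given dimensions and element width.
inline std::size_t tensorByteSize(std::initializer_list<std::uint32_t> dims,
                                  std::size_t bytesPerElement) {
    assert(bytesPerElement > 0);
    std::size_t total = bytesPerElement;
    for (std::uint32_t d : dims) {
        total *= d;  // multiply in each dimension
    }
    return total;
}
```

For example, an assumed tensor of shape [1, 128, 2048] holding 16-bit values would need `tensorByteSize({1, 128, 2048}, 2)`, which is 524288 bytes. A value computed this way could replace the literal in DLA_INPUT_BIN_SIZE once the actual shape is known.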

Step 1. Read the DLA file into memory

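#include <cstdlib>   // malloc, free
#include <fstream>   // std::ifstream

// Open the DLA in binary mode and measure its size
std::ifstream input_dla(dla_path, std::ios::binary);
input_dla.seekg(0, input_dla.end);
const size_t length = static_cast<size_t>(input_dla.tellg());
input_dla.seekg(0, input_dla.beg);
// Read the whole file; the buffer is released with free() in Step 9
char* buffer = static_cast<char*>(malloc(length));
input_dla.read(buffer, length);
input_dla.close();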

For APK deployment, please change dla_path to the app’s private directory (corresponding to Context.getFilesDir() in Java/Kotlin), so that the file can be accessed without root permission.
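
The simplified snippet in Step 1 does not check whether the file could be opened or fully read, which is the first thing that tends to go wrong once the DLA moves into the app's private directory. A minimal sketch of a checked variant follows; `readFileBytes` is a hypothetical helper name, and it returns a std::vector instead of a malloc'd buffer, so the later `free(buffer)` would no longer apply if you adopt it.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file into memory, throwing on any failure.
inline std::vector<char> readFileBytes(const std::string& path) {
    // std::ios::ate opens the stream positioned at the end, so tellg() is the size
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    const std::streamoff size = in.tellg();
    if (size <= 0) {
        throw std::runtime_error("empty or unreadable file: " + path);
    }
    in.seekg(0, std::ios::beg);
    std::vector<char> bytes(static_cast<std::size_t>(size));
    if (!in.read(bytes.data(), static_cast<std::streamsize>(size))) {
        throw std::runtime_error("short read: " + path);
    }
    return bytes;
}
```

With this variant, Step 6 would pass `bytes.data()` and `bytes.size()` to NeuronModel_setOperandValue, and the vector would need to stay alive until the compilation in Step 9 has finished.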

Step 2. Create a NeuronModel

NeuronModel* model = nullptr;
NeuronModel_create(&model);

Step 3. Define the Input / Output tensor types

// Input
NeuronOperandType tensorInputType;
tensorInputType.type = NEURON_TENSOR_QUANT8_ASYMM;
tensorInputType.scale = 1.0f;
tensorInputType.zeroPoint = 0;
tensorInputType.dimensionCount = 1;
uint32_t dims_input[1] = {DLA_INPUT_BIN_SIZE};
tensorInputType.dimensions = dims_input;

// Output
NeuronOperandType tensorOutputType;
tensorOutputType.type = NEURON_TENSOR_QUANT8_ASYMM;
tensorOutputType.scale = 1.0f;
tensorOutputType.zeroPoint = 0;
tensorOutputType.dimensionCount = 1;
uint32_t dims_output[1] = {DLA_OUTPUT_BIN_SIZE};
tensorOutputType.dimensions = dims_output;

Step 4. Get the DLA extension operand type

int32_t operandType = 0;
NeuronModel_getExtensionOperandType(model,
    RESTORE_DLA_EXTENSION_NAME,
    RESTORE_DLA_EXTENSION_OPERAND_TYPE,
    &operandType);

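// Describe the extension operand that carries the raw DLA bytes
NeuronOperandType extenOperandType;
extenOperandType.type = operandType;
extenOperandType.scale = 0.0f;
extenOperandType.zeroPoint = 0;
extenOperandType.dimensionCount = 0;
extenOperandType.dimensions = nullptr;  // no dimensions; avoids leaving the pointer uninitialized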

Step 5. Add operands to the model

NeuronModel_addOperand(model, &tensorInputType);   // 0: model input 1
NeuronModel_addOperand(model, &tensorInputType);   // 1: model input 2
NeuronModel_addOperand(model, &extenOperandType);  // 2: DLA Raw Data
NeuronModel_addOperand(model, &tensorOutputType);  // 3: model output

| Index | Role |
|---|---|
| 0 | Input tensor 1 |
| 1 | Input tensor 2 |
| 2 | DLA raw data (extension operand) |
| 3 | Output tensor |

Step 6. Feed the DLA buffer into the model

NeuronModel_setOperandValue(model, 2, buffer, length);

This is the key step for “loading the precompiled DLA”.

Step 7. Get the DLA extension operation type and add the operation

int32_t operationType = 0;
NeuronModel_getExtensionOperationType(model,
    RESTORE_DLA_EXTENSION_NAME,
    RESTORE_DLA_EXTENSION_OPERATION_TYPE,
    &operationType);

uint32_t addInputIndexes[3] = {0, 1, 2};
uint32_t addOutputIndexes[1] = {3};
NeuronModel_addOperation(model,
    static_cast<NeuronOperationType>(operationType),
    3, addInputIndexes,
    1, addOutputIndexes);

Step 8. Identify the model’s inputs/outputs and finish the model

uint32_t modelInputIndexes[2] = {0, 1};
uint32_t modelOutputIndexes[1] = {3};
NeuronModel_identifyInputsAndOutputs(model, 2, modelInputIndexes, 1, modelOutputIndexes);
NeuronModel_finish(model);

Step 9. Create and finish the Compilation

NeuronCompilation* compilation = nullptr;
NeuronCompilation_create(model, &compilation);
NeuronCompilation_finish(compilation);
free(buffer);   // The DLA buffer can be released after compilation is finished
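
The snippets in Steps 2 to 9 ignore the int status that each Neuron Adapter call returns, so a failure (for example, a DLA compiled for a different platform) would only surface later as a crash. One possible way to catch it early is a small checking helper like the sketch below. `checkNeuron` and `kNoError` are hypothetical names, and the success value of 0 is an assumption that should be confirmed against NEURON_NO_ERROR in your NeuronAdapter.h.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Assumption: 0 is the success code (NEURON_NO_ERROR in NeuronAdapter.h).
constexpr int kNoError = 0;

// Hypothetical helper, not part of the Neuron Adapter API:
// throws if a call reported anything other than success.
inline void checkNeuron(int status, const char* what) {
    if (status != kNoError) {
        throw std::runtime_error(std::string(what) + " failed with status " +
                                 std::to_string(status));
    }
}
```

Usage would look like `checkNeuron(NeuronModel_create(&model), "NeuronModel_create");` and likewise for NeuronModel_finish, NeuronCompilation_finish, and NeuronExecution_compute. In an app, the exception could be translated into a fallback path for devices where the DLA cannot be loaded.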

Step 10. Create the Execution

NeuronExecution* run = nullptr;
NeuronExecution_create(compilation, &run);

Step 11. Set the input via AHardwareBuffer

const uint32_t input_size = DLA_INPUT_BIN_SIZE;
const auto usage = AHARDWAREBUFFER_USAGE_CPU_READ_OFTEN
                 | AHARDWAREBUFFER_USAGE_CPU_WRITE_OFTEN;

AHardwareBuffer_Desc iDesc{
    .width = input_size,
    .height = 1,
    .layers = 1,
    .format = AHARDWAREBUFFER_FORMAT_BLOB,
    .usage = usage,
    .stride = input_size,
};

AHardwareBuffer* inputAhwb = nullptr;
AHardwareBuffer_allocate(&iDesc, &inputAhwb);

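// input_buffer is your prepared input data: at least input_size bytes,
// allocated with malloc() so that Step 15 can release it with free()
void* inputBuffer = nullptr;
AHardwareBuffer_lock(inputAhwb, usage, -1, nullptr, &inputBuffer);
memcpy(inputBuffer, input_buffer, input_size);
AHardwareBuffer_unlock(inputAhwb, nullptr);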

NeuronMemory* iMemory = nullptr;
NeuronMemory_createFromAHardwareBuffer(inputAhwb, &iMemory);
// Note: this simplified example binds both model inputs (index 0 and 1)
// to the same memory region
NeuronExecution_setInputFromMemory(run, 0, nullptr, iMemory, 0, input_size);
NeuronExecution_setInputFromMemory(run, 1, nullptr, iMemory, 0, input_size);

Step 12. Set the output via AHardwareBuffer

const uint32_t output_size = DLA_OUTPUT_BIN_SIZE;

AHardwareBuffer_Desc oDesc{
    .width = output_size,
    .height = 1,
    .layers = 1,
    .format = AHARDWAREBUFFER_FORMAT_BLOB,
    .usage = usage,
    .stride = output_size,
};

AHardwareBuffer* outputAhwb = nullptr;
AHardwareBuffer_allocate(&oDesc, &outputAhwb);

NeuronMemory* oMemory = nullptr;
NeuronMemory_createFromAHardwareBuffer(outputAhwb, &oMemory);
NeuronExecution_setOutputFromMemory(run, 0, nullptr, oMemory, 0, output_size);

The output buffer does not need to be locked or written in advance; simply read it after inference completes.

Step 13. Run inference

NeuronExecution_compute(run);

Step 14. Read the output result

void* outputBuffer = nullptr;
AHardwareBuffer_lock(outputAhwb, usage, -1, nullptr, &outputBuffer);
// TODO: Read the inference result from outputBuffer and pass it to the downstream pipeline
AHardwareBuffer_unlock(outputAhwb, nullptr);
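
The pointer obtained from AHardwareBuffer_lock is only valid until the matching unlock, so the TODO in Step 14 usually amounts to copying the bytes out while the buffer is still locked. A minimal sketch of that copy, using a hypothetical helper named `copyOutput`, could look like this. It treats the output as raw bytes, and how those bytes are interpreted depends on your DLA's output spec.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: snapshot `size` bytes from a mapped (locked) buffer
// so the data stays usable after the buffer is unlocked.
inline std::vector<std::uint8_t> copyOutput(const void* mapped, std::size_t size) {
    assert(mapped != nullptr);
    std::vector<std::uint8_t> result(size);
    if (size > 0) {
        std::memcpy(result.data(), mapped, size);
    }
    return result;
}
```

In Step 14 this would be called as `auto result = copyOutput(outputBuffer, output_size);` between the lock and unlock calls.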

Step 15. Release resources

AHardwareBuffer_release(inputAhwb);
AHardwareBuffer_release(outputAhwb);
NeuronExecution_free(run);
NeuronMemory_free(iMemory);
NeuronMemory_free(oMemory);
free(input_buffer);
NeuronCompilation_free(compilation);
NeuronModel_free(model);
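
The release calls in Step 15 need to run on every exit path, including early returns after a failed call. One way to get that without repeating the cleanup is to wrap each handle in a std::unique_ptr with the matching free function as its deleter. The sketch below is only an illustration: `Handle` and `adopt` are hypothetical names, and it assumes the free functions have the form `void f(T*)`, as the calls in Step 15 suggest.

```cpp
#include <cassert>
#include <memory>

// Hypothetical alias: an owning handle that calls a C-style free function
// of the form void f(T*) when it goes out of scope.
template <typename T>
using Handle = std::unique_ptr<T, void (*)(T*)>;

// Hypothetical helper: take ownership of a raw handle together with the
// function that releases it. A null handle is never passed to freeFn.
template <typename T>
Handle<T> adopt(T* raw, void (*freeFn)(T*)) {
    return Handle<T>(raw, freeFn);
}
```

Usage would be along the lines of `auto modelGuard = adopt(model, NeuronModel_free);` right after NeuronModel_create, followed by similar guards for the compilation, execution, memory objects, and hardware buffers. Local guards are destroyed in reverse order of declaration, so declaring them in creation order frees the most recently created object first. If guards like these are used, the manual calls in Step 15 should be removed to avoid releasing a handle twice.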

Best,
Jun