I’ve been trying to deploy Qwen2.5-0.5B onto the Genio-720 with Android 15, but there is only limited documentation and few posts (except this one) about how to do this, so I have not been able to get it working so far.
The following is my system configuration, along with my questions and issues. I hope you can provide some hints.
[Development PC]
OS: ubuntu_20.04
Python: 3.8.10
Pip: 25.0.1
Other python packages:
numpy 1.24.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py3 7.352.0
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.9.86
nvidia-nvtx-cu12 12.1.105
Demo: GAI-Deployment-Toolkit-v2.0.8_qwen2.5-0.5b-1.5b-7b-v0.1.tar.gz
Essential Hugging Face Model:
$ hf download Qwen/Qwen2.5-0.5B-Instruct --revision main --local-dir Qwen2.5-0.5B-Instruct
Neuron SDK Version:
neuropilot-sdk-basic-8.0.11-build20260211 (in use)
neuropilot-sdk-basic-8.0.7 (failed due to this error)
LLM SDK:
mtk_llm_sdk in mtk_llm_sdk_v2.8.2.zip (in use)
mtk_llm_sdk_v2.7.5 in GAI-Deployment-Toolkit-v2.0.8_qwen2.5-0.5b-1.5b-7b-v0.1.tar.gz
LLM converter and quantizer:
neuropilot-sdk-basic-8.0.11-build20260211/offline_tool
mtk_converter 8.16.0
mtk_llm_sdk 2.8.2
mtk-quantization 8.2.1
Android NDK: android-ndk-r27d
[Target Device]
Platform: Genio-720
OS:
Android 15 (aiot8391p2_64_bsp_alps-vf-mp-u0.mp5-mp-v0.mp5-V7.48.V4.47_user_raw)
libneuron_runtime:
20250423_Neuron_SDK_v1.2517.03_neuron-8.0-release/mt8189/lib
/vendor/lib64/ (due to security restrictions, apps running under the user shell are not able to load libs in this directory)
[Inference]
[compile]
#$ ./compile_generative.sh
$ ./compile_prompt_qwen2.5_0.5B_7B.sh
../post_training_quantize/tflite/Qwen2.5-0.5B-Instruct_asym4W_sym16A_Overall_hessian_wgt_opt_cum_layer_error_rotate_ortho_0_128t2048c/Qwen2.5-0.5B-Instruct_asym4W_sym16A_Overall_hessian_wgt_opt_cum_layer_error_rotate_ortho_0_24layer_128t2048c_0.tflite
~/workspace/MTK/neuropilot-sdk-basic-8.0.11-build20260211/neuron_sdk$ ./compile_generative.sh
../post_training_quantize/tflite/Qwen2.5-0.5B-Instruct_asym4W_sym16A_Overall_hessian_wgt_opt_cum_layer_error_rotate_ortho_0_1t2048c/Qwen2.5-0.5B-Instruct_asym4W_sym16A_Overall_hessian_wgt_opt_cum_layer_error_rotate_ortho_0_24layer_1t2048c_0.tflite
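As far as I can tell, the `<N>t<M>c` fragment in these filenames encodes the token batch size and cache length (128 tokens per step for prompt prefill, 1 token per step for generation, with a 2048-entry cache), which should line up with `promptTokenBatchSize` and `cacheSize` in the YAML config. This is my inference from the names, not documented behavior; a small sketch with a hypothetical parsing helper:

```python
import re

# Hypothetical helper (not part of the SDK): parse the "<N>t<M>c" fragment
# that the toolkit appears to embed in its tflite/dla filenames.
def parse_shape_tag(filename: str) -> dict:
    m = re.search(r"_(\d+)t(\d+)c", filename)
    if m is None:
        raise ValueError(f"no <N>t<M>c tag in {filename!r}")
    return {"token_batch": int(m.group(1)), "cache_size": int(m.group(2))}

prompt_tflite = ("Qwen2.5-0.5B-Instruct_asym4W_sym16A_Overall_hessian_wgt_opt"
                 "_cum_layer_error_rotate_ortho_0_24layer_128t2048c_0.tflite")
gen_tflite = ("Qwen2.5-0.5B-Instruct_asym4W_sym16A_Overall_hessian_wgt_opt"
              "_cum_layer_error_rotate_ortho_0_24layer_1t2048c_0.tflite")

print(parse_shape_tag(prompt_tflite))  # {'token_batch': 128, 'cache_size': 2048}
print(parse_shape_tag(gen_tflite))     # {'token_batch': 1, 'cache_size': 2048}
```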
~/workspace/MTK/neuropilot-sdk-basic-8.0.11-build20260211/neuron_sdk
[inference]
$ python3 ./scripts/prepare_huggingface_tokenizer.py -o ./assets/qwen_vocab ../Qwen2.5-0.5B-Instruct/tokenizer.json
Exported ‘vocab.txt’ from ‘tokenizer.json’
Exported ‘merges.txt’ from ‘tokenizer.json’
Exported ‘added_tokens.yaml’ from ‘tokenizer.json’
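My understanding is that this script just splits the single Hugging Face tokenizer.json into the three files the runtime loads. A self-contained sketch of that split, based on the standard tokenizer.json layout for a BPE model, with a toy inline example (the real script's behavior may differ):

```python
import json

# Toy stand-in for a Hugging Face tokenizer.json (BPE model), inlined so
# the sketch is self-contained; a real Qwen2.5 tokenizer.json has the
# same top-level structure but ~150k vocab entries.
tokenizer_json = json.loads("""
{
  "added_tokens": [
    {"id": 151644, "content": "<|im_start|>"},
    {"id": 151645, "content": "<|im_end|>"}
  ],
  "model": {
    "type": "BPE",
    "vocab": {"h": 0, "i": 1, "hi": 2},
    "merges": ["h i"]
  }
}
""")

# vocab.txt: one token per line, ordered by token id.
vocab = sorted(tokenizer_json["model"]["vocab"].items(), key=lambda kv: kv[1])
vocab_lines = [tok for tok, _ in vocab]

# merges.txt: one BPE merge rule per line.
merge_lines = tokenizer_json["model"]["merges"]

# added_tokens.yaml: id -> content mapping for the special tokens.
added = {t["id"]: t["content"] for t in tokenizer_json["added_tokens"]}

print(vocab_lines)    # ['h', 'i', 'hi']
print(merge_lines)    # ['h i']
print(added[151645])  # <|im_end|>
```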
[Running demo app]
[config_qwen2.5_0.5b_instruct.yaml]
modelOptions:
#Architecture
promptTokenBatchSize: 128
cacheSize: 2048
hiddenSize: 896
numHead: 14
numLayer: 24
maxTokenLength: 32768
#Types
modelInputType: INT16
modelOutputType: INT16
cacheType: INT16
maskType: INT16
rotEmbType: INT16
rotEmbBase: 1000000.0runtimeOptions:
specialTokens:
bosId: 151644
eosId: 151645
addBos: false # Default to false if not set
stopToken: # Either a list or a value. Default to eosId if left blank
tokenizerPath:
- /data/local/tmp/llm_sdk/assets/qwen_vocab/merges.txt
- /data/local/tmp/llm_sdk/assets/qwen_vocab/vocab.txt
- /data/local/tmp/llm_sdk/assets/qwen_vocab/added_tokens.yaml
tokenEmbPath: /data/local/tmp/llm_sdk/assets/dla_qwen/embedding_Qwen2.5-0.5B-Instruct_int16.bin
dlaPromptPaths:
- /data/local/tmp/llm_sdk/assets/dla_qwen/Qwen2.5-0.5B-Instruct/Qwen2.5-0.5B-Instruct_asym4W_sym16A_Overall_hessian_24layer_128t2048c_0_shared.dla
dlaGenPaths:
- /data/local/tmp/llm_sdk/assets/dla_qwen/Qwen2.5-0.5B-Instruct/Qwen2.5-0.5B-Instruct_asym4W_sym16A_Overall_hessian_24layer_1t2048c_0_shared.dla
#sharedWeightsPaths:
#- /data/local/tmp/llm_sdk/assets/dla_qwen/Qwen2.5-0.5B-Instruct/shared_weights_2048c_0.bin
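For anyone checking their own config against mine: the architecture numbers above match the published Qwen2.5-0.5B config (24 layers, hidden size 896, 14 attention heads), and the special-token ids 151644/151645 are Qwen2.5's <|im_start|>/<|im_end|> chat markers. A quick sanity check:

```python
# Consistency check of the modelOptions values against the published
# Qwen2.5-0.5B architecture.
hidden_size = 896
num_head = 14
num_layer = 24

# hiddenSize must divide evenly across the attention heads.
assert hidden_size % num_head == 0
head_dim = hidden_size // num_head
print(head_dim)  # 64: each attention head is 64-dimensional

# The "24layer" fragment in the DLA filenames should agree with numLayer.
assert f"{num_layer}layer" == "24layer"

# Special-token ids in runtimeOptions map to Qwen2.5's chat markers:
#   151644 -> <|im_start|> (used here as bosId)
#   151645 -> <|im_end|>   (used here as eosId / default stop token)
```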
Questions about config_qwen2.5_0.5b_instruct.yaml
- How do I prepare/generate the file(s) referenced by sharedWeightsPaths?

