NanoLLM - Optimized LLM Inference
NanoLLM is a lightweight, high-performance library that uses optimized inference APIs for quantized LLMs, multimodality, speech services, vector databases with RAG, and web frontends like Agent Studio.
It provides APIs similar to HuggingFace's, backed by highly optimized inference libraries and quantization tools:
from nano_llm import NanoLLM

model = NanoLLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # HuggingFace repo/model name, or path to HF model checkpoint
    api='mlc',                              # supported APIs are: mlc, awq, hf
    api_token='hf_abc123def',               # HuggingFace API key for authenticated models ($HUGGINGFACE_TOKEN)
    quantization='q4f16_ft'                 # q4f16_ft, q4f16_1, q8f16_0 for MLC, or path to AWQ weights
)

response = model.generate("Once upon a time,", max_new_tokens=128)

# print each token as soon as it is generated
for token in response:
    print(token, end='', flush=True)
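The example above hard-codes the API token and only prints the streamed tokens. As a minimal sketch, the two helpers below read the token from the `HUGGINGFACE_TOKEN` environment variable and collect the streamed tokens into a single string. Both `load_kwargs` and `collect_stream` are hypothetical names introduced here for illustration and are not part of NanoLLM. The sketch assumes that `generate()` yields printable tokens, as in the loop above. Whether `from_pretrained()` accepts a call without `api_token` is also an assumption.

```python
import os


def load_kwargs(api="mlc", quantization="q4f16_ft", env=None):
    """Build keyword arguments for NanoLLM.from_pretrained() (hypothetical helper)."""
    env = os.environ if env is None else env
    kwargs = {"api": api, "quantization": quantization}
    token = env.get("HUGGINGFACE_TOKEN")
    if token:  # only pass a token when one is actually set
        kwargs["api_token"] = token
    return kwargs


def collect_stream(tokens, echo=True):
    """Print tokens as they arrive and return the full text (hypothetical helper)."""
    pieces = []
    for token in tokens:
        if echo:
            print(token, end="", flush=True)
        pieces.append(str(token))
    return "".join(pieces)


# Intended usage where nano_llm is installed (left commented out here):
# from nano_llm import NanoLLM
# model = NanoLLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", **load_kwargs())
# text = collect_stream(model.generate("Once upon a time,", max_new_tokens=128))
```

Keeping the token in the environment avoids committing it to source control, and returning the joined text makes the response easy to log or post-process after streaming.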
Containers
To test a chat session with Llama from the command line, install jetson-containers and run NanoLLM like this:
# install jetson-containers
git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh

# chat with Llama from the command line
jetson-containers run \
  --env HUGGINGFACE_TOKEN=hf_abc123def \
  $(autotag nano_llm) \
  python3 -m nano_llm.chat --api mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --prompt "Can you tell me a joke about llamas?"

# launch the Agent Studio web frontend
jetson-containers run \
  --env HUGGINGFACE_TOKEN=hf_abc123def \
  $(autotag nano_llm) \
  python3 -m nano_llm.studio
If you haven't already, request access to the Llama models on HuggingFace and substitute your account's API token above.
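To script the chat command for several prompts, one option is to assemble the shell command in Python. The following is only a sketch: `chat_command` is a hypothetical helper, and it assumes that `HUGGINGFACE_TOKEN` is already exported in the host shell so that the token never appears in the script. The flags are the ones used in the command above. The resulting string has to be run by a shell, because `$(autotag nano_llm)` and `$HUGGINGFACE_TOKEN` rely on shell expansion.

```python
import shlex


def chat_command(prompt, model="meta-llama/Meta-Llama-3-8B-Instruct", api="mlc"):
    """Return a shell command string for a one-off nano_llm.chat run (hypothetical helper)."""
    parts = [
        "jetson-containers run",
        '--env HUGGINGFACE_TOKEN="$HUGGINGFACE_TOKEN"',  # expanded by the host shell
        "$(autotag nano_llm)",                           # expanded by the host shell
        "python3 -m nano_llm.chat",
        f"--api {shlex.quote(api)}",
        f"--model {shlex.quote(model)}",
        f"--prompt {shlex.quote(prompt)}",               # quoting protects spaces and quotes
    ]
    return " ".join(parts)


# Intended usage on a Jetson with jetson-containers installed (left commented out here):
# import subprocess
# subprocess.run(chat_command("Can you tell me a joke about llamas?"), shell=True, check=True)
```

Quoting the prompt with `shlex.quote` keeps prompts that contain spaces, question marks, or apostrophes intact when the command is passed to the shell.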
Resources
Here's an index of the various tutorials & examples using NanoLLM on Jetson AI Lab:
Benchmarks | Benchmarking results for LLM, SLM, VLM using MLC/TVM backend. |
API Examples | Python code examples for chat, completion, and multimodal. |
Documentation | Reference documentation for the NanoLLM model and agent APIs. |
Llamaspeak | Talk verbally with LLMs using low-latency ASR/TTS speech models. |
Small LLM (SLM) | Focus on language models with reduced footprint (7B params and below). |
Live LLaVA | Realtime live-streaming vision/language models on recurring prompts. |
Nano VLM | Efficient multimodal pipeline with one-shot image tagging and RAG support. |
Agent Studio | Rapidly design and experiment with creating your own automation agents. |