
NanoLLM - Optimized LLM Inference

NanoLLM is a lightweight, high-performance library with optimized inference APIs for quantized LLMs, multimodality, speech services, vector databases with RAG, and web frontends like Agent Studio.

It provides APIs similar to HuggingFace's, backed by highly optimized inference libraries and quantization tools:

NanoLLM Reference Documentation
from nano_llm import NanoLLM

model = NanoLLM.from_pretrained(
   "meta-llama/Meta-Llama-3-8B-Instruct",  # HuggingFace repo/model name, or path to HF model checkpoint
   api='mlc',                              # supported APIs are: mlc, awq, hf
   api_token='hf_abc123def',               # HuggingFace API key for authenticated models ($HUGGINGFACE_TOKEN)
   quantization='q4f16_ft'                 # q4f16_ft, q4f16_1, q8f16_0 for MLC, or path to AWQ weights
)

response = model.generate("Once upon a time,", max_new_tokens=128)

for token in response:
   print(token, end='', flush=True)
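
The loop above consumes the response as a stream, printing each token as it arrives. If the complete text is also needed afterwards, the tokens can be accumulated while they are printed. The following is a minimal sketch, assuming only that the response is an iterable of strings, as the loop suggests. `fake_stream` is a hypothetical stand-in so the example runs without a model, and `stream_and_collect` is an illustrative helper, not part of NanoLLM:

```python
from typing import Iterable, Iterator


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streamed response; yields pieces of text."""
    yield from ["there ", "was ", "a ", "llama."]


def stream_and_collect(response: Iterable[str]) -> str:
    """Print each token as it arrives and return the concatenated text."""
    pieces = []
    for token in response:
        print(token, end='', flush=True)
        pieces.append(token)
    print()  # finish the line once the stream is exhausted
    return ''.join(pieces)


text = stream_and_collect(fake_stream())
```

With a real model, the same helper could presumably be given the result of `model.generate(...)` in place of `fake_stream()`.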


To test a chat session with Llama from the command line, install jetson-containers and run NanoLLM like this:

git clone
bash jetson-containers/
jetson-containers run \
  --env HUGGINGFACE_TOKEN=hf_abc123def \
  $(autotag nano_llm) \
    python3 -m --api mlc \
      --model meta-llama/Meta-Llama-3-8B-Instruct \
      --prompt "Can you tell me a joke about llamas?"
jetson-containers run \
  --env HUGGINGFACE_TOKEN=hf_abc123def \
  $(autotag nano_llm) \
    python3 -m

If you haven't already, request access to the Llama models on HuggingFace and substitute your account's API token above.
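
When calling the Python API instead of the command line, the token can be read from the environment rather than hard-coded. The sketch below assumes the `HUGGINGFACE_TOKEN` variable named above; `resolve_hf_token` and `PLACEHOLDER_TOKEN` are hypothetical names introduced for illustration, not part of NanoLLM:

```python
import os
from typing import Mapping, Optional

# The example value used on this page; it is a placeholder, not a real token.
PLACEHOLDER_TOKEN = "hf_abc123def"


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the token from HUGGINGFACE_TOKEN.

    Raises RuntimeError if the variable is unset, empty, or still the placeholder.
    """
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    if not token or token == PLACEHOLDER_TOKEN:
        raise RuntimeError("Set HUGGINGFACE_TOKEN to your own HuggingFace API token")
    return token
```

The returned value could then be passed as `api_token=resolve_hf_token()` to `NanoLLM.from_pretrained`, which keeps the secret out of source files.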


Here's an index of the various tutorials & examples using NanoLLM on Jetson AI Lab:

Benchmarks: Benchmarking results for LLM, SLM, VLM using MLC/TVM backend.
API Examples: Python code examples for chat, completion, and multimodal.
Documentation: Reference documentation for the NanoLLM model and agent APIs.
Llamaspeak: Talk verbally with LLMs using low-latency ASR/TTS speech models.
Small LLM (SLM): Focus on language models with reduced footprint (7B params and below).
Live LLaVA: Realtime live-streaming vision/language models on recurring prompts.
Nano VLM: Efficient multimodal pipeline with one-shot image tagging and RAG support.
Agent Studio: Rapidly design and experiment with creating your own automation agents.