
TensorRT-LLM for Jetson

TensorRT-LLM is a high-performance LLM inference library with advanced quantization, attention kernels, and paged KV caching. Initial support for building TensorRT-LLM from source on JetPack 6.1 for Jetson AGX Orin has been included in the v0.12.0-jetson branch of the TensorRT-LLM repo.

Along with this guide, we've provided pre-compiled TensorRT-LLM wheels and containers for deploying TensorRT-LLM on Jetson Orin.

What you need

  1. One of the following Jetson devices:

    Jetson AGX Orin (support for other Orin devices is currently undergoing testing)

  2. Running one of the following versions of JetPack:

    JetPack 6.1 (L4T r36.4)

  3. Sufficient storage space (preferably with NVMe SSD).

    • 18.5 GB for the tensorrt_llm container image
    • Space for models (>10 GB)
  4. Clone and set up jetson-containers:

    git clone https://github.com/dusty-nv/jetson-containers
    bash jetson-containers/install.sh
    

Building TensorRT-LLM Engine for Llama

You can find the steps for converting Llama to TensorRT-LLM under examples/llama in the repo, and also in the documentation. The script below automates the process for Llama-7B with INT4 quantization applied, and runs some generation and performance checks on the model:

jetson-containers run \
  -e HUGGINGFACE_TOKEN=YOUR_API_KEY \
  -e FORCE_BUILD=on \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
    /opt/TensorRT-LLM/llama.sh

There are many such conversion procedures outlined in the TensorRT-LLM examples for different model architectures.
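Once an engine has been built, you can also load it directly from Python with TensorRT-LLM's runtime API (run this inside the tensorrt_llm container, since that's where the wheel is installed). The following is a minimal sketch based on the patterns in examples/run.py; the engine path and tokenizer source are assumptions that you should adjust to match your build:

import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Engine directory produced by llama.sh above (assumption - adjust to your build)
engine_dir = "/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq"
# Matching Hugging Face tokenizer (assumption - requires access to the gated Llama-2 repo)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

runner = ModelRunner.from_dir(engine_dir=engine_dir)

# Tokenize a single prompt; ModelRunner expects a list of int32 token-id tensors
input_ids = tokenizer("Where is New York?", return_tensors="pt").input_ids.to(torch.int32)

with torch.no_grad():
    outputs = runner.generate(
        [input_ids[0]],
        max_new_tokens=32,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,  # Llama has no pad token, so reuse EOS
        return_dict=True,
    )

# output_ids has shape [batch, beams, sequence_length] and includes the prompt
print(tokenizer.decode(outputs["output_ids"][0][0], skip_special_tokens=True))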

OpenAI API Endpoint

TensorRT-LLM has Python and C++ programming APIs available, but it also includes an example server endpoint for the OpenAI protocol, which makes it an easy drop-in substitute for other local or cloud model backends.

This will start the TensorRT-LLM container with the server and model that you built above:

jetson-containers run \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  python3 /opt/TensorRT-LLM/examples/apps/openai_server.py \
    /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq

Then you can make completion requests against it from practically any language and from any connected device. This example shows a simple way of testing it from another terminal with curl:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": <model_name>,
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'
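
The same request can also be made from Python. Here is a small sketch using the requests package; the port follows the command above, and the model name is a placeholder that should match whatever name your server actually serves:

import requests

# Hypothetical model name - substitute the name your server serves
payload = {
    "model": "Llama-2-7b-chat-hf-gptq",
    "prompt": "Where is New York?",
    "max_tokens": 16,
    "temperature": 0,
}

# Replace localhost with the Jetson's IP address when calling from another device
response = requests.post("http://localhost:8000/v1/completions", json=payload)
print(response.json())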

Alternatively, the code included in openai_client.py handles these requests using the standard openai-python package, which can be installed outside of the container or on another machine:

jetson-containers run \
  --workdir /opt/TensorRT-LLM/examples/apps \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
    python3 openai_client.py --prompt "Where is New York?" --api chat
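
For reference, here is a rough sketch of what such a client looks like with the openai-python package (v1-style client); the base URL and model name are assumptions that should match your running server:

from openai import OpenAI

# Point the client at the local TensorRT-LLM server instead of api.openai.com;
# replace localhost with the Jetson's IP address when running from another machine.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # placeholder; the local example server shouldn't require a real key
)

completion = client.chat.completions.create(
    model="Llama-2-7b-chat-hf-gptq",  # hypothetical name - use whatever your server serves
    messages=[{"role": "user", "content": "Where is New York?"}],
    max_tokens=16,
    temperature=0,
)
print(completion.choices[0].message.content)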

The patches in the branch above for TensorRT-LLM 0.12 are a preview release for Jetson AGX Orin, and we'll continue validating and testing the various settings in TensorRT-LLM. If you need any support, please post to the Jetson Developer Forums.