TensorRT Engine Files

The .engine and .plan file formats are used by NVIDIA's TensorRT for high-performance deep learning inference.

Functionality

TensorRT engine files contain serialized execution plans optimized for specific GPU hardware and driver versions. These optimizations lead to:

  • Low Latency: Faster inference speeds compared to standard frameworks.
  • High Throughput: More inferences per second on the same hardware.

Optimization Techniques

  • Layer Fusion: Merges multiple operations into a single kernel (e.g., Conv + ReLU).
  • Precision Calibration: Supports FP32, FP16, and INT8 quantization for efficiency.
  • Kernel Auto-tuning: Selects the best kernel implementation for the target hardware.
  • Memory Optimization: Reduces the memory footprint by reusing memory for intermediate tensors.
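
INT8 calibration maps floating-point activations onto 8-bit integers using a per-tensor scale derived from the observed dynamic range. A minimal sketch of the symmetric quantize/dequantize arithmetic behind it (plain Python, independent of TensorRT; the function names are illustrative):

```python
def int8_scale(max_abs):
    # Symmetric scale: map the observed range [-max_abs, max_abs]
    # onto the signed 8-bit range [-127, 127]
    return max_abs / 127.0

def quantize(x, scale):
    # Round to the nearest integer step and clamp to the int8 range
    q = round(x / scale)
    return max(-127, min(127, q))

def dequantize(q, scale):
    return q * scale

scale = int8_scale(1.27)   # calibration observed |activation| max of 1.27
q = quantize(0.5, scale)   # -> 50
assert abs(dequantize(q, scale) - 0.5) < scale  # error bounded by one step
```

Calibration's job is choosing `max_abs` well: too small clips outliers, too large wastes the 8-bit resolution on empty range.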

Creating Engine Files

import tensorrt as trt

# Create the builder, network, and ONNX parser
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model, surfacing any parser errors
with open("model.onnx", "rb") as model:
    if not parser.parse(model.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

# Build and serialize the engine (build_serialized_network replaces
# the build_engine API deprecated in TensorRT 8)
config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)

with open("model.engine", "wb") as f:
    f.write(serialized_engine)

From PyTorch (Torch-TensorRT)

import torch
import torch_tensorrt

# model: a trained torch.nn.Module in eval mode on the GPU
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.half)],
    enabled_precisions={torch.half},  # run in FP16
    workspace_size=1 << 30,           # 1 GiB of builder workspace
)

# Save the compiled TorchScript module (reload later with torch.jit.load)
torch.jit.save(trt_model, "model_torchtrt.engine")

Hardware Binding

TensorRT engine files are bound to the GPU architecture (compute capability) and TensorRT version they were built with. An engine built on a Tesla V100 (Volta, SM 7.0) will not run on a GeForce RTX 3090 (Ampere, SM 8.6); each target needs its own engine.
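
Because of this binding, engine caches are typically keyed by hardware and software version so engines built on different machines never collide. A minimal sketch of one such naming convention (the function and file layout are illustrative, not a TensorRT API):

```python
def engine_cache_name(model, compute_capability, trt_version):
    # One engine file per (model, GPU architecture, TensorRT version)
    # triple; a V100 (7, 0) and an RTX 3090 (8, 6) get distinct files
    major, minor = compute_capability
    return f"{model}.sm{major}{minor}.trt{trt_version}.engine"

name = engine_cache_name("resnet50", (8, 6), "10.0")
# -> "resnet50.sm86.trt10.0.engine"
```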

Loading and Using Engines

import tensorrt as trt

# Deserialize the engine from disk
logger = trt.Logger(trt.Logger.INFO)
with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# Create an execution context, which holds per-inference state such as
# input shapes and device memory bindings
context = engine.create_execution_context()
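
Before running inference, each input and output tensor needs a device buffer sized for its shape and dtype. The byte count itself is plain arithmetic, sketched here without TensorRT (in real code the shape would come from the deserialized engine rather than being hard-coded):

```python
import numpy as np

def binding_nbytes(shape, dtype):
    # Bytes needed for one I/O buffer: product of dims times element size
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

# A (1, 3, 224, 224) FP16 input: 150528 elements * 2 bytes each
nbytes = binding_nbytes((1, 3, 224, 224), np.float16)
# -> 301056
```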

Pro Tip

Building optimized engines can take several minutes. For production, pre-build engines and cache them to reduce application startup time.
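
The caching advice above amounts to a build-if-missing helper. A sketch under hypothetical names (`build_fn` stands in for any routine that returns the serialized engine bytes, such as the builder code earlier on this page):

```python
import os

def load_or_build_engine(path, build_fn):
    # Reuse a previously serialized engine when one is cached on disk;
    # otherwise build once and cache the bytes for the next startup
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    engine_bytes = build_fn()
    with open(path, "wb") as f:
        f.write(engine_bytes)
    return engine_bytes
```

The first call pays the multi-minute build cost; every later startup only pays a file read.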