TensorRT Engine Files

The .engine and .plan file formats are used by NVIDIA's TensorRT for high-performance deep learning inference.

Functionality

TensorRT engine files contain serialized execution plans optimized for specific GPU hardware and driver versions. These optimizations lead to:

  • Low Latency: Faster inference speeds compared to standard frameworks.
  • High Throughput: More inferences per second on the same hardware.

Optimization Techniques

  • Layer Fusion: Merges multiple operations into a single kernel (e.g., Conv + ReLU).
  • Precision Calibration: Supports FP32, FP16, and INT8 quantization for efficiency.
  • Kernel Auto-tuning: Selects the best kernel implementation for the target hardware.
  • Memory Optimization: Reduces the memory footprint by reusing memory for intermediate tensors.
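
INT8 calibration maps floating-point activations onto 8-bit integers using a per-tensor scale derived from the observed dynamic range. A minimal sketch of the symmetric quantize/dequantize arithmetic behind it (plain Python, independent of TensorRT; the function names are illustrative):

```python
def int8_scale(max_abs):
    # Symmetric scale: map the observed range [-max_abs, max_abs]
    # onto the signed 8-bit range [-127, 127]
    return max_abs / 127.0

def quantize(x, scale):
    # Round to the nearest integer step and clamp to the int8 range
    q = round(x / scale)
    return max(-127, min(127, q))

def dequantize(q, scale):
    return q * scale

scale = int8_scale(1.27)   # calibration observed |activation| max of 1.27
q = quantize(0.5, scale)   # -> 50
assert abs(dequantize(q, scale) - 0.5) < scale  # error bounded by one step
```

Calibration's job is choosing `max_abs` well: too small clips outliers, too large wastes the 8-bit resolution on empty range.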

Creating Engine Files

import tensorrt as trt

# Create the builder, network, and ONNX parser
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model, surfacing any parser errors
with open("model.onnx", "rb") as model:
    if not parser.parse(model.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

# Build and serialize the engine (build_serialized_network replaces
# the build_engine API deprecated in TensorRT 8)
config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)

with open("model.engine", "wb") as f:
    f.write(serialized_engine)

From PyTorch (Torch-TensorRT)

import torch
import torch_tensorrt

# model: a trained torch.nn.Module in eval mode on the GPU
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.half)],
    enabled_precisions={torch.half},  # run in FP16
    workspace_size=1 << 30,           # 1 GiB of builder workspace
)

# Save the compiled TorchScript module (reload later with torch.jit.load)
torch.jit.save(trt_model, "model_torchtrt.engine")

Hardware Binding

TensorRT engine files are bound to the GPU architecture (compute capability) and TensorRT version they were built with. An engine built on a Tesla V100 (Volta, SM 7.0) will not run on a GeForce RTX 3090 (Ampere, SM 8.6); each target needs its own engine.
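
Because of this binding, engine caches are typically keyed by hardware and software version so engines built on different machines never collide. A minimal sketch of one such naming convention (the function and file layout are illustrative, not a TensorRT API):

```python
def engine_cache_name(model, compute_capability, trt_version):
    # One engine file per (model, GPU architecture, TensorRT version)
    # triple; a V100 (7, 0) and an RTX 3090 (8, 6) get distinct files
    major, minor = compute_capability
    return f"{model}.sm{major}{minor}.trt{trt_version}.engine"

name = engine_cache_name("resnet50", (8, 6), "10.0")
# -> "resnet50.sm86.trt10.0.engine"
```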

Loading and Using Engines

import tensorrt as trt

# Deserialize the engine from disk
logger = trt.Logger(trt.Logger.INFO)
with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# Create an execution context, which holds per-inference state such as
# input shapes and device memory bindings
context = engine.create_execution_context()
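
Before running inference, each input and output tensor needs a device buffer sized for its shape and dtype. The byte count itself is plain arithmetic, sketched here without TensorRT (in real code the shape would come from the deserialized engine rather than being hard-coded):

```python
import numpy as np

def binding_nbytes(shape, dtype):
    # Bytes needed for one I/O buffer: product of dims times element size
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

# A (1, 3, 224, 224) FP16 input: 150528 elements * 2 bytes each
nbytes = binding_nbytes((1, 3, 224, 224), np.float16)
# -> 301056
```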

Pro Tip

Building optimized engines can take several minutes. For production, pre-build engines and cache them to reduce application startup time.
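
The caching advice above amounts to a build-if-missing helper. A sketch under hypothetical names (`build_fn` stands in for any routine that returns the serialized engine bytes, such as the builder code earlier on this page):

```python
import os

def load_or_build_engine(path, build_fn):
    # Reuse a previously serialized engine when one is cached on disk;
    # otherwise build once and cache the bytes for the next startup
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    engine_bytes = build_fn()
    with open(path, "wb") as f:
        f.write(engine_bytes)
    return engine_bytes
```

The first call pays the multi-minute build cost; every later startup only pays a file read.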