CV工程师
2023-05-06 06:07:19
Step one: connect a display, mouse, and keyboard to the Jetson; it powers on automatically once plugged in.
After booting, connect the Jetson to the network (a LAN in my case); once it is online you can control it remotely over SSH.
A free, open-source SSH client worth recommending: https://github.com/Eugeny/tabby
While using the board I noticed the heatsink was very hot and the fan was not spinning; a quick search showed that the fan has to be enabled manually:
sudo jetson_clocks --fan
sudo vi /sys/devices/pwm-fan/target_pwm
The default value in the fan-speed file is 255, which I found too loud. After changing it to 120, the fan is barely audible and the heatsink stays cool.
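The fan-speed step can be bundled into a small helper. A minimal Python sketch, assuming the PWM node really is at /sys/devices/pwm-fan/target_pwm (the path can differ across JetPack versions); clamp_pwm and set_fan_speed are my own helper names, not part of any Jetson tool, and writing the node requires root:

```python
import os

# Fan PWM sysfs node -- path observed on my JetPack 4.6 install; may differ
PWM_PATH = "/sys/devices/pwm-fan/target_pwm"

def clamp_pwm(value: int) -> int:
    """Clamp a requested PWM duty value into the valid 0-255 range."""
    return max(0, min(255, value))

def set_fan_speed(value: int, path: str = PWM_PATH) -> int:
    """Write the clamped PWM value to the sysfs node (requires root)."""
    pwm = clamp_pwm(value)
    with open(path, "w") as f:
        f.write(str(pwm))
    return pwm

if __name__ == "__main__":
    if os.path.exists(PWM_PATH):
        print("fan PWM set to", set_fan_speed(120))
    else:
        print("pwm-fan node not found (not running on a Jetson?)")
```

Run it with sudo on the board; the clamp keeps a typo like 1200 from writing an out-of-range value.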
After connecting remotely, run the following commands to check the environment:
git clone https://github.com/jetsonhacks/jetsonUtilities
cd jetsonUtilities/
ls
python jetsonInfo.py
When it finishes, it prints the current environment; mine shows:
NVIDIA Jetson Nano Developer Kit
L4T 32.6.1 [ JetPack 4.6 ]
Ubuntu 18.04.5 LTS
Kernel Version: 4.9.253-tegra
CUDA 10.2.300
CUDA Architecture: 5.3
OpenCV version: 4.1.1
OpenCV Cuda: NO
CUDNN: 8.2.1.32
TensorRT: 8.0.1.6
Vision Works: 1.6.0.501
VPI: 1.1.12
Vulcan: 1.2.70
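jetsonInfo.py mostly reads system files such as /etc/nv_tegra_release. As a lighter-weight check, the L4T version can be parsed directly; a sketch, assuming the release file starts with a header like the one in the docstring (format taken from my board, it may vary across JetPack versions):

```python
import os
import re

def parse_l4t_version(line: str) -> str:
    """Parse an /etc/nv_tegra_release header such as
    '# R32 (release), REVISION: 6.1, GCID: ...' into '32.6.1'."""
    m = re.search(r"R(\d+)\s*\(release\),\s*REVISION:\s*([\d.]+)", line)
    if not m:
        raise ValueError("unrecognized nv_tegra_release format")
    return f"{m.group(1)}.{m.group(2)}"

if __name__ == "__main__":
    path = "/etc/nv_tegra_release"
    if os.path.exists(path):
        with open(path) as f:
            print("L4T", parse_l4t_version(f.readline()))
    else:
        print("not a Jetson (no /etc/nv_tegra_release)")
```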
Since today's main goal is to benchmark the inference speed of an ONNX model after converting it to a TensorRT engine, I first checked the TensorRT environment. On the command line:
cd /usr/src/tensorrt/samples/
sudo make  # compiling takes about 7 minutes
../bin/sample_mnist
It then outputs the recognized digit (a 6 in my run). Very cool!
nvidia@nano:/usr/src/tensorrt/samples$ ../bin/sample_mnist
&&&& RUNNING TensorRT.sample_mnist [TensorRT v8001] # ../bin/sample_mnist
[05/06/2023-14:35:56] [I] Building and running a GPU inference engine for MNIST
[05/06/2023-14:35:58] [I] [TRT] [MemUsageChange] Init CUDA: CPU +203, GPU -1, now: CPU 221, GPU 3318 (MiB)
[05/06/2023-14:35:58] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 223 MiB, GPU 3274 MiB
[05/06/2023-14:35:58] [I] [TRT] ---------- Layers Running on DLA ----------
[05/06/2023-14:35:58] [I] [TRT] ---------- Layers Running on GPU ----------
[05/06/2023-14:35:58] [I] [TRT] [GpuLayer] (Unnamed Layer* 9) [Constant]
[05/06/2023-14:35:58] [I] [TRT] [GpuLayer] PWN((Unnamed Layer* 10) [ElementWise], scale)
[05/06/2023-14:35:58] [I] [TRT] [GpuLayer] conv1
[05/06/2023-14:35:58] [I] [TRT] [GpuLayer] pool1
[05/06/2023-14:35:58] [I] [TRT] [GpuLayer] conv2
[05/06/2023-14:35:58] [I] [TRT] [GpuLayer] pool2
[05/06/2023-14:35:58] [I] [TRT] [GpuLayer] shuffle_between_pool2_and_ip1
[05/06/2023-14:35:58] [I] [TRT] [GpuLayer] ip1 + relu1
[05/06/2023-14:35:58] [I] [TRT] [GpuLayer] ip2
[05/06/2023-14:35:58] [I] [TRT] [GpuLayer] prob
[05/06/2023-14:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU +161, now: CPU 381, GPU 3435 (MiB)
[05/06/2023-14:36:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +241, GPU -23, now: CPU 622, GPU 3412 (MiB)
[05/06/2023-14:36:02] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[05/06/2023-14:36:19] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[05/06/2023-14:36:21] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[05/06/2023-14:36:21] [I] [TRT] Total Host Persistent Memory: 4960
[05/06/2023-14:36:21] [I] [TRT] Total Device Persistent Memory: 6144
[05/06/2023-14:36:21] [I] [TRT] Total Scratch Memory: 0
[05/06/2023-14:36:21] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 16 MiB
[05/06/2023-14:36:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 866, GPU 3442 (MiB)
[05/06/2023-14:36:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 866, GPU 3442 (MiB)
[05/06/2023-14:36:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 866, GPU 3443 (MiB)
[05/06/2023-14:36:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 865, GPU 3443 (MiB)
[05/06/2023-14:36:21] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 865 MiB, GPU 3443 MiB
[05/06/2023-14:36:21] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 867, GPU 3443 (MiB)
[05/06/2023-14:36:21] [I] [TRT] Loaded engine size: 1 MB
[05/06/2023-14:36:21] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 867 MiB, GPU 3443 MiB
[05/06/2023-14:36:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 867, GPU 3443 (MiB)
[05/06/2023-14:36:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 867, GPU 3443 (MiB)
[05/06/2023-14:36:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 867, GPU 3443 (MiB)
[05/06/2023-14:36:21] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 867 MiB, GPU 3443 MiB
[05/06/2023-14:36:21] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 864 MiB, GPU 3443 MiB
[05/06/2023-14:36:21] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 864, GPU 3443 (MiB)
[05/06/2023-14:36:21] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 864, GPU 3443 (MiB)
[05/06/2023-14:36:21] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 864 MiB, GPU 3443 MiB
[05/06/2023-14:36:21] [I] Input:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@%.:@@@@@@@@@@@@
@@@@@@@@@@@@@: *@@@@@@@@@@@@
@@@@@@@@@@@@* =@@@@@@@@@@@@@
@@@@@@@@@@@% :@@@@@@@@@@@@@@
@@@@@@@@@@@- *@@@@@@@@@@@@@@
@@@@@@@@@@# .@@@@@@@@@@@@@@@
@@@@@@@@@@: #@@@@@@@@@@@@@@@
@@@@@@@@@+ -@@@@@@@@@@@@@@@@
@@@@@@@@@: %@@@@@@@@@@@@@@@@
@@@@@@@@+ +@@@@@@@@@@@@@@@@@
@@@@@@@@:.%@@@@@@@@@@@@@@@@@
@@@@@@@% -@@@@@@@@@@@@@@@@@@
@@@@@@@% -@@@@@@#..:@@@@@@@@
@@@@@@@% +@@@@@- :@@@@@@@
@@@@@@@% =@@@@%.#@@- +@@@@@@
@@@@@@@@..%@@@*+@@@@ :@@@@@@
@@@@@@@@= -%@@@@@@@@ :@@@@@@
@@@@@@@@@- .*@@@@@@+ +@@@@@@
@@@@@@@@@@+ .:-+-: .@@@@@@@
@@@@@@@@@@@@+: :*@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Yesterday I benchmarked the model with trtexec on Windows, so today I used trtexec here as well. On Windows no compilation is needed: after downloading TensorRT you can run it directly from the bin directory (C:\TensorRT-8.4.0.6\bin in my case). On the Jetson, however, trtexec still has to be compiled first, and a naive build fails.
In /usr/src/tensorrt/samples there is a trtexec folder containing these files:
Makefile prn_utils.py profiler.py README.md tracer.py trtexec.cpp
Build command:
sudo make CUDA_INSTALL_DIR=/usr/local/cuda
On success, the build ends with:
Compiling: ../common/sampleEngines.cpp
Linking: ../../bin/trtexec
Back in the bin directory (/usr/src/tensorrt/bin), trtexec is now built. Next, add it to PATH:
vi ~/.bashrc
# add this line at the end of the file
export PATH=/usr/src/tensorrt/bin:$PATH
# save and exit, then
source ~/.bashrc
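To confirm the PATH change took effect, `which trtexec` should now resolve to /usr/src/tensorrt/bin/trtexec. The same check can be scripted; `on_path` below is a hypothetical helper of mine, not a standard utility:

```python
import os
import shutil

def on_path(binary: str, expected_dir: str) -> bool:
    """True if `binary` resolves to an executable inside expected_dir."""
    found = shutil.which(binary)
    return found is not None and os.path.dirname(found) == expected_dir

if __name__ == "__main__":
    print("trtexec ready:", on_path("trtexec", "/usr/src/tensorrt/bin"))
```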
The ONNX model first needs to be simplified with onnxsim on a PC; otherwise parsing fails with:
ModelImporter.cpp:682: Failed to parse ONNX model from file: best.onnx
onnxsim repository: https://github.com/daquexian/onnx-simplifier
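onnxsim can also be invoked from the command line as `python3 -m onnxsim input.onnx output.onnx`. A small wrapper sketch, assuming onnxsim is pip-installed on the PC; the file names match the best.onnx/bestsim.onnx used here, and the helper names are my own:

```python
import os
import shutil
import subprocess

def onnxsim_cmd(src: str, dst: str) -> list:
    """Command line for onnx-simplifier's module entry point."""
    return ["python3", "-m", "onnxsim", src, dst]

def simplify(src: str, dst: str) -> None:
    """Run onnx-simplifier on src and write the simplified model to dst."""
    subprocess.run(onnxsim_cmd(src, dst), check=True)

if __name__ == "__main__":
    if shutil.which("python3") and os.path.exists("best.onnx"):
        simplify("best.onnx", "bestsim.onnx")
    else:
        print("would run:", " ".join(onnxsim_cmd("best.onnx", "bestsim.onnx")))
```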
With the simplified ONNX model in hand, use trtexec to convert it to a TensorRT engine:
trtexec --onnx=bestsim.onnx --fp16 --saveEngine=yolo.trt
The conversion takes quite a while; be patient.
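The two trtexec invocations (conversion above, benchmark below) are easy to script for repeated runs. A sketch using only the flags shown in this post; build_cmd and bench_cmd are my own helper names:

```python
import os
import shutil
import subprocess

def build_cmd(onnx_path: str, engine_path: str, fp16: bool = True) -> list:
    """trtexec command to convert an ONNX model into a serialized engine."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    return cmd

def bench_cmd(engine_path: str, batch: int = 1) -> list:
    """trtexec command to benchmark a previously built engine."""
    return ["trtexec", f"--loadEngine={engine_path}", f"--batch={batch}"]

if __name__ == "__main__":
    if shutil.which("trtexec") and os.path.exists("bestsim.onnx"):
        subprocess.run(build_cmd("bestsim.onnx", "yolo.trt"), check=True)
        subprocess.run(bench_cmd("yolo.trt"), check=True)
    else:
        print("would run:")
        print(" ".join(build_cmd("bestsim.onnx", "yolo.trt")))
        print(" ".join(bench_cmd("yolo.trt")))
```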
Then benchmark it:
trtexec --loadEngine=yolo.trt --batch=1
Results:
[05/06/2023-18:06:44] [I] Average on 10 runs - GPU latency: 5.1874 ms - Host latency: 5.29001 ms (end to end 5.3407 ms, enqueue 5.16296 ms)
[05/06/2023-18:06:44] [I] Average on 10 runs - GPU latency: 5.15825 ms - Host latency: 5.25881 ms (end to end 5.30935 ms, enqueue 5.13562 ms)
[05/06/2023-18:06:44] [I] Average on 10 runs - GPU latency: 5.10278 ms - Host latency: 5.20054 ms (end to end 5.25024 ms, enqueue 5.08047 ms)
[05/06/2023-18:06:44] [I] Average on 10 runs - GPU latency: 5.35928 ms - Host latency: 5.46848 ms (end to end 5.52202 ms, enqueue 5.33289 ms)
[05/06/2023-18:06:44] [I] Average on 10 runs - GPU latency: 5.13918 ms - Host latency: 5.2394 ms (end to end 5.28948 ms, enqueue 5.11609 ms)
[05/06/2023-18:06:44] [I] Average on 10 runs - GPU latency: 5.40454 ms - Host latency: 5.51218 ms (end to end 5.5656 ms, enqueue 5.37795 ms)
[05/06/2023-18:06:44] [I] Average on 10 runs - GPU latency: 5.64929 ms - Host latency: 5.76643 ms (end to end 5.82332 ms, enqueue 5.62112 ms)
[05/06/2023-18:06:44] [I] Average on 10 runs - GPU latency: 5.41238 ms - Host latency: 5.52266 ms (end to end 5.57568 ms, enqueue 5.38662 ms)
[05/06/2023-18:06:44] [I]
[05/06/2023-18:06:44] [I] === Performance summary ===
[05/06/2023-18:06:44] [I] Throughput: 175.036 qps
[05/06/2023-18:06:44] [I] Latency: min = 3.97729 ms, max = 10.0389 ms, mean = 5.60913 ms, median = 5.57353 ms, percentile(99%) = 8.49084 ms
[05/06/2023-18:06:44] [I] End-to-End Host Latency: min = 4.0282 ms, max = 10.144 ms, mean = 5.66353 ms, median = 5.62885 ms, percentile(99%) = 8.54956 ms
[05/06/2023-18:06:44] [I] Enqueue Time: min = 4.83002 ms, max = 9.78864 ms, mean = 5.46975 ms, median = 5.43457 ms, percentile(99%) = 8.32886 ms
[05/06/2023-18:06:44] [I] H2D Latency: min = 0.00354004 ms, max = 0.108398 ms, mean = 0.0543735 ms, median = 0.0537109 ms, percentile(99%) = 0.0881348 ms
[05/06/2023-18:06:44] [I] GPU Compute Time: min = 3.90381 ms, max = 9.8241 ms, mean = 5.4972 ms, median = 5.4599 ms, percentile(99%) = 8.36316 ms
[05/06/2023-18:06:44] [I] D2H Latency: min = 0.00268555 ms, max = 0.111145 ms, mean = 0.0575538 ms, median = 0.0585938 ms, percentile(99%) = 0.0859375 ms
[05/06/2023-18:06:44] [I] Total Host Walltime: 3.01653 s
[05/06/2023-18:06:44] [I] Total GPU Compute Time: 2.90252 s
[05/06/2023-18:06:44] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/06/2023-18:06:44] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/06/2023-18:06:44] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/06/2023-18:06:44] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # trtexec --loadEngine=yolo.trt --batch=1
[05/06/2023-18:06:44] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 626, GPU 3520 (MiB)
Benchmark complete.
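For reference, the mean GPU compute time of about 5.5 ms corresponds to at most 1000 / 5.4972 ≈ 182 inferences per second, while the measured throughput is lower (175 qps) because, as the warning in the log notes, enqueue time dominates. A small sketch for pulling these numbers out of a saved trtexec log; the helper names are mine:

```python
import re

def parse_throughput(log: str) -> float:
    """Extract 'Throughput: X qps' from trtexec output."""
    m = re.search(r"Throughput:\s*([\d.]+)\s*qps", log)
    if not m:
        raise ValueError("no throughput line found")
    return float(m.group(1))

def fps_from_latency_ms(mean_latency_ms: float) -> float:
    """Upper-bound FPS implied by mean GPU compute time at batch size 1."""
    return 1000.0 / mean_latency_ms

if __name__ == "__main__":
    sample = "[05/06/2023-18:06:44] [I] Throughput: 175.036 qps"
    print(parse_throughput(sample))            # 175.036
    print(round(fps_from_latency_ms(5.4972)))  # 182
```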
References:
Fan control: https://blog.csdn.net/qq_51491920/article/details/126282970
TensorRT: https://www.cnblogs.com/yinqiyu/p/16901750.html
Building trtexec: https://blog.csdn.net/zhw864680355/article/details/80976903