Deploying SSD MobileNet V2 on the NVIDIA Jetson and Nano platforms
For one of our clients we were asked to port an object detection neural network to an NVIDIA-based mobile platform (Jetson and Nano based). The neural network, created in TensorFlow, was based on the SSD MobileNet V2 network, but had a number of customizations to make it more suitable for the particular problem that the client faced. During the course of this project we realized that the available open-source resources had several problems for which there were no clear solutions. These problems are discussed in various places, such as GitHub Issues against the TensorRT and TensorFlow models repositories, the NVIDIA developer forums and StackOverflow. In this post we cover all the problems we faced and the solutions we found, in the hope that it helps others with deploying their solutions on these mobile devices.
Training and Conversion Process
The conversion from a TensorFlow checkpoint to an optimized deployment binary requires the following steps:
1. Download and set up the TensorFlow Object Detection API.
2. Download a pre-trained checkpoint.
3. Train the network using new data, starting from the downloaded checkpoint. When using your own training data you often change the number of classes and the resolution. For this example we use the following settings:
   ● 6 object classes.
   ● An image resolution of 450 by 450 pixels.
4. Export the trained network to a frozen graph.
5. Convert the frozen graph to a UFF file via TensorRT.
6. Load the UFF file and build a TensorRT execution engine.
For steps 1 to 4 we use the tools supplied by the Google TensorFlow team. As it turns out, the latest versions of those tools are incompatible with the publicly released tools and configuration files required for steps 5 and 6. Since steps 5 and 6 are needed to create a high-performance, standalone inference engine for our target devices, we are in a bit of a pickle. One common solution is to use a version of the Object Detection tools that is nearly two years old (early 2018) instead of the latest one. Although that works, it is not preferable, as we lose out on two years of developments, features, bug fixes and optimizations. We therefore set out to fix the reported and observed problems so that all of the above steps can be completed using the latest (at the time of publishing)* version of the TensorFlow Object Detection tool-set.
In the following sections we cover the problems we observed and the fixes we found for them. With all of the fixes below applied, we were able to successfully (re)train MobileNet V2 (with different feature extraction back-ends), convert it to UFF and build a TensorRT execution engine.
* Software versions used:
TensorRT 188.8.131.52 (with this graphsurgeon fix)
TensorFlow Object Detection API, December 7, hash: 9cae3c4fb9e342b7c2b039f0853f1e74e58d4bbe
TensorRT conversion process
To convert a frozen graph to UFF you can use the convert-to-uff conversion tool (a wrapper around the uff.from_tensorflow API) together with a config.py configuration file. The latter contains references to the plugins (including their settings) and the graph modification operations required to convert the graph into a set of operations that is supported by TensorRT. Once a UFF file is generated, you have to load it into the TensorRT UFF Parser in order to build an execution engine. These steps can either be integrated into a single program, as done in this example, or performed separately. In this post we use the single-program setup, with the TRT_object_detection repository as the base for the experiments.
When a correct configuration is used, the frozen graph is converted into a UFF file, which is then loaded by the parser to create a network. Finally, this network is used to build and optimize an execution engine for the target platform. If the configuration file used does not exactly match the settings used in the frozen graph then any of the above steps can fail. In the next section we list a number of the commonly observed problems and the steps you have to take to fix them.
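The three stages described above (convert, parse, build) can be sketched roughly as follows. This is a minimal sketch, not the code from the repository mentioned above. It assumes the UFF-era TensorRT Python packages (graphsurgeon, uff and tensorrt, roughly TensorRT 5 to 7) are installed on the device. The function name build_engine, the preprocess callback (standing in for the graph modifications from config.py), the tensor names "Input", "NMS" and "MarkOutput_0", and the workspace size are assumptions modeled on the stock SSD sample configuration and may differ in your setup.

```python
def build_engine(frozen_graph_path, preprocess, input_shape=(3, 450, 450),
                 uff_path="tmp.uff"):
    """Convert a frozen graph to UFF, parse it and build a TensorRT engine.

    `preprocess` is a hypothetical callback that applies the plugin mappings
    and graph fixes (as defined in config.py) to a graphsurgeon DynamicGraph.
    """
    # Imported lazily: these packages are only available where TensorRT is installed.
    import graphsurgeon as gs
    import tensorrt as trt
    import uff

    logger = trt.Logger(trt.Logger.INFO)
    # Registers the bundled plugins such as NMS_TRT and GridAnchor_TRT.
    trt.init_libnvinfer_plugins(logger, "")

    # Stage 1: modify the frozen graph and write it out as a UFF file.
    dynamic_graph = preprocess(gs.DynamicGraph(frozen_graph_path))
    uff.from_tensorflow(dynamic_graph.as_graph_def(),
                        output_nodes=["NMS"],      # assumed output node name
                        output_filename=uff_path)

    # Stages 2 and 3: parse the UFF file into a network and build the engine.
    with trt.Builder(logger) as builder, \
            builder.create_network() as network, \
            trt.UffParser() as parser:
        builder.max_workspace_size = 1 << 28       # assumed value, tune per device
        builder.max_batch_size = 1
        parser.register_input("Input", input_shape)   # assumed input tensor name
        parser.register_output("MarkOutput_0")        # assumed output tensor name
        if not parser.parse(uff_path, network):
            raise RuntimeError("UFF parsing failed; check the plugin configuration")
        return builder.build_cuda_engine(network)
```

If the configuration does not match the frozen graph, the failure typically surfaces either in uff.from_tensorflow or in parser.parse, which is why the sketch checks the return value of the latter.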
Observed conversion problems
We can split the observed problems into two categories:
Errors related to changes of the network for fine-tuning and retraining (number of classes, resolution).
Errors related to newer software versions (conversion errors, missing definitions, renamed classes and options, etc.).
Fine-tuning problems and solutions
Changed number of training classes
When you change the number of training classes, you also have to update the number of classes configured for the "NMS_TRT" plugin. If you leave the "numClasses" parameter unchanged, you will get an error.
The correct value is one greater than the number of object classes you have defined, because the background class must be included. For our example of 6 object classes we therefore have to set numClasses to 7.
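As a small illustration of the rule above, the value can be derived instead of hard-coded. This is only a sketch: the helper name nms_num_classes and the nms_overrides dictionary are hypothetical, and the remaining NMS_TRT parameters are assumed to stay as they are in your existing config.py.

```python
NUM_OBJECT_CLASSES = 6  # object classes defined in the training configuration


def nms_num_classes(num_object_classes):
    """Return the NMS_TRT numClasses value: object classes plus background."""
    return num_object_classes + 1


# Keyword arguments to merge into the existing NMS_TRT plugin definition,
# e.g. gs.create_plugin_node(name="NMS", op="NMS_TRT", ..., **nms_overrides).
nms_overrides = {"numClasses": nms_num_classes(NUM_OBJECT_CLASSES)}
print(nms_overrides)  # {'numClasses': 7}
```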
Changed input resolution
When the resolution of the input images is changed, this affects the location and scaling of the grid anchors that are used by the feature extractor. If you do not update the featureMapShapes parameter of the TensorRT GridAnchor plugin accordingly, you will get an error.
To solve this problem, you have to determine the correct sizes of the feature maps. You can get those sizes by inspecting the generated graph (e.g. in TensorBoard) or using the tool below (trimmed to fit this page), which is based on this StackOverflow post.
Using this program we get the following for an image resolution of 450px:
Next, take the first number of each tuple and enter these numbers as the featureMapShapes of the GridAnchor plugin in your TensorRT configuration file.
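As a quick cross-check of the feature map sizes, they can also be estimated with plain arithmetic. The sketch below is not the tool referred to above. It assumes the stock SSD MobileNet V2 layout: six feature maps, the first at an overall stride of 16, each following one produced by a stride-2 convolution with SAME padding (so sizes are rounded up). The function name ssd_feature_map_sizes is hypothetical, and a customized feature extractor may deviate from this, so verify the numbers against the actual graph.

```python
def ssd_feature_map_sizes(resolution, first_stride=16, num_layers=6):
    """Estimate square feature map sizes for an SSD head (assumed layout)."""
    # Ceiling division models SAME padding: output = ceil(input / stride).
    size = -(-resolution // first_stride)
    sizes = [size]
    for _ in range(num_layers - 1):
        size = -(-size // 2)  # each extra layer halves the map, rounding up
        sizes.append(size)
    return sizes


# Keyword arguments to merge into the existing GridAnchor_TRT plugin definition.
grid_anchor_overrides = {
    "featureMapShapes": ssd_feature_map_sizes(450),
    "numLayers": 6,
}
print(ssd_feature_map_sizes(300))                 # [19, 10, 5, 3, 2, 1]
print(grid_anchor_overrides["featureMapShapes"])  # [29, 15, 8, 4, 2, 1]
```

For the default 300 by 300 resolution the sketch reproduces the sizes used in the stock configuration, which is a useful sanity check before trusting it for another resolution.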
Problems and solutions related to newer software versions
Incorrect input order for the NMS plugin
For networks that use the "NMS_TRT" plugin (e.g. MobileNet and other object detection networks) you have to specify the inputOrder parameter. This is a list containing the integers 0, 1 and 2, each of which refers to the matching input of the node as defined in the network. If this order is incorrect, parsing crashes because the tensor sizes do not match. This is the same error message that results from using incorrect feature map shapes after changing the resolution, so be careful how you interpret that error!
To solve this problem, you can either try all 6 permutations or inspect the text version of the UFF file. Here we do the latter. Open the UFF.pbtxt file, browse to the NMS node and inspect the order of its 'inputs'.
In our case the inputs are listed in the following order: Squeeze, concat_priorbox, concat_box_conf.
The NMS plugin requires them in the following order: Squeeze, concat_box_conf, concat_priorbox.
We therefore have to remap the inputs so that the order is correct, using the following parameter value: inputOrder=[0, 2, 1]
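The remapping can also be derived mechanically from the two orderings. The sketch below assumes that entry i of inputOrder is the position, in the node's actual input list, of the i-th input the plugin expects (location data, confidence data, prior boxes). The function name compute_input_order is hypothetical, and the input names are the ones listed above.

```python
def compute_input_order(observed_inputs, required_inputs):
    """Return, for each required input, its index in the observed input list."""
    if sorted(observed_inputs) != sorted(required_inputs):
        raise ValueError("observed and required inputs must contain the same names")
    return [observed_inputs.index(name) for name in required_inputs]


# Order found in the NMS node of the UFF.pbtxt file.
observed = ["Squeeze", "concat_priorbox", "concat_box_conf"]
# Order the NMS plugin requires.
required = ["Squeeze", "concat_box_conf", "concat_priorbox"]

input_order = compute_input_order(observed, required)
print(input_order)  # [0, 2, 1]
```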
Unsupported operation _Cast
Newer versions of the Object Detection toolkit use a different name for the Input operation. The original MobileNet config is set up to look for the 'ToFloat' operation, but this has now been renamed to '_Cast'. Hence you have to change or add the corresponding entry in the namespace_plugin_map so that the renamed operation is mapped to the Input node.
This solves the "Unsupported operation _Cast" error.
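A minimal sketch of such a mapping change is shown below. The helper add_input_aliases is hypothetical, the Input placeholder stands in for the real Input plugin node from config.py, and the "Preprocessor" entry is assumed from the stock configuration. The key "Cast" is also an assumption: use the name the cast node actually carries in your own frozen graph (the converter reports the operation as _Cast).

```python
def add_input_aliases(namespace_plugin_map, input_node, aliases=("ToFloat", "Cast")):
    """Return a copy of the map in which every alias points at the Input node."""
    updated = dict(namespace_plugin_map)
    for alias in aliases:
        updated[alias] = input_node
    return updated


# Placeholder for the Input plugin node that config.py normally creates.
Input = "<Input plugin node>"

# Excerpt of an existing mapping (assumed); other entries stay untouched.
namespace_plugin_map = {"Preprocessor": Input, "ToFloat": Input}
namespace_plugin_map = add_input_aliases(namespace_plugin_map, Input)
print(sorted(namespace_plugin_map))  # ['Cast', 'Preprocessor', 'ToFloat']
```

Keeping the old 'ToFloat' entry alongside the new one lets the same configuration work for graphs exported with either version of the toolkit.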
Problems parsing GridAnchor
The final problem that we address in this post is related to the GridAnchor. When using the latest versions of the Object Detection API and the UFF converter, the resulting UFF file is missing an input element for the GridAnchor node. This results in a parsing failure.
To work around the problem of the missing input node, you can manually define a constant input tensor and set it as the input of the GridAnchor node. In this example the values and dimensions of the constant tensor are based on those used with older versions of the Object Detection API, namely: [1, 1]. To create and add the node we use the graphsurgeon Python library, which is already used to remove and rename nodes during the UFF conversion process.
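The workaround can be sketched as follows. This is a minimal sketch under a few assumptions: the function name add_grid_anchor_input and the node name "AnchorInput" are hypothetical, the GridAnchor plugin node is assumed to have the op "GridAnchor_TRT" as in the stock configuration, and the function is meant to run after the namespaces have been collapsed into plugin nodes.

```python
def add_grid_anchor_input(dynamic_graph, anchor_input_name="AnchorInput"):
    """Add a constant [1, 1] tensor and wire it in as the GridAnchor input."""
    # Imported lazily: graphsurgeon ships with TensorRT and may not be installed elsewhere.
    import numpy as np
    import graphsurgeon as gs

    # Constant tensor with the values described above.
    data = np.array([1, 1], dtype=np.float32)
    anchor_input = gs.create_node(anchor_input_name, "Const", value=data)
    dynamic_graph.append(anchor_input)

    # Make the constant the first input of every GridAnchor plugin node.
    for node in dynamic_graph.find_nodes_by_op("GridAnchor_TRT"):
        node.input.insert(0, anchor_input_name)
    return dynamic_graph
```

A natural place to call this is at the end of the graph preprocessing step in config.py, just before the graph is handed to the UFF converter.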
The above steps show an example of the problems one can encounter when trying to use a combination of closed and open-source software where the supported versions are not always kept in sync. We hope that others can benefit from the exercise we went through in order to debug and fix all the above reported problems.
We have tested the above on the following networks, which were pre-trained and then re-trained by us:
The full configuration file that we used can be found here (note here we use the default settings for a network trained with the COCO dataset; 90 classes, 300x300 pixel resolution). This configuration file can be used in combination with the parse and build code in this repository.
The following configuration files contain all the above described fixes: