EfficientDet: Scalable and Efficient Object Detection
Authors: Google Research, Brain Team
Mingxing Tan Ruoming Pang Quoc V. Le
This summary covers the most important parts of the paper.
Summary:
The authors study architecture design choices and optimizations for object detection and propose a few methods to increase accuracy while decreasing memory and compute cost.
They propose BiFPN, a weighted bi-directional feature pyramid network.
They propose compound scaling to jointly scale the resolution, width, and depth of the backbone, feature network, and box/class prediction network.
EfficientDet-D7 outperforms prior SOTA models with fewer FLOPs.
Work:
The paper asks whether it is possible to build a scalable detection architecture with both higher accuracy and better efficiency across all kinds of devices, from resource-constrained to unconstrained.
Challenge 1: Efficient multi-scale feature fusion
FPNs (Feature Pyramid Networks) are used for multi-scale feature fusion in a wide variety of networks. Most previous works simply sum the different input features without distinction, but each input feature contributes differently to the fused output.
To address this issue, we propose a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN), which introduces learnable weights to learn the importance of different input features, while repeatedly applying top-down and bottom-up multi-scale feature fusion.
Challenge 2: Model scaling
While most models rely on bigger backbones and larger input resolutions for higher accuracy, it is observed that accuracy also increases when resolution, depth, and width are scaled up jointly for the backbone, feature network, and box/class prediction networks.
BiFPN: Efficient bidirectional cross-scale connections and weighted feature fusion
The conventional FPN aggregates multi-scale features in a top-down manner:
$P^{out}_7 = Conv(P^{in}_7)$
$P^{out}_6 = Conv(P^{in}_6 + Resize(P^{out}_7))$
...
$P^{out}_3 = Conv(P^{in}_3 + Resize(P^{out}_4))$
Here Resize is usually an upsampling or downsampling op for resolution matching, and Conv is usually a convolutional op for feature processing.
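The top-down recurrence above can be traced with a small sketch. This is only an illustration under stated assumptions: features are 1-D Python lists rather than image tensors, Resize is nearest-neighbour upsampling by 2, and Conv is replaced by an identity placeholder so the data flow stays visible. The function names (upsample2x, conv, fpn_top_down) are hypothetical and do not come from the paper.

```python
def upsample2x(feature):
    """Nearest-neighbour upsampling of a 1-D feature list by a factor of 2."""
    return [value for value in feature for _ in range(2)]


def conv(feature):
    """Placeholder for the Conv op (assumption: identity, to keep the flow traceable)."""
    return list(feature)


def fpn_top_down(inputs):
    """Apply P_out[l] = Conv(P_in[l] + Resize(P_out[l + 1])) from the top level down.

    inputs: dict mapping level -> 1-D feature list, where each level is
    half the length of the level below it.
    """
    levels = sorted(inputs, reverse=True)  # e.g. [7, 6, 5, 4, 3]
    top = levels[0]
    outputs = {top: conv(inputs[top])}
    for level in levels[1:]:
        resized = upsample2x(outputs[level + 1])
        summed = [a + b for a, b in zip(inputs[level], resized)]
        outputs[level] = conv(summed)
    return outputs


# Toy pyramid: level 7 has 1 element, level 3 has 16, all set to 1.0.
p_in = {level: [1.0] * (2 ** (7 - level)) for level in range(3, 8)}
p_out = fpn_top_down(p_in)
print({level: feature[0] for level, feature in p_out.items()})
```

With an identity Conv and all-ones inputs, each level accumulates one more unit than the level above it, which makes the one-directional information flow of the plain FPN easy to see.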
This paper proposes several optimizations for cross-scale connections. First, we remove those nodes that only have one input edge. Our intuition is simple: if a node has only one input edge with no feature fusion, then it will have less contribution to a feature network that aims at fusing different features. This leads to a simplified bi-directional network. Second, we add an extra edge from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost. Third, unlike PANet [26], which only has one top-down and one bottom-up path, we treat each bidirectional (top-down and bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion. The number of repetitions depends on the resource constraints of the target device.
Weighted Fusion:
It is observed that since different input features are at different resolutions, they usually contribute to the output feature unequally. To address this issue, we propose to add an additional weight for each input, and let the network learn the importance of each input feature. Based on this idea, we consider three weighted fusion approaches:
Unbounded fusion: $O = \sum_i w_i \cdot I_i$, where $w_i$ is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). We find a scalar can achieve comparable accuracy to other approaches with minimal computational costs. However, since the scalar weight is unbounded, it could potentially cause training instability. Therefore, we resort to weight normalization to bound the value range of each weight.
Softmax-based fusion: $O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i$. An intuitive idea is to apply softmax to each weight, such that all weights are normalized to be a probability with values ranging from 0 to 1, representing the importance of each input. However, as shown in our ablation study in section 6.3, the extra softmax leads to significant slowdown on GPU hardware. To minimize the extra latency cost, we further propose a fast fusion approach.
Fast normalized fusion: $O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$, where $w_i \geq 0$ is ensured by applying a ReLU after each $w_i$, and $\epsilon = 0.0001$ is a small value to avoid numerical instability. Similarly, the value of each normalized weight also falls between 0 and 1, but since there is no softmax operation here, it is much more efficient. Our ablation study shows this fast fusion approach has very similar learning behavior and accuracy as the softmax-based fusion, but runs up to 30% faster on GPUs.
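The three fusion rules can be compared side by side in a few lines. The sketch below is a simplified illustration, not the paper's implementation: each input feature is a single float (the per-feature scalar-weight case), the weights are fixed numbers rather than learned parameters, and the function names are hypothetical.

```python
import math


def unbounded_fusion(weights, inputs):
    """O = sum_i w_i * I_i, with no constraint on the weights."""
    return sum(w * x for w, x in zip(weights, inputs))


def softmax_fusion(weights, inputs):
    """O = sum_i softmax(w)_i * I_i, so the normalized weights sum to 1."""
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))


def fast_normalized_fusion(weights, inputs, eps=1e-4):
    """O = sum_i w_i / (eps + sum_j w_j) * I_i, with w_i >= 0 enforced by ReLU."""
    clipped = [max(w, 0.0) for w in weights]  # ReLU on each weight
    total = eps + sum(clipped)
    return sum(w / total * x for w, x in zip(clipped, inputs))


weights = [1.0, 3.0]
inputs = [10.0, 20.0]
unbounded = unbounded_fusion(weights, inputs)
softmaxed = softmax_fusion(weights, inputs)
fast = fast_normalized_fusion(weights, inputs)
print(unbounded, softmaxed, fast)
```

In this toy case the unbounded result (70.0) lies outside the range of the inputs, while both normalized variants stay between 10 and 20. The fast variant gets there with a ReLU and a division instead of exponentials.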
Our final BiFPN integrates both the bidirectional cross-scale connections and the fast normalized fusion. As a concrete example, here we describe the two fused features at level 6 for BiFPN shown in Figure 2(d):
$P^{td}_6 = Conv\left(\frac{w_1 \cdot P^{in}_6 + w_2 \cdot Resize(P^{in}_7)}{w_1 + w_2 + \epsilon}\right)$
$P^{out}_6 = Conv\left(\frac{w'_1 \cdot P^{in}_6 + w'_2 \cdot P^{td}_6 + w'_3 \cdot Resize(P^{out}_5)}{w'_1 + w'_2 + w'_3 + \epsilon}\right)$
where $P^{td}_6$ is the intermediate feature at level 6 on the top-down pathway, and $P^{out}_6$ is the output feature at level 6 on the bottom-up pathway. All other features are constructed in a similar manner. Notably, to further improve the efficiency, we use depthwise separable convolution [7, 37] for feature fusion, and add batch normalization and activation after each convolution.
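The two level-6 nodes can be sketched as follows. This is a toy illustration under several assumptions: features are 1-D lists, Resize is nearest-neighbour upsampling (from level 7) or stride-2 subsampling (from level 5), Conv is an identity placeholder instead of a depthwise separable convolution with batch normalization and activation, and the weights are fixed rather than learned. All function names are hypothetical.

```python
EPS = 1e-4  # the small epsilon used by fast normalized fusion


def fuse(weights, features):
    """Fast normalized fusion of equally sized 1-D features."""
    clipped = [max(w, 0.0) for w in weights]  # ReLU keeps weights non-negative
    total = EPS + sum(clipped)
    length = len(features[0])
    return [
        sum(w * feature[k] for w, feature in zip(clipped, features)) / total
        for k in range(length)
    ]


def upsample2x(feature):
    """Resize to the next finer level (assumption: nearest neighbour)."""
    return [value for value in feature for _ in range(2)]


def downsample2x(feature):
    """Resize to the next coarser level (assumption: keep every second element)."""
    return feature[::2]


def conv(feature):
    """Identity stand-in for the Conv op."""
    return list(feature)


def bifpn_level6(p6_in, p7_in, p5_out, w_td, w_out):
    """Return (P6_td, P6_out) for one BiFPN layer."""
    p6_td = conv(fuse(w_td, [p6_in, upsample2x(p7_in)]))
    p6_out = conv(fuse(w_out, [p6_in, p6_td, downsample2x(p5_out)]))
    return p6_td, p6_out


p6_td, p6_out = bifpn_level6(
    p6_in=[1.0, 1.0],
    p7_in=[2.0],
    p5_out=[3.0, 3.0, 3.0, 3.0],
    w_td=[1.0, 1.0],
    w_out=[1.0, 1.0, 1.0],
)
print(p6_td, p6_out)
```

Note that the output node takes three inputs: the original level-6 input (the extra same-level edge), the top-down intermediate feature, and the resized output from level 5.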
EfficientDet architecture:
We employ ImageNet-pretrained EfficientNets as the backbone network. Our proposed BiFPN serves as the feature network, which takes level 3-7 features {P3, P4, P5, P6, P7} from the backbone network and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are fed to a class and box network to produce object class and bounding box predictions respectively. Similar to [24], the class and box network weights are shared across all levels of features.
The paper proposes a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of the backbone, BiFPN, class/box network, and resolution. Object detectors have many more scaling dimensions than the image classification models in [39], so grid search over all dimensions is prohibitively expensive. Therefore, we use a heuristic-based scaling approach, but still follow the main idea of jointly scaling up all dimensions.
Formally, BiFPN width and depth are scaled with the following equations:
$W_{bifpn} = 64 \cdot (1.35^{\phi})$, $D_{bifpn} = 3 + \phi$
Box/class prediction network: we fix their width to always be the same as BiFPN (i.e., $W_{pred} = W_{bifpn}$), but linearly increase the depth (#layers) using the equation:
$D_{box} = D_{class} = 3 + \lfloor \phi / 3 \rfloor$
Input image resolution: since feature levels 3-7 are used in BiFPN, the input resolution must be divisible by $2^7 = 128$, so we linearly increase resolutions using the equation:
$R_{input} = 512 + \phi \cdot 128$
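To see how a single coefficient drives every dimension, the scaling equations can be evaluated directly. The sketch below only plugs φ into the formulas as written in this summary; the function name and dictionary keys are hypothetical, and the raw BiFPN width is left unrounded (any rounding of channel counts used in practice is not modeled here).

```python
import math


def efficientdet_scaling(phi):
    """Evaluate the compound scaling equations for a given coefficient phi."""
    return {
        "w_bifpn": 64 * (1.35 ** phi),       # BiFPN width (channels), unrounded
        "d_bifpn": 3 + phi,                  # BiFPN depth (repeated layers)
        "d_head": 3 + math.floor(phi / 3),   # box/class network depth
        "r_input": 512 + phi * 128,          # input resolution in pixels
    }


for phi in range(4):
    config = efficientdet_scaling(phi)
    # Every resolution produced by the formula is a multiple of 128.
    print(phi, config, config["r_input"] % 128 == 0)
```

For φ = 0 this gives a width of 64, a BiFPN depth of 3, a head depth of 3, and a 512-pixel input. For φ = 3 the resolution grows to 896 and the head depth to 4.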
Conclusion: The EfficientDet paper gives us an improved and scalable architecture for object detection and semantic segmentation that runs on a single backbone network to reduce computational overhead. It achieves SOTA results while also scaling down to resource-constrained devices. The paper contributes BiFPN, which adds learnable fusion weights, and compound scaling, which together increase accuracy. Fast normalized fusion is another contribution of this paper, giving results similar to softmax-based fusion at lower compute cost.