Abstract
https://arxiv.org/pdf/2311.02274v1
Early object detection (OD) is a crucial task for the safety of many dynamic systems. Current OD algorithms have limited success for small objects at a long distance. To improve the accuracy and efficiency of such a task, we propose a novel set of algorithms that divide the image into patches, select patches with objects at various scales, elaborate the details of a small object, and detect it as early as possible. Our approach is built upon a transformer-based network and integrates a diffusion model to improve detection accuracy. As demonstrated on BDD100K, our algorithms enhance the mAP for small objects from 1.03 to 8.93, and reduce the data volume in computation by more than 77%. The source code is available at https://github.com/destiny301/dpr
1. Introduction
Object detection (OD) plays a vital role in numerous real-world applications, such as autonomous driving and robotics. Despite the proliferation of diverse algorithms for this task, existing methods still face significant challenges in early object detection, a crucial capability enabling prompt and proactive decision-making. In such scenarios, objects in captured images are often significantly reduced in size due to long distances. As illustrated in Fig. 1, when images contain only a limited number of object pixels, the performance of object detection deteriorates significantly due to insufficient data volume.
To address this challenge, we can exploit super-resolution (SR) algorithms to reconstruct higher-resolution images, thereby augmenting the data available for subsequent object detection models. SR is a classic problem in computer vision, with a plethora of solutions tailored for this task. Recently, diffusion models such as DDPM [13] have showcased remarkable capabilities in image generation and demonstrated greater stability than generative adversarial networks (GAN) [10]. Moreover, research applying Conditional Diffusion Models (CDM) [14, 30] to SR has yielded notable advancements. By using diffusion models for high-resolution image generation, we can achieve substantial improvements in object detection performance, particularly for datasets with a low object-to-image ratio. However, diffusion models come with a significant computational cost, which poses a challenge for real-world applications like autonomous driving. As the image example in Fig. 1 illustrates, holistic refinement of the image places a considerable computational burden on background pixels, wasting resources that contribute nothing meaningful to OD.
In this paper, we introduce a novel algorithm, named Dichotomized Patch Refinement (DPR), to tackle the aforementioned problem. DPR leverages CDM to exclusively reconstruct patches that contain objects, employing a Patch-Selector module for accurate patch classification. While directly localizing small objects presents considerable challenges, discerning the presence or absence of objects within patches is a more feasible task. The Patch-Selector module efficiently filters out irrelevant patches that do not contribute to the subsequent OD task. This strategy significantly reduces the data volume, enabling the immediate generation of refined images with CDM to greatly enhance object detection accuracy. To implement the module, we devise a hierarchical patch encoder inspired by the structure of the Swin Transformer [22] to extract embeddings for individual patches. Furthermore, we incorporate a patch classifier through the introduction of a classification token, following an approach similar to ViT [8]. In line with our network's hierarchical structure, we also introduce a pyramid patch class label to ensure an ample inclusion of positive patches. Our experiments, conducted on the BDD100K dataset, provide compelling evidence of DPR's efficiency and accuracy for early object detection.
To summarize, our key contributions are as follows:
- We design a Patch-Selector module, incorporating the attention mechanism, to effectively sift out the patches containing objects from images. Moreover, we introduce a hierarchical architecture and employ a pyramid loss function to further improve the selection process.
- By harnessing the capabilities of Conditional Diffusion Models (CDM), we refine only the selected patches, yielding enhanced performance in object detection.
- By enlarging negative patches with interpolation, we seamlessly combine all processed patches to form complete images. Through comprehensive experiments on both patches and entire images, we demonstrate that our DPR achieves competitive early object detection performance with a 77.2% reduction in computation.
2. Related Work
2.1. Diffusion Models for Image SR
Initially, ConvNets gained prominence in image super-resolution [18, 23], particularly with the seminal work on the SRCNN model [7]. However, the introduction of generative adversarial networks (GAN) by Goodfellow et al. [10] revolutionized the field, offering unprecedented image generation capabilities. GAN-based SR methods [5, 16, 17, 19, 34] have since become prevalent. These techniques employ game theory-inspired competition between a generator and a discriminator to drive iterative improvements and generate high-quality images. Nonetheless, challenges related to training stability and model convergence persist in GAN-based SR methods.
In contrast, diffusion models [31] have demonstrated superior performance in image generation and exhibit enhanced stability. The introduction of DDPM by Ho et al. [13] has further popularized diffusion models in image generation, displacing the reliance on GANs. Additionally, recent research has focused on techniques for fast sampling [11, 15, 24, 25, 36, 37]. DDIM [32] accelerates the sampling process by $10\times$ to $50\times$ through a more efficient class of implicit probabilistic models. Given the remarkable performance of diffusion models in image generation, several studies have explored their application to SR by leveraging CDM. For instance, Saharia et al. proposed SR3 [30], which demonstrates improved SR performance based on CDM. Similarly, Ho et al. introduced Cascaded Diffusion Models [14], which further advance the field of SR.
2.2. Object Detection (OD)
Traditional methods for OD, such as Faster R-CNN [9], primarily rely on convolutional layers. The introduction of anchor boxes in Faster R-CNN, a two-stage OD algorithm, significantly transformed conventional methodologies. Consequently, numerous convolution-based methods, such as YOLO [2, 26-28] and Mask R-CNN [12], have emerged and continually improved OD performance.
Furthermore, the attention mechanism [33], first applied to image classification by ViT [8], has been widely adopted in various computer vision tasks, including OD. This is primarily due to the transformer's ability to model long-range dependencies. Carion et al. proposed DETR [3], which formulates OD as a direct set prediction problem and employs a transformer encoder-decoder network. DINO [4], introduced by Caron et al., leverages self-supervised learning to develop a new transformer network based on ViT. To reduce computation, Liu et al. proposed the Swin Transformer [21, 22], which incorporates a novel window-based self-attention mechanism. Inspired by BERT [6] in natural language processing, Bao et al. presented BEiT [1] for computer vision applications.
3. Methodology
As illustrated in Fig. 2, DPR comprises three crucial modules: Patch-Selector, Patch-Refiner, and Patch-Organizer. The Patch-Selector module extracts patch features and performs classification. Subsequently, the Patch-Refiner module elaborates the positive patches, leveraging CDM to reconstruct them at a higher resolution and thereby enhance object detection precision. Lastly, to fully demonstrate the efficiency and accuracy of our proposed method, we employ inexpensive interpolation techniques to enlarge the negative patches and organize all patches into entire images, facilitating a direct comparison with the original images. In this section, we discuss all the modules in detail and outline the training procedure of DPR, which is presented in Algorithm 1. Algorithm 2 elucidates the sampling and testing processes.
3.1. Patch-Selector
Network architecture. This module splits the image into an $8 \times 8$ grid of non-overlapping patches and classifies each patch to determine whether it contains objects. Specifically, as depicted in Fig. 3, the input image $\boldsymbol{I}_{in} \in \mathbb{R}^{H_{in} \times W_{in} \times C_{in}}$ (where $H_{in}$, $W_{in}$, and $C_{in}$ are the input image height, width, and number of channels) passes through a hierarchical patch encoder comprising multiple Transformer Layers (TL). This process generates patch representations at three different scales, namely $\boldsymbol{r}_1 \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 2C}$, $\boldsymbol{r}_2 \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 4C}$, and $\boldsymbol{r}_3 \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 8C}$, as given by the following equations:
$$
\begin{aligned}
\boldsymbol{r}_1 &= TL_1\left(TL_0\left(EL\left(\boldsymbol{I}_{in}\right)\right)\right) \\
\boldsymbol{r}_2 &= TL_2\left(\boldsymbol{r}_1\right) \\
\boldsymbol{r}_3 &= TL_3\left(\boldsymbol{r}_2\right)
\end{aligned}
$$
where $TL_i$ denotes the $i$-th Transformer Layer, and $EL$ is the embedding layer at the beginning of the network. $H$ and $W$ depend on the patch size.
Our TL is similar to the Swin Transformer structure and consists of three components: a feature merging layer for representation down-sampling, a window-based multi-head self-attention block (W-MSA), and a shifted window-based multi-head self-attention block (SW-MSA) that captures information across windows. Specifically, W-MSA splits the input feature into $n \times n$ non-overlapping windows, where $n$ depends on the window size and feature size, and captures contextual information within each window. Because W-MSA considers only connections within each window, it can miss connections across windows. To address this limitation, SW-MSA shifts the feature by half the window size before partitioning, enabling cross-window connections.
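As an illustration, the following is a minimal PyTorch sketch of window partitioning and the cyclic shift used by SW-MSA; the tensor shapes and function names are ours, not the paper's released code.

```python
# A minimal sketch of window partitioning (W-MSA) and the cyclic shift
# used by SW-MSA; names and shapes are illustrative.
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (B * n * n, ws, ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # (B, nH, nW, ws, ws, C) -> one row per window
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def shift_for_swmsa(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Cyclically shift the feature map by half the window size, so the
    subsequent partition creates windows straddling the previous borders."""
    s = window_size // 2
    return torch.roll(x, shifts=(-s, -s), dims=(1, 2))

# Example: a 56x56 feature map with 7x7 windows yields 8*8 = 64 windows.
feat = torch.randn(1, 56, 56, 96)
wins = window_partition(feat, 7)                              # for W-MSA
wins_shifted = window_partition(shift_for_swmsa(feat, 7), 7)  # for SW-MSA
```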
Specifically, the features $\boldsymbol{r}_1$, $\boldsymbol{r}_2$, and $\boldsymbol{r}_3$ correspond to patches of size $\frac{2H_{in}}{H} \times \frac{2W_{in}}{W}$, $\frac{4H_{in}}{H} \times \frac{4W_{in}}{W}$, and $\frac{8H_{in}}{H} \times \frac{8W_{in}}{W}$, respectively. To classify these patches, we compute the cross-attention with the learnable classification token, denoted as $\boldsymbol{c}$. The computation can be expressed as follows:
$$
\begin{aligned}
\boldsymbol{Q}_i &= \boldsymbol{r}_i W_i^q, \quad \boldsymbol{K}_i = \boldsymbol{c} W_i^k, \quad \boldsymbol{V}_i = \boldsymbol{c} W_i^v \quad \forall i \in \{1, 2, 3\} \\
\boldsymbol{A}_i &= \operatorname{softmax}\left(\frac{\boldsymbol{Q}_i \boldsymbol{K}_i^T}{\sqrt{d}}\right) \boldsymbol{V}_i \quad \forall i \in \{1, 2, 3\}
\end{aligned}
$$
where $W_i^q$, $W_i^k$, and $W_i^v$ are the linear-layer weights for the query, key, and value matrices.
Next, the features are passed through a multi-layer perceptron (MLP) and a softmax layer to predict the class for each patch as follows,
$$
\boldsymbol{s}_i = \operatorname{softmax}\left(MLP_i\left(\boldsymbol{A}_i\right)\right) \quad \forall i \in \{1, 2, 3\}
$$
where $MLP_i$ denotes the output layer for the $i$-th feature embeddings.
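To make the computation concrete, here is a minimal single-head PyTorch sketch of this classifier. It assumes the classification token $\boldsymbol{c}$ may hold several learnable vectors (with a single vector, the softmax over keys becomes trivial); the class and variable names are illustrative, not the authors' implementation.

```python
# A minimal single-head sketch of the cross-attention patch classifier.
# The patch embeddings r supply the queries; the learnable token c supplies
# the keys and values, following the equations above.
import math
import torch
import torch.nn as nn

class PatchClassifier(nn.Module):
    def __init__(self, dim: int, num_tokens: int = 4, num_classes: int = 2):
        super().__init__()
        self.c = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)  # token c
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.mlp = nn.Linear(dim, num_classes)

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        # r: (B, N, dim) flattened patch embeddings from one scale
        q = self.w_q(r)                                    # Q_i = r_i W_q
        k = self.w_k(self.c)                               # K_i = c W_k
        v = self.w_v(self.c)                               # V_i = c W_v
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(r.shape[-1]),
                             dim=-1)                       # (B, N, num_tokens)
        a = attn @ v                                       # A_i: (B, N, dim)
        return torch.softmax(self.mlp(a), dim=-1)          # s_i: (B, N, 2)

# Example: scores = PatchClassifier(dim=192)(torch.randn(2, 1024, 192))
```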
In accordance with the network structure, we introduce a pyramid label comprising $\boldsymbol{y}_1 \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 1}$, $\boldsymbol{y}_2 \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 1}$, and $\boldsymbol{y}_3 \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 1}$, to supervise the training of the Patch-Selector module by assigning positive labels to the patches that contain objects.
To minimize information loss, the final prediction is derived by up-sampling the predictions from all three scales to the same size with bilinear interpolation and selecting the per-patch maximum, ensuring the retention of a greater number of positive patches.
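A minimal sketch of this aggregation step follows, assuming each scale's output has been reshaped into a 2-D score map; the common grid size is a parameter here, matching the $8 \times 8$ patch grid the paper aggregates to.

```python
# A minimal sketch of multi-scale aggregation: each per-scale foreground
# probability map is resized to a common patch grid with bilinear
# interpolation, and the element-wise maximum keeps any patch that is
# flagged positive at any scale. Shapes and names are illustrative.
import torch
import torch.nn.functional as F

def aggregate_scores(score_maps, grid=(8, 8)):
    # score_maps: list of (B, 1, h_i, w_i) foreground probabilities per scale
    resized = [F.interpolate(s, size=grid, mode="bilinear", align_corners=False)
               for s in score_maps]
    agg = resized[0]
    for s in resized[1:]:
        agg = torch.max(agg, s)          # per-patch maximum over scales
    return agg                           # (B, 1, 8, 8) final selection scores
```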
Loss function. The loss for each patch is computed using cross-entropy. To incorporate predictions from the hierarchical network at three scales and reduce false negative (FN) predictions, we introduce a combined loss formulated as a weighted sum of the individual losses:
$$
\mathcal{L}_P = \sum_{i=1}^{3}\left(-\boldsymbol{y}_i \log\left(\boldsymbol{s}_i\right) - \beta\left(1-\boldsymbol{y}_i\right)\log\left(1-\boldsymbol{s}_i\right)\right)
$$
where $\beta$ is a hyper-parameter that adjusts the weight of the negative term; we set it to 0.01 in our experiments, so that missing a positive patch costs far more than keeping a negative one.
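Under the assumption that $\boldsymbol{s}_i$ are foreground probabilities and $\boldsymbol{y}_i$ binary maps of matching shape, this loss can be sketched as:

```python
# A minimal sketch of the pyramid loss: a weighted binary cross-entropy
# summed over three scales. beta down-weights the background term so that
# false negatives are penalized much more heavily than false positives.
import torch

def pyramid_loss(scores, labels, beta: float = 0.01, eps: float = 1e-8):
    # scores, labels: lists of three tensors of matching shape, values in [0, 1]
    loss = torch.zeros(())
    for s, y in zip(scores, labels):                      # three scales
        pos = -(y * torch.log(s + eps)).mean()            # object patches
        neg = -((1 - y) * torch.log(1 - s + eps)).mean()  # background patches
        loss = loss + pos + beta * neg
    return loss
```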
3.2. Patch-Refiner
Depending on the patch class, different refinement approaches are employed. Positive patches are reconstructed with finer details by the conditional diffusion model (CDM). Conversely, negative patches are scaled up in the Enlarge module using simpler up-sampling methods, such as bilinear interpolation (BI).
CDM. Diffusion models consist of a forward process that progressively corrupts the input data over $T$ timesteps by repeatedly adding Gaussian noise, and a reverse process that restores the original data from the final corrupted data. For CDM, the reconstruction of the corrupted data is conditioned on an additional signal related to the original data, such as a lower-resolution image in the context of super-resolution (SR).
Let $\boldsymbol{z} \in \mathbb{R}^{H_p \times W_p \times C_p}$ (where $H_p$, $W_p$, and $C_p$ are the patch height, width, and number of channels) denote the low-resolution patches we obtain from the Patch-Selector module, while $\boldsymbol{x}_0 \in \mathbb{R}^{8H_p \times 8W_p \times C_p}$ is the high-resolution data. The forward process of our CDM adds Gaussian noise to $\boldsymbol{x}_0$ over $T$ steps as follows:
$$
\begin{aligned}
q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-1}\right) &= \mathcal{N}\left(\boldsymbol{x}_t; \sqrt{1-\beta_t}\,\boldsymbol{x}_{t-1}, \beta_t \mathbf{I}\right) \\
q\left(\boldsymbol{x}_{1:T} \mid \boldsymbol{x}_0\right) &= \prod_{t=1}^{T} q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_{t-1}\right) \\
q\left(\boldsymbol{x}_t \mid \boldsymbol{x}_0\right) &= \mathcal{N}\left(\boldsymbol{x}_t; \sqrt{\overline{\alpha}_t}\,\boldsymbol{x}_0, \left(1-\overline{\alpha}_t\right)\mathbf{I}\right)
\end{aligned}
$$
where $\alpha_{1:T}$ and $\beta_{1:T}$ are hyper-parameters subject to $0 < \alpha_t < 1$ and $\alpha_t + \beta_t = 1$, and $\overline{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. They determine the variance of the noise added at each step. $\overline{\alpha}_T$ should be small enough that the final signal $\boldsymbol{x}_T$ produced by the forward process is approximately standard Gaussian noise.
To gradually recover the original data from the final noise, the CDM model $f_\phi(\boldsymbol{z}, \tilde{\boldsymbol{x}}_t, t)$ is trained to predict the noise added at each step, given the low-resolution image $\boldsymbol{z}$, the noisy image $\tilde{\boldsymbol{x}}_t$, and $t$, where the noisy image at timestep $t$ is obtained as:
$$
\tilde{\boldsymbol{x}}_t = \sqrt{\overline{\alpha}_t}\,\boldsymbol{x}_0 + \sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
$$
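The training side of this process can be sketched as follows, assuming a linear $\beta$ schedule (the paper does not specify one) and an $f_\phi$ that accepts $(\boldsymbol{z}, \tilde{\boldsymbol{x}}_t, t)$; this is an illustrative reading of the equations, not the released implementation.

```python
# A minimal sketch of the forward corruption and the denoising objective.
# The linear beta schedule is an assumption (as in DDPM); f_phi is the
# conditional noise predictor taking (z, x_t, t).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # \bar{alpha}_t

def q_sample(x0: torch.Tensor, t: int):
    """Sample x_t ~ q(x_t | x_0) in closed form and return the noise used."""
    eps = torch.randn_like(x0)
    x_t = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps
    return x_t, eps

def training_loss(f_phi, z: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
    """One training step: f_phi must recover the injected noise."""
    t = int(torch.randint(0, T, (1,)))
    x_t, eps = q_sample(x0, t)
    return torch.mean((f_phi(z, x_t, t) - eps) ** 2)
```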
In the reverse sampling process, the model recovers the high-resolution patch $\boldsymbol{x}_0$ from $\boldsymbol{x}_T$ conditioned on $\boldsymbol{z}$ with the following equations:
$$
p_\phi\left(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{z}\right) = \mathcal{N}\left(\boldsymbol{x}_{t-1}; \mu_\phi\left(\boldsymbol{z}, \tilde{\boldsymbol{x}}_t, t\right), \sigma_t^2 \mathbf{I}\right)
$$
We set the variance $\sigma_t^2$ to $\beta_t$, and compute the mean from the noise estimated by the CDM model as follows:
$$
\mu_\phi\left(\boldsymbol{z}, \tilde{\boldsymbol{x}}_t, t\right) = \frac{1}{\sqrt{\alpha_t}}\left(\boldsymbol{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\, f_\phi\left(\boldsymbol{z}, \tilde{\boldsymbol{x}}_t, t\right)\right)
$$
Finally, each step of the iterative elaboration process is computed as:
$$
\boldsymbol{x}_{t-1} \leftarrow \frac{1}{\sqrt{\alpha_t}}\left(\boldsymbol{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\, f_\phi\left(\boldsymbol{z}, \tilde{\boldsymbol{x}}_t, t\right)\right) + \sqrt{\beta_t}\,\boldsymbol{\epsilon}_t
$$
where $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
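A minimal sketch of this sampling loop, reusing the schedule from the previous sketch, is shown below; it is an illustrative reading of the update rule, not the released implementation.

```python
# A minimal sketch of the reverse (sampling) loop, reusing the
# betas / alphas / alpha_bar schedule defined above; f_phi is the trained
# conditional noise predictor.
import torch

@torch.no_grad()
def sample(f_phi, z: torch.Tensor, shape) -> torch.Tensor:
    x_t = torch.randn(shape)                       # x_T: pure Gaussian noise
    for t in reversed(range(T)):
        eps_hat = f_phi(z, x_t, t)                 # predicted noise
        mean = (x_t - (1.0 - alphas[t]) / (1.0 - alpha_bar[t]).sqrt()
                * eps_hat) / alphas[t].sqrt()
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + betas[t].sqrt() * noise       # sigma_t^2 = beta_t
    return x_t                                     # reconstructed x_0
```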
Enlarge. In real-world applications, we discard all negative patches since they do not contribute to the subsequent object detection (OD) task. However, this approach compromises the integrity of the dataset labels, which in turn affects the fairness of experimental comparisons. To ensure a fair evaluation and demonstrate the effectiveness of our approach, we scale these negative patches using bilinear interpolation (BI), nearest-neighbor interpolation, or bicubic interpolation, matching them to the same resolution as the positive patches.
3.3. Patch-Organizer
This module combines all the refined positive and enlarged negative patches according to their original locations (i.e., their x- and y-indices in the Patch-Selector output), generating entire images that provide further evidence of the gains achieved by our DPR algorithm at reduced computational cost.
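A combined sketch of the Enlarge and Patch-Organizer steps is given below, assuming an $8 \times 8$ grid of $16 \times 16$ patches in row-major order and abstracting CDM refinement behind a `refine_fn`; the names and layout are illustrative.

```python
# A minimal combined sketch of Enlarge and Patch-Organizer: positive
# patches go through the CDM (refine_fn), negative patches are enlarged
# with cheap bilinear interpolation, and all patches are reassembled at
# their original grid locations into one full image.
import torch
import torch.nn.functional as F

def organize(patches: torch.Tensor, positive: torch.Tensor, refine_fn,
             grid: int = 8, scale: int = 8) -> torch.Tensor:
    # patches: (grid*grid, C, 16, 16) row-major; positive: (grid*grid,) bool
    out = []
    for p, is_pos in zip(patches, positive):
        p = p.unsqueeze(0)
        if is_pos:
            out.append(refine_fn(p))                         # CDM refinement
        else:
            out.append(F.interpolate(p, scale_factor=scale,  # Enlarge module
                                     mode="bilinear", align_corners=False))
    out = torch.cat(out)                                     # (64, C, 128, 128)
    C, hp, wp = out.shape[1:]
    out = out.view(grid, grid, C, hp, wp).permute(2, 0, 3, 1, 4)  # rows, cols
    return out.reshape(C, grid * hp, grid * wp)              # full 1024x1024 image
```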
4. Experiments
4.1. Dataset and Training Details
As described in Sec. 1, we evenly partition the BDD100K dataset [35] into several subsets based on the ratio of object pixels, and we select a subset with a small ratio to simulate the early detection scenario, where distant objects are typically small. Our algorithm primarily focuses on enhancing OD performance for the subset with the longest distance, named FBDD, which consists of images with an object pixel ratio of less than 1.5%. We select another subset, named NBDD, which contains larger objects with a foreground pixel ratio ranging from 15% to 23%, for model fine-tuning. Both subsets contain around 4000 training images and about 1000 validation images with an original size of $1280 \times 720$.
For Patch-Selector optimization, we resize all images to $1024 \times 1024$ before inputting them to the model. The first embedding layer uses a kernel and stride size of 16, with 96 channels. We set the learning rates to 0.001 for the convolution-based network and 0.00001 for the attention-based network. For each TL, the depth, window size, and number of attention heads are set to 2, 7, and 3, respectively. To align with the hierarchical network structure, we introduce a pyramid label that encompasses three scales: $8 \times 8$, $16 \times 16$, and $32 \times 32$. The patch selection results from the three scales are then aggregated to an output resolution of $8 \times 8$. Once this module is optimized, input images are resized to $128 \times 128$ or $64 \times 64$ for training.
We mainly train the CDM to upscale $16 \times 16$ patches to $128 \times 128$ over 1000 timesteps for OD evaluation, although our results show that it also performs well for larger-resolution reconstruction. The network architecture is based on U-Net [29], with parameters similar to SR3 [30]. We conduct OD testing using YOLOv8, a state-of-the-art OD algorithm. Experiments are run on two NVIDIA A6000 GPUs.
4.2. CDM for Patch Refinement
We perform an extensive evaluation of the CDM for patch refinement, comparing its performance against BI. In Tab. 1, we present the results for four different scales with BI or CDM. In all cases, the high-resolution output is $128 \times 128$, given low-resolution input patches of various sizes. We evaluate the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM), which measure the similarity between the generated patches and the ground-truth high-resolution patches. Additionally, we employ the Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) to compare the features extracted from these patches for image classification. Furthermore, we measure the mean Average Precision (mAP), the evaluation metric for our primary objective, OD. Generally, the patches generated by CDM outperform those from BI across all metrics. Although the $32 \times 32$ patches from BI exhibit a higher similarity to the original patches, as indicated by PSNR and SSIM, their features for image classification and OD are still inferior to those generated by CDM. The superiority of CDM is explicitly demonstrated by the mAP comparison in Fig. 4a.
In addition, we create new datasets for OD evaluation by gradually substituting the original high-resolution patches with the processed patches obtained from BI or CDM. As shown in Fig. 4b, the OD performance improves more markedly as the proportion of processed patches from CDM increases. This observation underscores the notable advantages offered by CDM.
4.3. Patch Selection
Architecture of Patch-Classifier. While we have established the viability of CDM for SR, the challenge lies in accurately selecting patches containing objects. Achieving superior performance in the subsequent OD task requires careful attention to the true positive rate (TPR) during the patch selection stage, as any irreversible information loss at this stage can severely degrade detection performance. To address this, we utilize multiple transformer layers as the encoder to generate patch embeddings. Our primary focus is the design of the Patch-Classifier module, which determines the presence of objects in each patch. The impact of the techniques adopted in the design of the Patch-Selector module is presented in Tab. 2. Initially, we employed convolution layers (Conv-C), which yielded satisfactory results on the NBDD dataset. However, by introducing a learnable token and utilizing cross-attention, we achieved even better performance (Attention-C). Moreover, by incorporating a hierarchical network structure and pyramid label (Attention-PC), we observed further improvements across all metrics, particularly in TPR. Convolution-based networks also benefited from the hierarchical structure and pyramid label (Conv-PC), but they were unable to match the performance of the attention-based networks.
Aggregation and pyramid loss. The results in the sixth row (Attention-AC) of Tab. 2 demonstrate that incorporating an aggregation block reduces information loss, as evidenced by the higher TPR. Furthermore, by modifying the loss function to place greater emphasis on positive patches, we observed further improvements in TPR, as shown in the seventh row (Attention-WC). With our final Patch-Selector architecture, we achieved a decent TPR for the FBDD dataset, as indicated in the last row of the table.
Model size. We explore different model sizes for the Patch-Selector module, using 4, 5, or 6 transformer layers. As shown in Tab. 3, a network with only 4 transformer layers achieves equivalent patch selection performance while reducing FLOPs to 5.01%.
4.4. Comparison of OD Performance
To fully demonstrate the merit of our approach, we detect objects both from individual patches (whose bounding box labels differ from those of the original image because of patch partitioning) and from reassembled entire images. As we scale the $16 \times 16$ patches to $128 \times 128$, we use the results obtained by directly feeding the $16 \times 16$ patches into the OD model as the baseline for patch-wise detection. We compare the performance of our DPR with this baseline as well as other methods. Additionally, since each image yields an $8 \times 8$ grid of patches, the entire image is scaled from $128 \times 128$ to $1024 \times 1024$. Similarly, we use the results on the low-resolution $128 \times 128$ images as the baseline for image-wise detection.
Patch-wise detection. Besides our approach, we generate high-resolution patches with two other methods, BI and SR3 [30], for comparison. BI simply scales up all patches to $128 \times 128$ using bilinear interpolation, while SR3 is a conditional diffusion model (CDM) based on DDPM that performs entire-image super-resolution (SR). The PP columns in Tab. 5 present the results when we feed only positive patches to OD, which is the approach we adopt in real-world applications. For BI and SR3, we assume that they select positive patches perfectly (i.e., TPR is 1). The PP columns show that our DPR performs best.
To address the potential unfairness of the previous experiments, we also evaluate OD with both positive and negative patches, shown in the AP columns. As mentioned in Sec. 3, the negative patches of DPR are enlarged with BI. Additionally, we conduct an experiment in which all negative patches are replaced with black patches to simulate their removal, denoted DPR-B. DPR achieves performance comparable to SR3 while refining significantly fewer patches per image (14.59 on average versus 64), highlighting the computational efficiency of our approach. Interestingly, DPR-B outperforms DPR, suggesting that the selection results of our Patch-Selector module contribute to OD. By excluding the negative patches, which may introduce noise and confusion, our approach focuses solely on the positive patches, leading to improved detection results.
Image-wise detection. Figure 5 shows a visual comparison of BI and our DPR after integrating patches. While the overall images generated by DPR appear similar to those from BI, the crucial patches containing objects exhibit finer details, indicating that only a small amount of data needs to be processed by CDM, leading to more efficient computation.
Quantitative results for OD are presented in Tab. 4. To compare with another SR method, SwinIR [20], and maintain consistency, we align our evaluation with SwinIR's setting, upscaling images from $64 \times 64$ to $512 \times 512$. We show the results on ground-truth high-resolution images in the table as the upper bound. SR3 performs much better than the transformer-based SwinIR. DPR achieves the highest mAP among all the methods, improving mAP from 0.194 to 4.33 in this setting, while DPR enhances mAP from 1.03 to 8.93 when upscaling images from $128 \times 128$ to $1024 \times 1024$.
Efficiency of our approach. To trade off computation and performance, we experiment with various thresholds for patch classification when upscaling images from $64 \times 64$ to $512 \times 512$ in Tab. 6. The second row, yielding an mAP of 4.33, stands out as the optimal choice, achieving a 63% computation reduction.
For FBDD up-sampling from $128 \times 128$ to $1024 \times 1024$ with the same threshold, our Patch-Selector module passes only 22.8% of the patches on to CDM generation and OD, and the FLOPs of the Patch-Selector are negligible compared to those of CDM, which means we save 77.2% of the computation compared to full-image generation, as demonstrated in Tab. 5.
5. Conclusion
In this paper, we propose a novel Dichotomized Patch Refinement (DPR) algorithm that efficiently enhances OD performance by selectively reconstructing high-resolution patches of images with conditional diffusion models. With a hierarchical transformer-based network and a pyramid loss function, positive patches containing objects are accurately located. With patch-wise CDM, low-resolution positive patches are significantly refined, thereby improving the performance of the subsequent OD task. Experimental results on the BDD100K dataset show that DPR effectively improves the mAP for early object detection from 1.03 to 8.93 with only 22.8% of the computation.
Acknowledgments
This work is supported by the Center for the Co-Design of Cognitive Systems (CoCoSys), one of seven centers in Joint University Microelectronics Program 2.0 (JUMP 2.0), a Semiconductor Research Corporation (SRC) program sponsored by the Defense Advanced Research Projects Agency (DARPA).