This note lists the contributions of several CVPR 2021 papers in order to look for recurring patterns.

Summary of Contributions from Outstanding Papers

CVPR (2021) - Human Pose Estimation

We argue that the representations for regressing the positions of the keypoints accurately need to focus on the keypoint regions. (A New Observation)
The proposed DEKR approach learns disentangled representations through two simple schemes, adaptive convolutions and a multi-branch structure, so that each representation focuses on one keypoint region and the keypoint position predicted from that representation is accurate. (Advantages, Proposed Network)
The proposed direct regression approach outperforms keypoint detection and grouping schemes and achieves new state-of-the-art bottom-up pose estimation results on the benchmark datasets COCO and CrowdPose. (Experimental Results - SOTA)
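To make "multi-branch, disentangled direct regression" concrete, here is a minimal sketch, not DEKR's implementation. It assumes a hypothetical layout in which the feature vector at a person center is split into one group per keypoint, each branch is a linear head (`wx`, `wy`) that sees only its own group, and each keypoint position is recovered as center plus regressed offset. The adaptive convolutions are omitted; all names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def multi_branch_offsets(
    features: Sequence[float],
    num_keypoints: int,
    weights: Sequence[Tuple[Sequence[float], Sequence[float]]],
) -> List[Point]:
    """Each keypoint branch reads only its own slice of the features.

    Hypothetical layout: features are split into num_keypoints equal groups,
    and branch k is a linear head (wx, wy) applied to group k only.
    """
    group = len(features) // num_keypoints
    offsets: List[Point] = []
    for k in range(num_keypoints):
        chunk = features[k * group:(k + 1) * group]
        wx, wy = weights[k]
        dx = sum(f * w for f, w in zip(chunk, wx))
        dy = sum(f * w for f, w in zip(chunk, wy))
        offsets.append((dx, dy))
    return offsets


def decode_keypoints(center: Point, offsets: Sequence[Point]) -> List[Point]:
    """Direct regression: keypoint position = person center + regressed offset."""
    cx, cy = center
    return [(cx + dx, cy + dy) for dx, dy in offsets]


offsets = multi_branch_offsets(
    [1.0, 2.0, 3.0, 4.0], 2,
    [([1.0, 0.0], [0.0, 1.0]), ([1.0, 1.0], [0.0, 0.0])],
)
print(decode_keypoints((10.0, 20.0), offsets))  # [(11.0, 22.0), (17.0, 20.0)]
```

Because branch 0 never reads the features of branch 1, changing one group of features moves only the corresponding keypoint, which is the intuition behind "each representation focuses on one keypoint region".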

We simply apply shuffle blocks to HRNet, leading to a lightweight network, naive Lite-HRNet. We empirically show superior performance over MobileNet, ShuffleNet, and Small HRNet. (Proposed Network)
We present an improved efficient network, Lite-HRNet. The key point is that we introduce an efficient conditional channel weighting unit to replace the costly 1 × 1 convolution in shuffle blocks, and the weights are computed across channels and resolutions. (Proposed Network)
Lite-HRNet achieves the state-of-the-art complexity/accuracy trade-off on COCO and MPII human pose estimation and generalizes easily to the semantic segmentation task. (Experimental Results - SOTA)
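The saving from replacing a 1 × 1 convolution with channel weighting can be checked with simple arithmetic. A 1 × 1 convolution mixes all C input channels into each of C output channels at every pixel (H·W·C·C multiply-accumulates), while channel weighting multiplies each channel by one scalar (H·W·C multiplies), so the ratio is C. The sketch below assumes a hypothetical 64 × 48 × 40 feature map and ignores the (small) cost of computing the weights; it does not reproduce Lite-HRNet's weight function.

```python
from typing import List, Sequence


def pointwise_conv_macs(h: int, w: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulates of a 1x1 conv: each output channel mixes all input channels at every pixel."""
    return h * w * c_in * c_out


def channel_weighting_macs(h: int, w: int, c: int) -> int:
    """Multiplies of element-wise channel weighting: one scalar per channel per pixel.

    The cost of computing the weights themselves is ignored here.
    """
    return h * w * c


def apply_channel_weights(
    feature_map: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[List[float]]:
    """Scale every channel (stored as a flattened H*W list) by its own weight."""
    return [[v * wt for v in channel] for channel, wt in zip(feature_map, weights)]


h, w, c = 64, 48, 40  # hypothetical feature-map size
print(pointwise_conv_macs(h, w, c, c) // channel_weighting_macs(h, w, c))  # 40, i.e. a factor of C
print(apply_channel_weights([[1.0, 2.0], [3.0, 4.0]], [0.5, 2.0]))  # [[0.5, 1.0], [6.0, 8.0]]
```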

We propose a regression-based human pose recognition method by building cascade Transformers on top of a general-purpose object detector, the end-to-end object detection Transformer (DETR). Our method, named pose recognition Transformer (PRTR), uses the tokenized representation in Transformers, with layers of self-attention, to jointly model the spatial and appearance information of the keypoints. (Proposed Network, Summarize)
Two types of cascade Transformers have been developed: 1) a two-stage one, in which the second Transformer takes image patches detected by the first Transformer, as shown in Figure 2; and 2) a sequential one that uses a spatial Transformer network (STN) to create an end-to-end framework, as shown in Figure 3. (Proposed Network, Respective, Detail)
We visualize the distribution of keypoint queries in various aspects to unfold the internal process of the Transformer for the gradual refinement of the detection. (Explainability)
The results did not reach SOTA, so they are not listed among the contributions.
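For the sequential variant, the cropping step has to be differentiable, which is what an STN provides. A common way to parameterize such a crop, assumed here and not taken from PRTR, is a 2 × 3 affine matrix θ that maps output-grid coordinates in [-1, 1] to input-image coordinates in [-1, 1] (the convention of PyTorch's `affine_grid`), with pixel coordinates measured from the image's left/top edge. The function names are hypothetical.

```python
from typing import List, Tuple


def box_to_affine_theta(
    cx: float, cy: float, bw: float, bh: float, img_w: float, img_h: float
) -> List[List[float]]:
    """Affine matrix whose sampling grid covers a (bw x bh) box centered at (cx, cy), in pixels.

    Scale = box size / image size; translation = box center in normalized [-1, 1] coordinates.
    """
    sx = bw / img_w
    sy = bh / img_h
    tx = 2.0 * cx / img_w - 1.0
    ty = 2.0 * cy / img_h - 1.0
    return [[sx, 0.0, tx], [0.0, sy, ty]]


def apply_theta(theta: List[List[float]], x: float, y: float) -> Tuple[float, float]:
    """Map one output-grid point (x, y) in [-1, 1] to normalized input coordinates."""
    return (theta[0][0] * x + theta[0][1] * y + theta[0][2],
            theta[1][0] * x + theta[1][1] * y + theta[1][2])


def to_pixels(xn: float, yn: float, img_w: float, img_h: float) -> Tuple[float, float]:
    """Convert normalized [-1, 1] coordinates back to pixels."""
    return ((xn + 1.0) * img_w / 2.0, (yn + 1.0) * img_h / 2.0)


theta = box_to_affine_theta(50.0, 50.0, 20.0, 40.0, 100.0, 100.0)
corner = apply_theta(theta, -1.0, -1.0)
print(to_pixels(*corner, 100.0, 100.0))  # (40.0, 30.0): top-left corner of the box
```

Since every entry of θ is a smooth function of the box, gradients can flow from the second Transformer back into the box predicted by the first, which is what makes the sequential design end-to-end.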

To the best of our knowledge, this is the first paper that focuses on the problems in heatmap regression when tackling large variance of human scales and labeling ambiguities. We attempt to alleviate these problems by scale and uncertainty prediction. (New Problem)
We propose a scale-adaptive heatmap regression (SAHR), which can adaptively adjust the standard deviation of the Gaussian kernel for each keypoint, enabling the model to be more tolerant of various human scales and labeling ambiguities. (Proposed Network, Main)
We propose a weight-adaptive heatmap regression (WAHR) to alleviate the severe imbalance between foreground and background samples. It could automatically focus more on relatively harder examples and fully exploit the superiority of SAHR. (Another Work)
Our model outperforms the state-of-the-art model by 1.5 AP and achieves 72.0 AP on COCO test-dev2017, which is comparable with the performances of most top-down methods. (Experimental Results - SOTA)
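The two ideas above can be illustrated on a toy heatmap. The sketch assumes that the adaptive standard deviation is a base σ multiplied by a predicted per-keypoint scale factor, so a larger scale yields a wider, more tolerant target. For the weighting, it uses a hypothetical "hardness" weight (each pixel's squared error is weighted by its absolute error, so poorly fit pixels dominate); this only conveys the spirit of WAHR and is not the paper's formula.

```python
import math
from typing import List, Sequence


def gaussian_heatmap(width: int, height: int, kx: float, ky: float, sigma: float) -> List[List[float]]:
    """Standard Gaussian target centered on the keypoint (kx, ky); rows are y, columns are x."""
    return [[math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def scale_adaptive_heatmap(width: int, height: int, kx: float, ky: float,
                           base_sigma: float, scale: float) -> List[List[float]]:
    """Assumption: adaptive std = base_sigma * predicted per-keypoint scale."""
    return gaussian_heatmap(width, height, kx, ky, base_sigma * scale)


def hardness_weighted_mse(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Illustrative weighting (not the WAHR formula): weight each squared error by its absolute error."""
    total, count = 0.0, 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            err = p - t
            total += abs(err) * err ** 2
            count += 1
    return total / count


narrow = scale_adaptive_heatmap(5, 5, 2, 2, base_sigma=1.0, scale=1.0)
wide = scale_adaptive_heatmap(5, 5, 2, 2, base_sigma=1.0, scale=2.0)
print(narrow[2][2], wide[2][2])   # 1.0 1.0 (peak at the keypoint)
print(narrow[2][3] < wide[2][3])  # True: the larger scale spreads the target
print(hardness_weighted_mse([[0.5, 0.0]], [[1.0, 0.0]]))  # 0.0625
```

Because the background pixels of a heatmap vastly outnumber the foreground ones, an unweighted loss is dominated by easy background pixels; any error-dependent weighting shifts the emphasis toward the few pixels the model still gets wrong.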

We introduce a conceptually simple but effective method to learn 2D human-interpretable keypoints based on transforming a single manually defined 2D template. (Proposed Method)
Our proposed approach is capable of performing 2D human pose estimation without any additional need for labeled data, either paired or unpaired. (Advantages)
We demonstrate the high adaptability of our approach by evaluating it on benchmark data and in the wild on a challenging infant pose estimation dataset. (Experimental Results - SOTA)
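The core object in this approach is a single, manually defined 2D template of keypoints that is transformed to fit each image. The sketch below only shows the geometric step, applying a transformation to the template's keypoints. It assumes a hypothetical three-point template and a single global affine transform for simplicity; how the paper predicts and parameterizes the transformation, and how it is trained without labels, is not reproduced here.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]
Matrix2 = Tuple[Tuple[float, float], Tuple[float, float]]

# Hypothetical 2D template: named keypoints in template coordinates.
TEMPLATE: Dict[str, Point] = {"head": (0.0, 2.0), "neck": (0.0, 1.0), "pelvis": (0.0, -1.0)}


def transform_template(template: Dict[str, Point], a: Matrix2, t: Point) -> Dict[str, Point]:
    """Apply p' = A p + t to every template keypoint."""
    (a11, a12), (a21, a22) = a
    tx, ty = t
    return {name: (a11 * x + a12 * y + tx, a21 * x + a22 * y + ty)
            for name, (x, y) in template.items()}


pose = transform_template(TEMPLATE, ((2.0, 0.0), (0.0, 2.0)), (1.0, 1.0))
print(pose["head"])  # (1.0, 5.0)
```

Because every predicted pose is a transformed copy of the same template, the keypoints keep their human-interpretable names (head, neck, ...) without any labeled data.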

We propose three robustness benchmarks, COCO-C, MPII-C, and OCHuman-C, and demonstrate that both top-down and bottom-up pose estimators suffer a severe performance drop on corrupted images, drawing the community's attention to this problem. (New Benchmarks)
With extensive experiments, we draw several interesting conclusions that could help improve the accuracy and robustness of future work. (New Observations)
We propose a novel adversarial data augmentation method combined with knowledge distillation, termed AdvMix, which is model-agnostic and easy to implement. It significantly improves the robustness of pose estimation models while maintaining or slightly improving performance on clean data, without extra computational overhead at inference. (Proposed Method)
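The two ingredients named above can be sketched separately on flattened toy "images" and heatmaps. The assumptions are: several augmented copies of an image are blended with convex weights from a softmax over logits (in an adversarial setup a generator would choose these logits to make the pose model fail, while the pose model learns to resist them), and knowledge distillation is a mean-squared error between student and teacher heatmaps. AdvMix's actual generator, corruption set, and loss weighting are not reproduced; all function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax, giving convex mixing weights."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def mix_augmentations(images: Sequence[Sequence[float]], logits: Sequence[float]) -> List[float]:
    """Blend augmented copies of one image with softmax(logits) weights, pixel by pixel."""
    w = softmax(logits)
    return [sum(wi * img[j] for wi, img in zip(w, images)) for j in range(len(images[0]))]


def distillation_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean-squared error between student and teacher heatmaps."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


mixed = mix_augmentations([[0.0, 0.0], [2.0, 4.0]], logits=[0.0, 0.0])
print(mixed)                                   # [1.0, 2.0]
print(distillation_loss([1.0, 2.0], [1.0, 4.0]))  # 2.0
```

Both pieces act only during training: at inference the model sees the raw image, which is consistent with the claim of no extra inference overhead.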
