ComputerVision Jack


Objects as Points

JackYoon 2022. 4. 14. 12:29

Abstract

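Most high-performing object detectors enumerate a nearly exhaustive list of potential object locations and classify each one. This is wasteful and inefficient, and it requires additional post-processing.
The paper takes a different approach: it models an object as a single point, the center of its bounding box. The detector uses keypoint estimation to find center points and regresses all other object properties from them.
CenterNet, the center-point-based approach, is end-to-end differentiable, and it is simpler, faster, and more accurate than bounding-box-based detectors. It is competitive with sophisticated multi-stage methods while running in real time.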

Introduction

Current detectors represent each object by an axis-aligned bounding box that tightly encompasses it. They then reduce object detection to image classification over an extensive number of potential object bounding boxes.
One-stage detectors slide a set of possible bounding boxes, called anchors, over the image and classify them directly. Two-stage detectors recompute image features for each potential box and then classify those features.
A post-processing step called non-maxima suppression (NMS) removes duplicate detections by computing the IoU between bounding boxes. This post-processing is hard to differentiate and train, so most current detectors are not end-to-end trainable.
The paper represents an object by a single point at the center of its bounding box. Other properties are regressed directly from the image features at the center location.
The input image is simply fed to a fully convolutional network that generates a heatmap. Peaks in the heatmap correspond to object centers, and the image features at each peak predict the height and width of the object's bounding box. The model is trained with standard dense supervised learning. Inference is a single network forward pass, with no non-maximal suppression in post-processing.

GitHub - xingyizhou/CenterNet: Object detection, 3D detection, and pose estimation using center point detection (github.com)

Related work

Object detection with implicit anchors.

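The authors' approach is closely related to anchor-based one-stage approaches, because a center point can be seen as a single shape-agnostic anchor. There are a few important differences, however.
- CenterNet assigns the "anchor" based solely on location, not on box overlap. No manual threshold between foreground and background is needed.
- There is only one positive "anchor" per object, so non-maximum suppression (NMS) is not needed. Local peaks are simply extracted from the keypoint heatmap.
- CenterNet uses a larger output resolution (output stride 4) than traditional detectors.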

Object detection by keypoint estimation.

CornerNet detects two bounding box corners as keypoints, while ExtremeNet detects the top-most, left-most, bottom-most, right-most, and center points of every object.
Both methods, however, require a combinatorial grouping stage after keypoint detection, which significantly slows down the algorithm. CenterNet simply extracts a single center point per object, with no grouping or post-processing.

Monocular 3D object detection.

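Deep3Dbox uses a slow-RCNN style framework: it first detects 2D objects and then runs 3D estimation on each of them. 3D RCNN adds an additional head to Faster-RCNN, followed by a 3D projection. CenterNet resembles a one-stage version of these methods, but it is simpler and faster.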

Preliminary

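Let $I \in R^{W \times H \times 3}$ be an input image of width W and height H. The goal is to produce a keypoint heatmap $\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, where R is the output stride and C is the number of keypoint types. For example, C = 17 for human joints in pose estimation, and C = 80 for object categories in object detection.
The default output stride is R = 4; the output prediction is downsampled by this factor. A prediction $\hat{Y}_{x, y, c} = 1$ corresponds to a detected keypoint, and $\hat{Y}_{x, y, c} = 0$ is background. Several different fully-convolutional encoder-decoder networks are used to predict $\hat{Y}$ from the image I: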
- up-convolutional residual networks (ResNet)
- deep layer aggregation (DLA)
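Each ground-truth keypoint p of class c is mapped to its low-resolution location $\tilde{p} = \lfloor p / R \rfloor$ and then splatted onto the ground-truth heatmap Y with a Gaussian kernel, $Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2σ_p^2}\right)$, where $σ_p$ is an object size-adaptive standard deviation. If two Gaussians of the same class overlap, the element-wise maximum is taken.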
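The splatting step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `draw_gaussian`, the (H, W) array layout, and the fixed `sigma` values are assumptions made for the example (in the paper, $σ_p$ depends on object size).

```python
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    """Splat one Gaussian peak onto a (H, W) heatmap in place.

    Overlapping Gaussians are merged with an element-wise maximum.
    """
    h, w = heatmap.shape
    cx, cy = center
    ys, xs = np.mgrid[0:h, 0:w]
    gaussian = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    np.maximum(heatmap, gaussian, out=heatmap)
    return heatmap

R = 4  # output stride
heatmap = np.zeros((32, 32))  # one class channel for a 128 x 128 input
# Two hypothetical object centers in input-pixel coordinates, same class.
for (px, py), sigma in [((40, 60), 2.0), ((52, 60), 2.0)]:
    low_res_center = (px // R, py // R)  # floor(p / R)
    draw_gaussian(heatmap, low_res_center, sigma)
```

Each center ends up with value exactly 1, and pixels between the two centers keep the larger of the two Gaussian responses rather than their sum.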
The training objective is a penalty-reduced pixel-wise logistic regression with focal loss.
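A NumPy sketch of this objective is given below, assuming the penalty-reduced focal loss form with hyperparameters alpha = 2 and beta = 4. The function name `center_focal_loss` and the `eps` clipping are illustrative choices, not taken from the post. Pixels where the ground-truth heatmap equals 1 are positives; all others are negatives whose penalty is reduced by $(1 - Y)^β$ near a center.

```python
import numpy as np

def center_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced pixel-wise focal loss.

    pred:   predicted heatmap with values in [0, 1]
    target: ground-truth Gaussian heatmap with the same shape
    """
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    pos = target == 1.0
    pos_term = ((1.0 - pred) ** alpha * np.log(pred))[pos].sum()
    neg_term = ((1.0 - target) ** beta * pred ** alpha
                * np.log(1.0 - pred))[~pos].sum()
    num_keypoints = max(int(pos.sum()), 1)  # normalize by keypoint count
    return -(pos_term + neg_term) / num_keypoints

target = np.zeros((4, 4))
target[1, 1] = 1.0
uninformed = np.full((4, 4), 0.5)
loss_uninformed = center_focal_loss(uninformed, target)
loss_perfect = center_focal_loss(target, target)
```

A prediction that matches the target gives a loss close to zero, while the constant 0.5 prediction is penalized at every pixel.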

To recover the discretization error caused by the output stride, a local offset $\hat{O}$ is additionally predicted for each center point. All classes c share the same offset prediction. The offset is trained with an L1 loss.
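The offset target is the fractional part lost when a center is divided by the stride and floored. A small sketch, with hypothetical helper names `offset_target` and `l1_offset_loss`:

```python
import numpy as np

def offset_target(points, R=4):
    """Return the low-resolution integer location and the offset p / R - floor(p / R)."""
    points = np.asarray(points, dtype=np.float64)
    low_res = np.floor(points / R)
    return low_res.astype(int), points / R - low_res

def l1_offset_loss(pred_offsets, target_offsets):
    """L1 loss averaged over the number of center points."""
    target_offsets = np.asarray(target_offsets, dtype=np.float64)
    n = max(len(target_offsets), 1)
    return np.abs(np.asarray(pred_offsets) - target_offsets).sum() / n

# A hypothetical center at input pixel (101, 58) with stride 4.
index, offset = offset_target([[101.0, 58.0]])
loss = l1_offset_loss(np.zeros((1, 2)), offset)
```

Here the center lands in output cell (25, 14), and the network would need to predict the offset (0.25, 0.5) to recover the exact position.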

Objects as Points

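An L1 loss at the center point is used to regress the object size. The scale is not normalized; raw pixel coordinates are used directly, and the loss is scaled by a constant $λ_{size}$ instead. The overall training objective is $L_{det} = L_k + λ_{size} L_{size} + λ_{off} L_{off}$, where $L_k$ is the keypoint focal loss, $L_{size}$ is the size loss, and $L_{off}$ is the offset loss.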

The network predicts C + 4 outputs at each location: C keypoint heatmap channels, 2 offset channels, and 2 size channels.
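The sketch below shows one way to split such an output and to combine the three loss terms. The channel order (heatmap, size, offset), the helper names, and the default weights `lambda_size=0.1` and `lambda_off=1.0` are assumptions for illustration; the post itself does not state them.

```python
import numpy as np

def split_heads(output, num_classes):
    """Split a (C + 4, H, W) output into heatmap, size, and offset maps.

    The channel order is an assumption made for this example.
    """
    heatmap = output[:num_classes]
    size = output[num_classes:num_classes + 2]
    offset = output[num_classes + 2:num_classes + 4]
    return heatmap, size, offset

def size_l1_loss(pred_sizes, gt_boxes):
    """L1 loss between predicted (w, h) and box sizes from (x1, y1, x2, y2)."""
    gt_boxes = np.asarray(gt_boxes, dtype=np.float64)
    gt_sizes = gt_boxes[:, 2:] - gt_boxes[:, :2]
    n = max(len(gt_sizes), 1)
    return np.abs(np.asarray(pred_sizes) - gt_sizes).sum() / n

def detection_loss(l_k, l_size, l_off, lambda_size=0.1, lambda_off=1.0):
    """Weighted sum of keypoint, size, and offset losses."""
    return l_k + lambda_size * l_size + lambda_off * l_off

output = np.zeros((80 + 4, 128, 128))  # C = 80 classes
heatmap, size, offset = split_heads(output, num_classes=80)
l_size = size_l1_loss([[50.0, 30.0]], [[10.0, 20.0, 58.0, 52.0]])
total = detection_loss(1.0, l_size, 0.5)
```

Because the size loss works on unnormalized pixel values, it is typically much larger than the other terms, which is why it gets a small weight.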

From points to bounding boxes

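At inference time, peaks are extracted from the heatmap independently for each category. The keypoint value $\hat{Y}_{x, y, c}$ at a peak is used as the detection confidence, and a bounding box is produced at that location from the predicted offset and size.
Peak keypoint extraction serves as a sufficient alternative to NMS. It can be implemented efficiently on device with a 3 x 3 max pooling operation.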
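The decoding step can be sketched in NumPy as below. It is an illustration under assumptions: `max_pool_3x3` and `decode` are hypothetical names, the top-k limit of 100 is a parameter chosen for the example, and boxes are returned in heatmap coordinates (multiplying by the stride R would map them back to the input image, assuming sizes are predicted in output-resolution units).

```python
import numpy as np

def max_pool_3x3(heatmap):
    """3 x 3 max pooling with stride 1 and same padding on a (C, H, W) array."""
    _, h, w = heatmap.shape
    padded = np.pad(heatmap, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    windows = [padded[:, dy:dy + h, dx:dx + w]
               for dy in range(3) for dx in range(3)]
    return np.max(windows, axis=0)

def decode(heatmap, size, offset, k=100):
    """Turn heatmap peaks into boxes (x1, y1, x2, y2), scores, and class ids."""
    c, h, w = heatmap.shape
    # A peak is a pixel that equals the maximum of its 3 x 3 neighborhood.
    is_peak = heatmap == max_pool_3x3(heatmap)
    scores = np.where(is_peak, heatmap, 0.0).ravel()
    top = np.argsort(-scores)[:k]
    top = top[scores[top] > 0]  # drop flat background
    classes, ys, xs = np.unravel_index(top, (c, h, w))
    cx = xs + offset[0, ys, xs]
    cy = ys + offset[1, ys, xs]
    bw, bh = size[0, ys, xs], size[1, ys, xs]
    boxes = np.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], axis=1)
    return boxes, scores[top], classes

hm = np.zeros((1, 8, 8))
hm[0, 3, 4] = 0.9  # true peak
hm[0, 3, 5] = 0.5  # neighbor of a stronger peak, suppressed
hm[0, 6, 1] = 0.7  # second peak
size_map = np.stack([np.full((8, 8), 4.0), np.full((8, 8), 2.0)])
offset_map = np.full((2, 8, 8), 0.5)
boxes, scores, classes = decode(hm, size_map, offset_map)
```

The response at (3, 5) is discarded because a stronger response sits in its 3 x 3 neighborhood, which is the role NMS would otherwise play.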

3D detection.

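3D detection estimates a three-dimensional bounding box per object. It requires three additional attributes per center point: depth, 3D dimension, and orientation. A separate head is added for each of them.
The depth d is a single scalar per center point. Depth is difficult to regress directly, however, so an output transformation $d = 1/σ(\hat{d}) - 1$ is applied, where σ is the sigmoid function.
Unlike the previous modalities, this head uses the inverse sigmoidal transformation at the output layer. The depth estimator is then trained with an L1 loss in the original depth domain, after the transformation.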
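A small sketch of the depth transformation and its loss, assuming the form $d = 1/σ(\hat{d}) - 1$. The function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_depth(raw):
    """Map the raw network output to depth: d = 1 / sigmoid(raw) - 1."""
    return 1.0 / sigmoid(raw) - 1.0

def depth_l1_loss(raw_pred, gt_depth):
    """L1 loss computed in the original depth domain, after the transformation."""
    raw_pred = np.asarray(raw_pred, dtype=np.float64)
    gt_depth = np.asarray(gt_depth, dtype=np.float64)
    n = max(gt_depth.size, 1)
    return np.abs(decode_depth(raw_pred) - gt_depth).sum() / n

near = decode_depth(0.0)            # raw output 0 maps to depth 1
far = decode_depth(-np.log(10.0))   # a negative raw output maps to a larger depth
loss = depth_l1_loss([0.0], [3.0])
```

Algebraically the transformation equals exp(-raw), so any real-valued output maps to a positive depth.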

Human pose estimation.

Human pose estimation aims to estimate k 2D human joint locations for every human instance in the image. The pose is treated as a k x 2-dimensional property of the center point.
The joint offsets are regressed directly with an L1 loss, and the human joint heatmaps are trained with focal loss.
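Treating the pose as a property of the center point means each joint is the center plus a regressed 2D offset. A minimal sketch, with a hypothetical helper `joints_from_center` and made-up numbers:

```python
import numpy as np

def joints_from_center(center, joint_offsets):
    """Recover k joint locations from a center point and a flat k * 2 offset vector."""
    center = np.asarray(center, dtype=np.float64)
    offsets = np.asarray(joint_offsets, dtype=np.float64).reshape(-1, 2)
    return center + offsets

# Hypothetical person center with k = 2 joints: (dx1, dy1, dx2, dy2).
joints = joints_from_center((10.0, 20.0), [1.0, -2.0, 0.0, 3.0])
```

For the C = 17 joint types mentioned earlier, the offset vector would have 34 entries.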

Conclusion

The CenterNet object detector builds on successful keypoint estimation networks. It finds object centers and regresses their size. The algorithm is simple, fast, and accurate, and it runs end-to-end without any post-processing such as NMS.
CenterNet can also estimate additional object properties, such as pose, 3D orientation, and depth.
