bounding box
ground truth
Four numbers determine a bounding box: $$ (x_{left-up}, y_{left-up}, x_{right-down}, y_{right-down}) $$ or, which is convenient for drawing, $$ (x_{left-up}, y_{left-up}, width, height) $$ Another kind is $$ (x_{center}, y_{center}, width, height) $$
Dataset
- Every row represents one object
- Columns: file_name, object_class, bounding box (4 values)
- COCO dataset
- 80 object classes, 330K pictures, 1.5M objects
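A minimal sketch of one label row and of conversions between the box formats listed under "ground truth". The `LabelRow` type, the function names, and the sample values are hypothetical and only serve the illustration; they do not come from any particular dataset loader.

```python
from typing import NamedTuple, Tuple

Box = Tuple[float, float, float, float]


def corner_to_center(box: Box) -> Box:
    """(x_left-up, y_left-up, x_right-down, y_right-down) -> (x_center, y_center, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (x1 + w / 2, y1 + h / 2, w, h)


def center_to_corner(box: Box) -> Box:
    """(x_center, y_center, width, height) -> corner format."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def corner_to_xywh(box: Box) -> Box:
    """Corner format -> (x_left-up, y_left-up, width, height), the form used for drawing."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


class LabelRow(NamedTuple):
    """Hypothetical row of a label file: one row per object."""
    file_name: str
    object_class: str
    box: Box  # assumed to be stored in corner format


row = LabelRow("img_0001.jpg", "dog", (60.0, 45.0, 378.0, 516.0))
print(corner_to_center(row.box))  # (219.0, 280.5, 318.0, 471.0)
print(corner_to_xywh(row.box))    # (60.0, 45.0, 318.0, 471.0)
```

Converting to the center format and back returns the original corners, which is a quick sanity check when mixing datasets that store boxes differently.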
anchor box
Many detection algorithms are anchor-box based.
- Propose a series of candidate boxes (anchor boxes).
- Predict whether each anchor box contains an object of interest.
- If yes, predict the offset from the anchor box to the bounding box.
Every anchor box is a training sample.
Each anchor box is either a negative sample (background) or associated with a ground-truth bounding box.
We generate a huge number of anchor boxes,
- so most of them end up as negative samples.
- To match them, record the IOU in a two-dimensional matrix indexed by (anchor box index, bounding box index). Take the largest IOU in the whole matrix, assign that anchor box to that bounding box, drop its row and column, and repeat until every bounding box is matched.
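The matching on the IOU matrix can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the function name `assign_anchors` is hypothetical, the matrix is a plain list of lists, and the second step (leftover anchor boxes take their best bounding box only if the IOU reaches a threshold of 0.5) is a common convention that the notes above do not state.

```python
def assign_anchors(iou, threshold=0.5):
    """iou[i][j] is the IOU of anchor box i with ground-truth box j.

    Returns a list whose entry i is the index of the bounding box assigned
    to anchor box i, or -1 if the anchor box is a negative sample.
    """
    n_anchors = len(iou)
    n_gt = len(iou[0]) if n_anchors else 0
    assigned = [-1] * n_anchors
    if n_gt == 0:
        return assigned

    free_rows = list(range(n_anchors))
    free_cols = list(range(n_gt))

    # Step 1: repeatedly take the globally largest remaining entry,
    # then remove its row (anchor box) and column (bounding box).
    while free_rows and free_cols:
        i, j = max(((r, c) for r in free_rows for c in free_cols),
                   key=lambda rc: iou[rc[0]][rc[1]])
        assigned[i] = j
        free_rows.remove(i)
        free_cols.remove(j)

    # Step 2 (assumed convention): a leftover anchor box is matched to its
    # best bounding box only if the IOU reaches the threshold.
    for i in free_rows:
        j = max(range(n_gt), key=lambda c: iou[i][c])
        if iou[i][j] >= threshold:
            assigned[i] = j
    return assigned


iou_matrix = [
    [0.05, 0.00],
    [0.14, 0.00],
    [0.00, 0.57],
    [0.00, 0.21],
    [0.75, 0.00],
]
print(assign_anchors(iou_matrix))  # [-1, -1, 1, -1, 0]
```

In this example three of the five anchor boxes stay negative, which matches the observation that most anchor boxes become negative samples.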
IOU
Intersection over Union (IOU) measures the similarity between two boxes.
- 0 means no overlap, 1 means complete overlap
It is a special case of the Jaccard index.
- For two given sets A and B $$ J(A, B) = \frac{|A\cap B|}{|A\cup B|} $$
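For two boxes, the sets A and B are the pixel areas the boxes cover. A minimal sketch, assuming both boxes are given in corner format $(x_{left-up}, y_{left-up}, x_{right-down}, y_{right-down})$; the function names are chosen for illustration only.

```python
def box_area(box):
    """Area of a corner-format box; degenerate boxes count as area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a, b):
    """Intersection over Union of two corner-format boxes."""
    # The intersection is itself a box: the innermost of the four edges.
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (complete overlap)
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0 (no overlap)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7
```

The union is computed as area(A) + area(B) - intersection, so the overlapping region is not counted twice.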
NMS
Non-Maximum Suppression
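- Every anchor box predicts a bounding box, so one object usually gets many near-duplicate predictions.
- NMS merges these similar predictions by keeping only the best one.
- Select the prediction with the highest confidence whose class is not negative (background).
- Discard every other box whose IOU with the selected box is larger than a threshold.
- Repeat these steps until every box has been either selected or discarded.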
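The steps above can be sketched in plain Python. Assumptions: boxes are in corner format, background predictions have already been filtered out by the caller, and the function names and the 0.5 threshold are illustrative choices rather than fixed parts of the method.

```python
def iou(a, b):
    """Intersection over Union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, best score first."""
    # Candidates sorted by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # select the most confident remaining box
        keep.append(best)
        # Discard boxes that overlap the selected one too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IOU of 81/119, about 0.68, so it is suppressed; the third box does not overlap and is kept.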
How to generate anchor boxes?
Generate anchor boxes with width $ws\sqrt{r}$ and height $hs/\sqrt{r}$, where $w$ and $h$ are the width and height of the picture, $s$ is the scale of the anchor box, and $r$ is the aspect ratio. Anchor boxes are generated centered on every pixel of the picture.
Using every combination of $n$ scales and $m$ ratios would give too many boxes, so a common method is to combine the best scale $s_1$ with all the ratios $r$, and $r_1 = 1$ with all the scales $s$: $$ (s_1, r_1), (s_1, r_2),…,(s_1, r_m),(s_2, r_1), (s_3, r_1),…,(s_n, r_1) $$ This gives $n+m-1$ anchor boxes per pixel.
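A small sketch of this scheme, using the width and height formulas above. The function names are hypothetical, `scales[0]` plays the role of $s_1$, and `ratios[0]` plays the role of $r_1$ (assumed to be 1).

```python
import math


def anchor_shapes(img_w, img_h, scales, ratios):
    """(width, height) of the n + m - 1 anchor boxes placed at each pixel."""
    # s_1 with every ratio, then r_1 with every remaining scale.
    pairs = [(scales[0], r) for r in ratios]
    pairs += [(s, ratios[0]) for s in scales[1:]]
    return [(img_w * s * math.sqrt(r), img_h * s / math.sqrt(r))
            for s, r in pairs]


def anchors_at(cx, cy, shapes):
    """Corner-format anchor boxes centered on the pixel (cx, cy)."""
    return [(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
            for w, h in shapes]


shapes = anchor_shapes(100, 100, scales=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
print(len(shapes))                  # 5, i.e. n + m - 1 with n = m = 3
print(shapes[0])                    # (75.0, 75.0)
print(anchors_at(50, 50, shapes)[0])  # (12.5, 12.5, 87.5, 87.5)
```

Repeating `anchors_at` for every pixel of a $w \times h$ picture gives $wh(n+m-1)$ anchor boxes in total, which is why the count per pixel is kept small.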
R-CNN
Region-CNN (R-CNN)
- Use selective search to select anchor boxes (region proposals)
- Use a pre-trained model to extract features from every anchor box
- Train an SVM to classify the objects
- Train a linear regression model to predict the offset for every box
ROI pooling
Turns anchor boxes of different sizes into outputs of the same shape so that they can be batched.
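A minimal single-channel sketch of the idea: the region is split into a fixed grid of bins and each bin is max-pooled, so regions of any size produce the same output shape. The function name, the list-of-lists feature map, and the exclusive right/bottom RoI convention are assumptions made for this illustration; library implementations differ in details such as rounding.

```python
def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the region roi = (x1, y1, x2, y2) of a 2D feature map
    into an out_h x out_w grid. x2 and y2 are exclusive indices."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [floor(i*h/out_h), ceil((i+1)*h/out_h)).
        r0 = y1 + (i * h) // out_h
        r1 = y1 + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * w) // out_w
            c1 = x1 + -(-((j + 1) * w) // out_w)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature = [[0, 1, 2, 3],
           [4, 5, 6, 7],
           [8, 9, 10, 11],
           [12, 13, 14, 15]]
print(roi_max_pool(feature, (0, 0, 3, 3), 2, 2))  # [[5, 6], [9, 10]]
print(roi_max_pool(feature, (0, 0, 4, 4), 2, 2))  # [[5, 7], [13, 15]]
```

Both regions, 3x3 and 4x4, come out as 2x2, which is what allows them to be stacked into one batch.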
Fast RCNN
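- Extract features from the whole picture with a CNN
- Find each anchor box's region on the feature map by scaling the box in the same proportion
- Apply RoI pooling to get a fixed-size feature for every anchor box
- Feed the result into fully connected layers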
Faster RCNN
Two-stage detector. It uses a region proposal network (RPN) in place of selective search.
RPN
Input: global feature map
Output: high-quality anchor boxes after NMS
The RPN uses a CNN to classify each anchor box as object or background (binary classification) and to regress its offset.
Faster R-CNN reaches a high mAP, but it is very slow.
Mask R-CNN
After RoI align, use a fully convolutional network (FCN) to predict a mask for each object.
This adds pixel-level segmentation of every detected object (instance segmentation) on top of detection.
SSD
Single Shot Detection
Uses multi-scale features.
There is no RPN; predictions are made for all anchor boxes directly.
Lower stages detect small objects, upper stages detect large objects.
SSD is fast, but its mAP is very low.
YOLO
You Only Look Once
In SSD the anchor boxes overlap heavily, which wastes a lot of computation.
- Divide the picture uniformly into $S \times S$ anchor boxes.
- Each anchor box predicts B bounding boxes, because one anchor box may contain multiple objects.
- Faster than SSD
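A small sketch of the grid split. It assumes, as in the original YOLO design, that the anchor box (grid cell) containing an object's center is the one responsible for predicting it; the function name and the values S = 7, B = 2 and a 448x448 picture are illustrative only.

```python
def responsible_cell(cx, cy, img_w, img_h, S):
    """(row, col) of the S x S grid cell that contains the point (cx, cy)."""
    # Clamp so that a center lying exactly on the right or bottom edge
    # still falls into the last cell.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


S, B = 7, 2
print(responsible_cell(224, 100, 448, 448, S))  # (1, 3)
print(S * S * B)                                # 98 predicted boxes per picture
```

With non-overlapping cells the number of predicted boxes is fixed at $S \times S \times B$, far fewer than placing several anchor boxes on every pixel.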