How does it work
An object detection algorithm is given a set of images containing the object of interest, and a set of images not containing it. Given enough images, the algorithm learns to recognize the underlying features of the object. Any new image can then be assessed by the algorithm, which will return positive matches for the object, if present.
OpenCV ships with an extensive set of utility tools for object detection. An introductory example is given below in steps.
Building a dataset
# find a collection of images without the object of interest
# and place them in a folder of your choice. Each image is
# named after the folder followed by a running index.
negatives/
/negatives1.png
/negatives2.png
...
# create a file.txt file referencing this folder
# file.txt content
negatives/negatives1.png
negatives/negatives2.png
...
# place the file in the same folder level as the negatives/
# folder from earlier
# find a representative image of the object and save it
# as e.g. positives.png
# create a new folder called classifier/ at the same level.
# the final structure is now
some_root_folder/
classifier/
negatives/
negatives1.png
negatives2.png
...
file.txt
positives.png
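The file.txt listing can be generated instead of hand-written. A minimal sketch in Python (the folder and file names follow the structure assumed above; `write_listing` is an illustrative helper, not part of OpenCV):

```python
import os

def write_listing(negatives_dir="negatives", out_path="file.txt"):
    """Write one relative image path per line, one entry per
    negative image, as expected by the -bg argument later."""
    with open(out_path, "w") as f:
        for name in sorted(os.listdir(negatives_dir)):
            if name.endswith(".png"):
                f.write(f"{negatives_dir}/{name}\n")

# run from some_root_folder/ where negatives/ lives
if os.path.isdir("negatives"):
    write_listing()
```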
Creating data samples
From the image containing the object, we synthesize 1000 new ones with a utility function. The results are saved in a special vectorized file format specified with the -vec argument.
# Spin up a terminal - make sure opencv is installed
opencv_createsamples \
-img positives.png \
-vec vec-positives \
-num 1000 \
-bg file.txt
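To sanity-check the generated file, the .vec header can be inspected. The layout sketched below (int32 sample count, int32 sample size, two unused shorts) is an assumption based on the commonly documented format, so verify it against your OpenCV version; the demo writes a synthetic header so the snippet is self-contained:

```python
import struct

def read_vec_header(path):
    """Read the 12-byte .vec header: sample count, sample size
    (width * height), and two unused shorts (assumed layout)."""
    with open(path, "rb") as f:
        count, size, _a, _b = struct.unpack("<iihh", f.read(12))
    return count, size

# demo on a synthetic header (1000 samples of 24x24 pixels)
with open("demo.vec", "wb") as f:
    f.write(struct.pack("<iihh", 1000, 24 * 24, 0, 0))
print(read_vec_header("demo.vec"))  # (1000, 576)
```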
Train the classifier
Next up, training uses the recently created vectorized file. The training result, i.e. the classifier, is stored under the classifier/ folder. The number of cascading stages of the algorithm is specified with the -numStages argument. -numNeg is the number of images without the object, located in the negatives folder, and -numPos is the number of positive samples we created in the vectorized file.
# basic parameters - don't expect a good result
opencv_traincascade \
-data classifier/ \
-vec vec-positives \
-bg file.txt \
-numPos 1000 \
-numNeg 5 \
-numStages 4
Test the classifier
After training, a number of files are located in the classifier/ folder; one of them is the actual trained classifier. It contains the collection of features, selected during training, that defines the object.
By providing a test image and the classifier .xml file, we can run the following Python script and see if any green bounding boxes appear in the image. If so, the classifier has identified what it believes to be our object.
# python 3.x
import cv2
import sys

# Get user supplied values
imagePath = sys.argv[1]
# provide the full path to the *.xml classifier
cascPath = sys.argv[2]

# show the paths used
print(imagePath)
print(cascPath)

# Create the Haar cascade
cascade = cv2.CascadeClassifier(cascPath)

# Read the image
image = cv2.imread(imagePath)
# make it grayscale for a 2D representation
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect objects in the image
objects = cascade.detectMultiScale(
    gray,
    scaleFactor=1.1,
    minNeighbors=5,
    minSize=(30, 30)
)

# Draw a green rectangle around each detected object
for (x, y, w, h) in objects:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow("objects found", image)
cv2.waitKey(0)
How it really works
The algorithm used is an optimized version of the original algorithm presented in “Rapid Object Detection using a Boosted Cascade of Simple Features” by Paul Viola and Michael Jones.
It uses Haar features to extract information about the object during training and searches for them later when testing. A Haar feature, named after Alfréd Haar's work on wavelets in the early 1900s, is similar to a convolutional kernel: a grid of pixel values slides over an image. Haar features can be used to find e.g. edges and lines in an image. The grid is variable in size but always contains two pixel groupings. A simple separation strategy for a square grid is dividing it in the middle, either vertically (creating a left/right grouping) or horizontally (an up/down grouping). There are more strategies, but the grid is always divided into two groups. The groups do not have to be adjacent in the grid; as an example, the top-left pixel could be assigned to the same group as the lower-right pixel.
# pixel groupings
# 0 and 1 indicates which group the pixel belongs to
# a 2x2 grid split into 2 groups vertically
[
0, 1
0, 1
]
# a 2x2 grid split into 2 groups in a cross fashion
[
0, 1
1, 0
]
# a 3x2 grid split into 2 groups by separating the middle row out
[
0, 0
1, 1
0, 0
]
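To make the grouping concrete, here is a minimal sketch in plain Python (the helper name and 0/1 mask notation are illustrative, matching the examples above) that evaluates a Haar-like feature as the difference between the two group sums:

```python
def haar_feature(patch, mask):
    """patch and mask are equally sized 2D lists; mask entries
    are 0 or 1, indicating which group each pixel belongs to.
    The feature value is sum(group 1) - sum(group 0)."""
    g0 = sum(p for prow, mrow in zip(patch, mask)
               for p, m in zip(prow, mrow) if m == 0)
    g1 = sum(p for prow, mrow in zip(patch, mask)
               for p, m in zip(prow, mrow) if m == 1)
    return g1 - g0

# a bright-left / dark-right patch and a vertical edge mask
patch = [[9, 1],
         [9, 1]]
edge_mask = [[0, 1],
             [0, 1]]
print(haar_feature(patch, edge_mask))  # strong response: 2 - 18 = -16
```

A large magnitude means the two groups differ a lot, i.e. the patch contains the kind of structure (here a vertical edge) the feature looks for.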
Each feature represents the sum of the pixels from the first group subtracted from the sum of the pixels in the second group. If we use a 24x24 image, there are more than 160,000 features available given the grid sizes from 2x2 up to 24x24. In order to speed up the process, a summed area table, or integral image, is created instead. It is a 2D lookup table which allows finding the sum of any sub-grid in an image, vastly reducing the number of computations needed when calculating features.
# an example of the summed area table
# here an 'image' of pixel intensities is given
[
1, 2, 5
1, 1, 1
]
# so we run over the image once from the top left corner to create the table given the following information
SUM(x,y) = i(x,y) + SUM(x-1,y) + SUM(x,y-1) - SUM(x-1,y-1)
Given a rectangle with 4 corners (A,B,C,D) the sum of the rectangle is
D - C - B + A
where A is top left, B is top right, C is bottom left, and D is bottom right.
If we have the summed area table, it's a matter of lookups and a simple subtraction, rather than recalculating the sums over and over, e.g. when A is fixed and D increases.
# the summed area table
[
{1}, {2+1+0-0}, {5+3+0-0},
{1+1+0-0}, {1+2+3-1}, {1+5+8-3},
]
# which evaluates to
[
1, 3, 8
2, 5, 11
]
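The scheme above can be sketched in a few lines of plain Python (illustrative helper names); the rectangle sum via D - C - B + A is demonstrated on the same 2x3 image, with a row/column of zero padding so the formula also works at the borders:

```python
def integral_image(img):
    """Build a summed area table with one row/column of zero
    padding, using SUM(x,y) = i(x,y) + left + above - diagonal."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            sat[y + 1][x + 1] = (img[y][x] + sat[y + 1][x]
                                 + sat[y][x + 1] - sat[y][x])
    return sat

def rect_sum(sat, x0, y0, x1, y1):
    """Sum of img[y0:y1][x0:x1] with 4 lookups: D - C - B + A."""
    return sat[y1][x1] - sat[y1][x0] - sat[y0][x1] + sat[y0][x0]

img = [[1, 2, 5],
       [1, 1, 1]]
sat = integral_image(img)
print(sat[2][3])                  # sum of the whole image: 11
print(rect_sum(sat, 1, 0, 3, 2))  # right 2x2 block: 2+5+1+1 = 9
```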
Most features are unimportant, and a filtering approach is therefore applied. Each feature is weighted based on how many of the training images it classifies correctly. A collection of features that together classify the training data well forms a combined feature, which performs well even though each individual feature may be weak on its own. Features whose error rate exceeds a certain threshold, i.e. those that mostly classify incorrectly, can be discarded entirely, drastically reducing the total number of features used for testing.
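The weighting idea can be sketched as a toy AdaBoost-style selection step (illustrative names, not OpenCV's implementation): each candidate feature is scored by its weighted error over the training samples, and the best one is kept for the combined feature:

```python
def best_feature(predictions, labels, weights):
    """predictions: per-feature list of predicted labels, one per
    training sample. Returns (index, weighted error) of the feature
    with the lowest weighted classification error."""
    best = None
    for i, preds in enumerate(predictions):
        err = sum(w for p, y, w in zip(preds, labels, weights) if p != y)
        if best is None or err < best[1]:
            best = (i, err)
    return best

labels  = [1, 1, 0, 0]
weights = [0.25, 0.25, 0.25, 0.25]
preds = [
    [1, 0, 0, 1],  # feature 0: 2 mistakes, weighted error 0.5
    [1, 1, 0, 1],  # feature 1: 1 mistake, weighted error 0.25
]
print(best_feature(preds, labels, weights))  # (1, 0.25)
```

In real boosting the sample weights are then increased for misclassified images and the selection repeats, so later features focus on the hard cases.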
However, this reduction is still too slow. In order to speed up the process, features are ordered and applied in stages, which is known as cascading. A preliminary stage suggests that there might be an object, with some level of uncertainty, and each subsequent stage becomes more confident that the object exists. If a stage fails, we stop and report no object in the tested image. A failed stage is one where the combined sum of the feature weights for that stage drops below a threshold. Testing is now very fast compared to an initial check of 160,000 features; we can sometimes end up using just 10-30 filtered features.
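The cascading idea can be sketched like this (illustrative; real stages combine weighted Haar features learned during training): each stage sums the votes of its features and rejects the window early when the sum falls below the stage threshold:

```python
def run_cascade(stages, window):
    """stages: list of (features, threshold) pairs, where each
    feature is a function window -> weighted vote. The window is
    rejected early as soon as a stage's score drops below its
    threshold; only windows passing every stage count as hits."""
    for features, threshold in stages:
        score = sum(f(window) for f in features)
        if score < threshold:
            return False  # early rejection: no object here
    return True  # all stages passed: object detected

# toy feature: votes 1.0 when the window's mean intensity is high
bright = lambda w: 1.0 if sum(w) / len(w) > 5 else 0.0

stages = [([bright], 0.5), ([bright, bright], 1.5)]
print(run_cascade(stages, [9, 9, 9]))  # True  - passes both stages
print(run_cascade(stages, [1, 1, 1]))  # False - rejected at stage 1
```

Because most windows fail in the cheap early stages, the expensive later stages run on only a tiny fraction of the image, which is what makes the cascade fast in practice.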