Object detection and tracking in PyTorch

Detecting multiple objects in images and tracking them in videos

In my previous story, I went over how to train an image classifier in PyTorch with your own images, and then use it for image recognition. Now I’ll show you how to use a pre-trained model to detect multiple objects in an image, and later track them across a video.
What’s the difference between image classification (recognition) and object detection? In classification, you identify the main object in the image, and the entire image is assigned a single class. In detection, multiple objects are identified in the image, each one is classified, and a location is determined for each (as a bounding box).
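Purely as an illustration (the class names, confidences, and coordinates below are made up), the difference shows up in the shape of the output:
# image classification: a single label for the whole image
classification_output = "airplane"
# object detection: one entry per object, each with a class, a confidence, and a bounding box
detection_output = [
    {"class": "airplane", "confidence": 0.97, "box": (34, 50, 210, 140)},   # box = (x1, y1, x2, y2)
    {"class": "airplane", "confidence": 0.91, "box": (250, 60, 420, 150)},
]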

Object Detection in Images

There are several algorithms for object detection, with YOLO and SSD among the most popular. For this story, I’ll use YOLOv3. I won’t get into the technical details of how YOLO (You Only Look Once) works; you can read that here. Instead, I’ll focus on how to use it in your own application.
So let’s jump into the code! The YOLO detection code here is based on Erik Lindernoren’s implementation of Joseph Redmon and Ali Farhadi’s paper. The code snippets below are from a Jupyter Notebook you can find in my Github repo. Before you run this, you’ll need to run the download_weights.sh script in the config folder to download the YOLO weights file. We start by importing the required modules:
from models import *
from utils import *
import os, sys, time, datetime, random
import numpy as np  # used later for np.linspace and np.array
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torch.autograd import Variable
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
Then we load the pre-trained configuration and weights, as well as the class names of the COCO dataset on which the Darknet model was trained. As always in PyTorch, don’t forget to set the model in eval mode after loading.
config_path='config/yolov3.cfg'
weights_path='config/yolov3.weights'
class_path='config/coco.names'
img_size=416
conf_thres=0.8
nms_thres=0.4
# Load model and weights
model = Darknet(config_path, img_size=img_size)
model.load_weights(weights_path)
model.cuda()  # this assumes a CUDA-capable GPU is available
model.eval()
classes = utils.load_classes(class_path)
Tensor = torch.cuda.FloatTensor
There are also a few pre-defined values above: the image size (416px square), the confidence threshold, and the non-maximum suppression threshold.
Below is the basic function that will return detections for a specified image. Note that it requires a Pillow image as input. Most of the code deals with resizing the image to a 416px square while maintaining its aspect ratio and padding the shorter side. The actual detection is in the last few lines.
def detect_image(img):
    # scale and pad image
    ratio = min(img_size/img.size[0], img_size/img.size[1])
    imw = round(img.size[0] * ratio)
    imh = round(img.size[1] * ratio)
    img_transforms=transforms.Compose([transforms.Resize((imh,imw)),
         transforms.Pad((max(int((imh-imw)/2),0), 
              max(int((imw-imh)/2),0), max(int((imh-imw)/2),0),
              max(int((imw-imh)/2),0)), (128,128,128)),
         transforms.ToTensor(),
         ])
    # convert image to Tensor
    image_tensor = img_transforms(img).float()
    image_tensor = image_tensor.unsqueeze_(0)
    input_img = Variable(image_tensor.type(Tensor))
    # run inference on the model and get detections
    with torch.no_grad():
        detections = model(input_img)
        detections = utils.non_max_suppression(detections, 80, 
                        conf_thres, nms_thres)
    return detections[0]
Finally, let’s put it together by loading an image, getting the detections, and then displaying it with the bounding boxes around detected objects. Again, most of the code here deals with scaling and padding the image, as well as getting different colors for each detected class.
# load image and get detections
img_path = "images/blueangels.jpg"
prev_time = time.time()
img = Image.open(img_path)
detections = detect_image(img)
inference_time = datetime.timedelta(seconds=time.time() - prev_time)
print ('Inference Time: %s' % (inference_time))
# Get bounding-box colors
cmap = plt.get_cmap('tab20b')
colors = [cmap(i) for i in np.linspace(0, 1, 20)]
img = np.array(img)
plt.figure()
fig, ax = plt.subplots(1, figsize=(12,9))
ax.imshow(img)
pad_x = max(img.shape[0] - img.shape[1], 0) * (img_size / max(img.shape))
pad_y = max(img.shape[1] - img.shape[0], 0) * (img_size / max(img.shape))
unpad_h = img_size - pad_y
unpad_w = img_size - pad_x
if detections is not None:
    unique_labels = detections[:, -1].cpu().unique()
    n_cls_preds = len(unique_labels)
    bbox_colors = random.sample(colors, n_cls_preds)
    # browse detections and draw bounding boxes
    for x1, y1, x2, y2, conf, cls_conf, cls_pred in detections:
        box_h = ((y2 - y1) / unpad_h) * img.shape[0]
        box_w = ((x2 - x1) / unpad_w) * img.shape[1]
        y1 = ((y1 - pad_y // 2) / unpad_h) * img.shape[0]
        x1 = ((x1 - pad_x // 2) / unpad_w) * img.shape[1]
        color = bbox_colors[int(np.where(
             unique_labels == int(cls_pred))[0])]
        bbox = patches.Rectangle((x1, y1), box_w, box_h,
             linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(bbox)
        plt.text(x1, y1, s=classes[int(cls_pred)], 
                color='white', verticalalignment='top',
                bbox={'color': color, 'pad': 0})
plt.axis('off')
# save image
plt.savefig(img_path.replace(".jpg", "-det.jpg"),        
                  bbox_inches='tight', pad_inches=0.0)
plt.show()
You can put together these code fragments to run the code, or download the notebook from my Github. Here are a few examples of object detection in images:

Object Tracking in Videos

So now you know how to detect different objects in an image. The visualization can be pretty cool when you do it frame by frame in a video and you see those bounding boxes moving around. But if there are multiple objects in those video frames, how do you know whether an object in one frame is the same as one in a previous frame? That’s called object tracking, and it uses multiple detections to identify a specific object over time.
There are several algorithms that do it, and I decided to use SORT, which is very easy to use and pretty fast. SORT (Simple Online and Realtime Tracking) is a 2016 paper by Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft, which proposes using a Kalman filter to predict the track of previously identified objects and match them with new detections. Alex Bewley also wrote a versatile Python implementation that I’m going to use for this story. Make sure you download the Sort version from my Github repo, since I had to make a few small changes to integrate it into my project.
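To make the matching step a bit more concrete, here’s a minimal, self-contained sketch (an illustration only, not the actual sort.py code) of how predicted track boxes can be matched to new detections by IoU overlap using the Hungarian algorithm (via scipy). The box coordinates are made up, and in the real tracker the predicted boxes come from the Kalman filter:
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # intersection-over-union of two [x1, y1, x2, y2] boxes
    xx1, yy1 = max(a[0], b[0]), max(a[1], b[1])
    xx2, yy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, xx2 - xx1) * max(0.0, yy2 - yy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# boxes predicted by the existing tracks vs. fresh detections (made-up numbers)
predicted = np.array([[10, 10, 50, 50], [100, 100, 150, 160]], dtype=float)
new_dets = np.array([[12, 11, 52, 49], [200, 200, 240, 250]], dtype=float)

cost = np.array([[1.0 - iou(p, d) for d in new_dets] for p in predicted])
rows, cols = linear_sum_assignment(cost)   # Hungarian assignment on (1 - IoU)
for r, c in zip(rows, cols):
    if 1.0 - cost[r, c] >= 0.3:            # SORT uses an IoU threshold of 0.3 by default
        print("track %d matched detection %d (IoU %.2f)" % (r, c, 1.0 - cost[r, c]))
# unmatched detections start new tracks; tracks that stay unmatched for a few frames are dropped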
Now on to the code. The first three code segments are the same as in the single-image detection, since they deal with getting the YOLO detections on a single frame. The difference comes in the final part, where for each frame we pass the detections to the update function of the Sort object, which returns references to the objects in the image. So instead of the regular detections from the previous example (which include the coordinates of the bounding box and a class prediction), we get tracked objects that, besides the parameters above, also include an object ID. Then we display them almost the same way, but adding that ID and using different colors so you can easily follow the objects across the video frames.
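For reference, here’s the column layout of each row, matching how the two loops in this post unpack their values:
# a detection row returned by detect_image():     x1, y1, x2, y2, conf, cls_conf, cls_pred
# a tracked row returned by mot_tracker.update(): x1, y1, x2, y2, obj_id, cls_pred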
I also used OpenCV to read the video and display the video frames. Note that the Jupyter notebook is quite slow in processing the video. You can use it for testing and simple visualizations, but I also provided a standalone Python script that will read the source video, and output a copy with the tracked objects. Playing an OpenCV video in a notebook is not easy, so you can keep this code for other experiments.
videopath = 'video/intersection.mp4'
%pylab inline 
import cv2
from IPython.display import clear_output
cmap = plt.get_cmap('tab20b')
colors = [cmap(i)[:3] for i in np.linspace(0, 1, 20)]
# initialize Sort object and video capture
from sort import *
vid = cv2.VideoCapture(videopath)
mot_tracker = Sort()
#while(True):
for ii in range(40):  # just the first 40 frames in the notebook (swap in the while loop above for the full video)
    ret, frame = vid.read()
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    pilimg = Image.fromarray(frame)
    detections = detect_image(pilimg)
    img = np.array(pilimg)
    pad_x = max(img.shape[0] - img.shape[1], 0) * (img_size / max(img.shape))
    pad_y = max(img.shape[1] - img.shape[0], 0) * (img_size / max(img.shape))
    unpad_h = img_size - pad_y
    unpad_w = img_size - pad_x
    if detections is not None:
        tracked_objects = mot_tracker.update(detections.cpu())
        unique_labels = detections[:, -1].cpu().unique()
        n_cls_preds = len(unique_labels)
        for x1, y1, x2, y2, obj_id, cls_pred in tracked_objects:
            box_h = int(((y2 - y1) / unpad_h) * img.shape[0])
            box_w = int(((x2 - x1) / unpad_w) * img.shape[1])
            y1 = int(((y1 - pad_y // 2) / unpad_h) * img.shape[0])
            x1 = int(((x1 - pad_x // 2) / unpad_w) * img.shape[1])
            color = colors[int(obj_id) % len(colors)]
            color = [i * 255 for i in color]
            cls = classes[int(cls_pred)]
            cv2.rectangle(frame, (x1, y1), (x1+box_w, y1+box_h),
                         color, 4)
            cv2.rectangle(frame, (x1, y1-35), (x1+len(cls)*19+60,
                         y1), color, -1)
            cv2.putText(frame, cls + "-" + str(int(obj_id)), 
                        (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 
                        1, (255,255,255), 3)
    fig=figure(figsize=(12, 8))
    title("Video Stream")
    imshow(frame)
    show()
    clear_output(wait=True)
After you play with the notebook, you can use the regular Python script both for live processing (you can take input from a camera) and to save videos. Here’s a sample of videos I generated with this program.
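If you want to roll your own, here’s a minimal sketch of reading a video (or a camera) with OpenCV and saving the annotated frames with cv2.VideoWriter. The output filename and codec are just examples, and the detection/tracking part is elided:
import cv2
vid = cv2.VideoCapture(videopath)            # or cv2.VideoCapture(0) for a live camera feed
fps = vid.get(cv2.CAP_PROP_FPS) or 30.0      # fall back to 30 fps if the source doesn't report it
w = int(vid.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT))
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter(videopath.replace('.mp4', '-det.mp4'), fourcc, fps, (w, h))
while True:
    ret, frame = vid.read()
    if not ret:
        break
    # ... run detect_image() and mot_tracker.update() here, and draw the boxes and IDs
    #     on the frame exactly as in the notebook loop above ...
    out.write(frame)
out.release()
vid.release()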
And that’s it! You can now try on your own to detect multiple objects in images and to track those objects across video frames. You can also do more research on YOLO and find out how to train a model with your own images.
Chris Fotache is an AI researcher with CYNET.ai based in New Jersey. He covers topics related to artificial intelligence in our life, Python programming, machine learning, computer vision, natural language processing and more.
