This project aims to create a simple, real-time object-tracking system that lets a user manually select a target in a video feed and then automatically keeps that target “locked” as it moves through the frame. Standard object detectors like YOLO can identify people, cars, or other objects, but they treat every frame separately, which means they don’t maintain identity over time. This system addresses that by combining YOLO detections with an IoU-based tracking method that compares box overlap across frames to decide which detection corresponds to the same object. This helps prevent tracking drift, reduces false switches between objects, and allows the tracker to stay locked even when the detector briefly misses the target.
The project addresses common problems such as inconsistent detection, the inability to track a specific object of interest, and the need for stable tracking during motion or occlusion. It is intended for situations where a person, vehicle, or other object needs to be identified and followed across frames, with potential applications in law enforcement and search and rescue. Overall, it provides a lightweight foundation for more advanced tracking methods and future extensions like prediction, optical flow, or behavior recognition.
The code implements a frame-by-frame video processing loop that combines deep-learning-based object detection with rule-based target tracking logic. Video frames are acquired using OpenCV’s VideoCapture interface, either from a live camera device or a stored video file. Each frame is passed into a pretrained Ultralytics YOLO model, which performs convolutional neural network inference to produce a set of bounding box detections in pixel coordinates for that frame.
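A minimal sketch of that capture-and-detect loop is shown below; the weights filename and video filename are illustrative placeholders, not necessarily the script's actual values:

import cv2
from ultralytics import YOLO

model = YOLO("visDrone.pt")            # pretrained detector weights (assumed filename)
cap = cv2.VideoCapture("flight.mp4")   # or cv2.VideoCapture(0) for a live camera

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Run inference on the current frame; results[0].boxes holds pixel-space boxes.
    results = model(frame, verbose=False)
    boxes = results[0].boxes.xyxy.cpu().numpy()   # (N, 4) array of [x1, y1, x2, y2]
    # ... tracking and drawing logic would go here ...

cap.release()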
The program maintains internal state variables that represent a movable “manual” selection box, a currently tracked bounding box, and counters that measure how long a target has been lost. When the system is unlocked, it compares all YOLO detections to the manual box using an Intersection-over-Union (IoU) calculation; if sufficient overlap is detected, the program locks onto that detection and stores its bounding box as the tracking reference. When locked, the code iterates through detections in each new frame and selects the detection with sufficient IoU overlap relative to the previously tracked box, effectively performing temporal association across frames. If no detection meets the overlap threshold for a specified number of frames, the system automatically unlocks and resets.
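The sketch below illustrates that IoU-based association step; the helper names, the IoU threshold, and the lost-frame limit are assumptions for illustration, not the script's actual parameters:

def iou(box_a, box_b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

IOU_THRESHOLD = 0.3    # assumed overlap threshold
MAX_LOST_FRAMES = 30   # assumed number of missed frames before automatic unlock

def update_track(tracked_box, detections, lost_frames):
    """Pick the detection that best overlaps the previous box, or count a miss."""
    best = max(detections, key=lambda d: iou(tracked_box, d), default=None)
    if best is not None and iou(tracked_box, best) >= IOU_THRESHOLD:
        return best, 0                       # re-associated: reset the lost counter
    lost_frames += 1
    if lost_frames > MAX_LOST_FRAMES:
        return None, 0                       # lost too long: unlock and reset
    return tracked_box, lost_frames          # keep the last known box while waiting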
Throughout execution, the program renders bounding boxes and status labels directly onto the video frames, writes the annotated frames to an output video file, and handles user input for box movement, locking, unlocking, and program termination. The result is a real-time object tracking system that bridges neural network detection with classical tracking logic, without relying on a dedicated multi-object tracking algorithm.
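A rough sketch of this annotation and output stage is given below, assuming a 1280x720, 30 fps output file and illustrative status labels:

import cv2

fourcc = cv2.VideoWriter_fourcc(*"mp4v")
writer = cv2.VideoWriter("output_manual_lock.mp4", fourcc, 30.0, (1280, 720))

def annotate(frame, box, locked):
    """Draw the current box and a status label onto the frame."""
    color = (0, 0, 255) if locked else (0, 255, 0)   # red when locked, green otherwise
    x1, y1, x2, y2 = map(int, box)
    cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
    cv2.putText(frame, "LOCKED" if locked else "SEARCHING", (x1, y1 - 8),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
    return frame

# Inside the main loop, each annotated frame is saved and a key is polled:
# writer.write(annotate(frame, box, locked))
# key = cv2.waitKey(1) & 0xFF   # e.g. 'w/a/s/d' to move, 't' to unlock, 'q' to quit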
As you may have noticed, the footage is extremely grainy and constantly glitches. This is because it is an analogue feed received from a first-person-view (FPV) action drone I built a couple of years ago. I thought this would be the perfect opportunity/excuse to fly it around campus gathering footage of cars and pedestrians. This would also end up being a challenge for my program: due to the lower video quality, the model has more trouble picking up and tracking detections. Nevertheless, I tweaked the program to be more accurate, and it ended up working perfectly.
After gathering some footage, I decided to write a simple script to test different YOLO models. This program would just overlay boxes over detections with their class name and certainty percentage. After going through a few models, I noticed the best one by far was the model trained on the VisDrone dataset: it produced far more detections per frame and higher certainty levels. The model was working great, but I quickly noticed it was extremely computationally heavy. After some research, I found that running YOLO and other computer-vision workloads on the CPU is not very efficient. By using CUDA, the program can offload inference to the computer's graphics card (GPU), which lets it run smoother, faster, and more efficiently.
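As a hedged illustration, selecting the GPU with Ultralytics can look roughly like this; the placeholder frame is only there to make the snippet self-contained:

import numpy as np
import torch
from ultralytics import YOLO

device = "cuda" if torch.cuda.is_available() else "cpu"    # fall back to CPU if no CUDA GPU
model = YOLO("visDrone.pt")                                 # assumed weights filename

frame = np.zeros((720, 1280, 3), dtype=np.uint8)            # placeholder frame for illustration
results = model(frame, device=device, verbose=False)        # inference runs on the chosen device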
This program also supports live video received via a capture card (treated as a webcam). The basic setup for analogue FPV video input is a 5.8 GHz receiver and an AV-to-USB capture card. Most FPV goggles support AV video output, so they can serve as the receiver. The capture card, on the other hand, can be bought for around $20 off Amazon or AliExpress. Once connected, the operating system recognizes the capture card as a standard webcam device, allowing the program to access the live video feed using OpenCV’s VideoCapture interface without any special drivers or custom decoding. This makes the setup simple and flexible, as the same code can work with either a traditional webcam or an FPV receiver feed. The incoming analog signal is digitized by the capture card and processed frame-by-frame by the computer, where the YOLO model performs object detection and the tracking logic maintains a lock on selected targets in real time. This approach allows low-cost FPV hardware to be integrated into a computer vision pipeline, enabling experimentation with real-time detection and tracking using live aerial video.
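Since camera indices vary between systems, a small probe like the sketch below (an illustrative helper, not part of the script) can help find which index the capture card landed on:

import cv2

def find_capture_device(max_index=4):
    """Return the first camera index that actually delivers frames, or None."""
    for idx in range(max_index):
        cap = cv2.VideoCapture(idx)
        ok, _ = cap.read()
        cap.release()
        if ok:
            return idx
    return None

index = find_capture_device()
if index is not None:
    cap = cv2.VideoCapture(index)   # the AV-to-USB card appears as a normal webcam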
1) Add any video files you wish to use with the program to the same folder as the program (accepted files: .mov, .avi, .mp4)
2) Ensure you have downloaded and added visDrone.pt to the script’s folder. This can be done on the official website: https://docs.ultralytics.com/datasets/detect/visdrone/
3) Ensure you have downloaded and installed Ultralytics and CUDA if you wish to use your GPU.
4) If you are using live video, ensure your capture card/signal receiver is working.
5) Run the script tracker_lsoldano.py.
6) The program will prompt you to select either webcam (live video) or video file. Enter 1 or 2 respectively to select one.
7) If video file is selected, a list of available video files will be printed; enter the corresponding number to select one. If you do not see your file, it may not be an accepted file type or may not be in the same directory as the script.
8) Next, you will be prompted to choose a targeter size. This is the size of the manual tracker window you will move across the screen to select your target. For more accurate selections, choose the smallest of the 3 options; otherwise, choose a larger one to scan over larger areas. Again, to select one, enter the corresponding number.
9) The program will then confirm whether your GPU is available for use. If it cannot use the GPU, the program will run significantly slower, and keep in mind your CPU may heat up significantly.
10) Press Enter to start the program; a window should appear on your taskbar.
11) Use ‘W,A,S,D’ (up, left, down, right) to move the green targeting window around the screen. When it turns red, it is tracking a target. To unlock, press T, the window should turn green again—unlocking—and it should be recentered on the screen. To quit, press Q.
12) Once the window closes, a video file named output_manual_lock.mp4 should be saved with the footage of the tracking.