The Internet is a bottomless mine of information in various forms – text, videos and images. Organizing this information for easy search and retrieval is very beneficial to internet users, and poses challenges to computer scientists. While much research progress has been made on categorizing textual data, the same cannot be said about images and videos. A group of researchers at the Indian Institute of Science, Bangalore, has been attempting to make video search on the Internet user-friendly. In a recent paper, Prof. Chiranjib Bhattacharyya and Dr. Adway Mitra, a Ph.D. scholar, both at the Department of Computer Science and Automation (CSA), and Prof. Soma Biswas from the Department of Electrical Engineering, have presented techniques to this end.
The researchers have developed the idea of providing “video summaries” so that users can search and find interesting videos easily. “If you want some specific information from a set of videos, you do not want to watch all of them to know if they are relevant for you. That's why we need video summaries that can be generated automatically and are informative to users”, says Dr. Mitra on the purpose of this study.
The researchers focused on the objects in a video in order to provide a concise summary using frames from different parts of the video. “By making use of object detectors and action detectors, the computer can know the contents of an image, with limited accuracy,” explains Dr. Mitra. Finding the objects in a video is based on the technique of ‘entity discovery’ – identifying a particular object or person and tracking all of its occurrences in the video. A ‘tracklet’ latches onto a particular entity and is a collection of frames containing that entity. The process of grouping tracklets based on the entities associated with them is called ‘tracklet clustering’.
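To make the idea of tracklet clustering concrete, here is a minimal sketch, not the authors' algorithm: it assumes each tracklet carries a simple appearance feature vector, and greedily groups tracklets whose features are close, with each resulting group standing for one discovered entity. The `Tracklet` class, the feature representation and the distance threshold are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Tracklet:
    frames: list    # frame indices where the entity is visible (assumed representation)
    feature: tuple  # appearance feature summarising the tracklet (assumed representation)

def cluster_tracklets(tracklets, threshold=0.5):
    """Greedy tracklet clustering sketch: assign each tracklet to the first
    cluster whose representative feature lies within `threshold`, otherwise
    open a new cluster, i.e. declare a new entity."""
    clusters = []  # each cluster is a list of tracklets following one entity
    for t in tracklets:
        for c in clusters:
            rep = c[0].feature  # first tracklet acts as the cluster representative
            dist = sum((a - b) ** 2 for a, b in zip(t.feature, rep)) ** 0.5
            if dist < threshold:
                c.append(t)
                break
        else:
            clusters.append([t])  # no close cluster found: a new entity appears
    return clusters
```

A real system would use learned appearance features and a far more robust clustering criterion; the point here is only the grouping step itself.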
Conventional techniques of entity discovery have major drawbacks. They struggle in cases where an entity appears in a video for a while and then reappears much later. Entity detection in frames is based on computer vision, a relatively young area of research, and fully accurate detection at all times is still a challenge. Existing methods also fail to work satisfactorily with streamed videos, since the algorithm gets only a single pass over the entire video sequence.
To address these drawbacks, the researchers propose the concept of ‘temporal coherence’ at two levels: the detection level and the tracklet level. At the detection level, features are assumed to change little within a tracklet, since it follows a single entity. At the tracklet level, tracklets that are close to each other in space and time, but do not overlap, are likely to follow the same entity, whereas overlapping tracklets must belong to two separate entities. “Since no metadata is available with the video, it is important to leverage structural properties like temporal coherence,” says Dr. Mitra.
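The tracklet-level rules above can be sketched as pairwise constraints. This is an illustrative reading of the idea, not the paper's formulation: tracklets that overlap in time cannot be the same entity, while non-overlapping tracklets that are close in both time and space probably are. The dictionary layout (`start`, `end`, `pos`) and the gap/distance thresholds are assumptions for the sketch.

```python
def overlap(t1, t2):
    """True if the two tracklets share any frame, i.e. appear simultaneously."""
    return not (t1['end'] < t2['start'] or t2['end'] < t1['start'])

def coherence_constraint(t1, t2, max_gap=30, max_dist=50.0):
    """Return 'cannot-link' for overlapping tracklets (two tracklets on screen
    at once must be different entities), 'must-link' for tracklets close in
    time and space, and None when temporal coherence says nothing."""
    if overlap(t1, t2):
        return 'cannot-link'
    # frame gap between the end of one tracklet and the start of the other
    gap = min(abs(t1['start'] - t2['end']), abs(t2['start'] - t1['end']))
    dx = t1['pos'][0] - t2['pos'][0]
    dy = t1['pos'][1] - t2['pos'][1]
    if gap <= max_gap and (dx * dx + dy * dy) ** 0.5 <= max_dist:
        return 'must-link'
    return None
```

Constraints of this kind can then guide a clustering algorithm, ruling some merges out and encouraging others.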
Since the algorithm will not know how many entities a video contains until it has ‘watched’ the entire video, it needs to adapt to new discoveries made during processing. The researchers used a statistical modelling approach called Bayesian nonparametrics, which adapts well to an unknown number of entities.
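A classic example of this flavour of model, shown here only as a generic illustration and not as the paper's specific model, is the Chinese Restaurant Process: each new item joins an existing cluster with probability proportional to that cluster's size, or opens a brand-new cluster with probability proportional to a parameter `alpha`, so the number of clusters is never fixed in advance. The function name and parameters below are illustrative.

```python
import random

def crp_assignments(n_items, alpha=1.0, seed=0):
    """Sample cluster labels from a Chinese Restaurant Process: a
    Bayesian-nonparametric prior where item i joins cluster k with
    probability proportional to counts[k], or starts a new cluster
    with probability proportional to alpha."""
    rng = random.Random(seed)
    counts = []   # counts[k] = number of items already in cluster k
    labels = []
    for i in range(n_items):
        # existing clusters weighted by size, plus one slot for a new cluster;
        # the weights sum to i + alpha because the counts sum to i
        weights = counts + [alpha]
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):
            counts.append(1)   # the item opened a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels
```

In an entity-discovery setting, each “cluster” would correspond to one discovered entity, and the model happily creates a new one whenever a genuinely new face or object turns up mid-video.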
Compared with other methods, this approach reduces cases where a tracklet that is supposed to be tracking person A ends up attaching to person B. It was also found to work well with streaming videos, outperforming competing models. This study is one of many attempts to automate video tagging and video summaries. Hopefully, efforts like this may reshape the tedious task of video searching.