Every day, we battle unmoving traffic on our way to work. Often, the city’s biggest arteries are the most congested, and we resign ourselves to wasting precious minutes and hours while breathing in the exhaust fumes from the vehicles around us. Perhaps watching others with the same defeat written in their slouched postures provides some solace. Unfortunately, there seems to be no solution to this unrelenting mass of tangled automobiles fighting their way through bottlenecks.
How do we tackle this alarming problem and its myriad consequences, which range from premature wear on vehicles to increased carbon emissions? The answer, as Mr. Shiv Surya and Dr. Venkatesh Babu from the Indian Institute of Science (IISc), Bengaluru, suggest, is to first identify the small areas that are prime spots for congestion and analyse the traffic at these points. Once that is done, we can look at appropriate solutions: either building alternate routes (more infrastructure) or adjusting the traffic signal timings.
To take this first step, the researchers have developed ‘TraCount’, an automated system that analyses images from the surveillance cameras installed in a congested area and provides an accurate count of the vehicles contributing to the traffic. For a human being, it is quite easy to identify a car partially hidden behind a tree or another car. For a computer, however, such partial views often lead to confused predictions. Moreover, the vehicles farthest from the camera are the hardest to detect, because they appear small and are heavily occluded.
TraCount addresses these problems with the help of convolutional neural networks (CNNs), a class of artificial neural networks (ANNs), which are loosely modelled on the human brain. ANNs classify data by being ‘trained’ to learn the relationship between a set of inputs and their labels. CNNs are computationally better suited to processing images. They consist of layers of learnable filters that are activated by certain visual features or concepts in an image. For instance, a filter that detects a visual feature such as the edge of a vertical surface would respond to the edges of a table, including its legs.
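To make the idea of a filter concrete, here is a minimal sketch in plain Python. It is only an illustration, not TraCount's code: the filter values are hand-picked rather than learned, and the function name `convolve2d` and the toy image are assumptions made for this example.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    As in most CNN libraries, this is technically a cross-correlation:
    the kernel is not flipped before it is applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hand-picked (not learned) filter that responds to dark-to-bright
# transitions from left to right, i.e. vertical edges.
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# Toy 4x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

response = convolve2d(image, VERTICAL_EDGE)
for row in response:
    print(row)
# [0, 3, 3, 0]
# [0, 3, 3, 0]
```

The response is zero over the flat regions and strongest where the window straddles the boundary between dark and bright. In a trained CNN, the filter values are learned from data instead of being written by hand.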
The TraCount model consists of repeating blocks, each made up of a convolutional layer, a non-linearity layer and a pooling layer. A convolutional layer consists of ‘learnable’ filters. In the first layers, these filters are activated when they recognize a simple visual feature, such as the edge of a surface or a blotch of colour. The deeper convolutional layers produce a strong activation, or ‘fire’, for higher-level visual concepts such as the wheel of a vehicle. Pooling aggregates the strongest responses and reduces the volume of data, thus reducing the computational load.
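The non-linearity and pooling steps can be sketched in a few lines of Python. The article does not say which non-linearity or pooling TraCount uses, so ReLU and 2x2 max pooling are assumed here as common choices; the feature map values are made up for illustration.

```python
def relu(feature_map):
    """Non-linearity (assumed here to be ReLU): negative responses become zero."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep only the strongest response in each non-overlapping size x size window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Made-up 4x4 output of a convolutional layer.
feature_map = [
    [1, -2, 0, 4],
    [-3, 5, -1, 2],
    [0, -1, -6, -2],
    [2, 1, -3, -4],
]

activated = relu(feature_map)
pooled = max_pool(activated)
print(pooled)  # [[5, 4], [2, 0]]
```

Sixteen values go in and four come out, which is the reduction in data volume described above: each 2x2 window is summarized by its strongest response.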
TraCount comprises two shallow fully convolutional (FC) sub-networks fused with a deep monolithic FC network. “A monolithic CNN is a CNN with each filter operating on the feature maps of all filters on a previous layer. A feature map is the output of one filter applied to the input from the previous layer”, explains Mr. Surya. Fully convolutional networks differ from the usual CNNs in that they have no fully connected layers at the end. They are suitable when we want to predict an image-like output (here, a density map showing how vehicles are distributed in an image) rather than just a vehicle count.
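A density map is useful because it still yields a count: if every vehicle contributes a small blob whose values add up to one, summing the whole map gives the number of vehicles. The sketch below builds such a map by hand as a stand-in for a network's output. The blob shape, the helper name `make_density_map` and the vehicle positions are all assumptions for this example, not details taken from TraCount.

```python
# 3x3 blob whose entries sum to exactly 1, so each vehicle adds 1 to the total.
BLOB = [
    [1 / 16, 2 / 16, 1 / 16],
    [2 / 16, 4 / 16, 2 / 16],
    [1 / 16, 2 / 16, 1 / 16],
]


def make_density_map(height, width, centres):
    """Place one blob per vehicle centre; blobs may overlap and simply add up."""
    density = [[0.0] * width for _ in range(height)]
    for row, col in centres:
        # Keep the blob fully inside the map so that no density is lost.
        if not (1 <= row < height - 1 and 1 <= col < width - 1):
            raise ValueError("centre must be at least one pixel from the border")
        for i in range(3):
            for j in range(3):
                density[row - 1 + i][col - 1 + j] += BLOB[i][j]
    return density


# Three hypothetical vehicles; the first two are close enough to overlap.
centres = [(1, 1), (2, 2), (4, 5)]
density = make_density_map(6, 8, centres)

count = sum(sum(row) for row in density)
print(count)  # 3.0
```

Even where the first two blobs overlap, as partially hidden vehicles would, the total is still three. The map also shows where the vehicles are, which a single number cannot.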
Vehicles appear in many scales and shapes, and those far from the camera are often occluded. Handling such large variations in scale requires filters with different receptive fields. “The size of the filter is one of the factors that determines its receptive field, which is the region of the image that the filter operates on”, says Mr. Surya. To augment the ability of the monolithic FC network to handle vehicles of various sizes, two smaller FC ‘sub’-networks with varied receptive fields are used. For better prediction accuracy, the predictions of the sub-networks are combined with those of the deeper monolithic network.
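How filter size and depth shape the receptive field can be worked out with a standard recurrence: each layer widens the field by (kernel size - 1) times the product of the strides of the layers before it. The sketch below applies it to three hypothetical layer stacks. These configurations are invented for illustration and are not TraCount's actual architecture.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input
    to output. Convolution and pooling layers are treated the same way.
    """
    field = 1  # one output unit sees one pixel before any layer is applied
    jump = 1   # distance in input pixels between neighbouring units
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Hypothetical configurations, chosen only to illustrate the effect.
shallow_small = [(3, 1), (3, 1)]                   # two 3x3 convolutions
shallow_large = [(7, 1), (5, 1)]                   # larger filters, same depth
deep = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]    # convolutions with 2x2 pooling

print(receptive_field(shallow_small))  # 5
print(receptive_field(shallow_large))  # 11
print(receptive_field(deep))           # 18
```

Networks with different filter sizes and depths therefore look at regions of different sizes, which is why combining them can help with vehicles that appear both large and small in the same image.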
The team tested TraCount on the TRANCOS dataset, which consists of 1244 images of vehicular traffic. The Mean Absolute Error metric was used to evaluate the performance of the system. TraCount’s novel architecture reduced the counting error by more than 4% compared to the baseline architecture, and it also improved on a single deep monolithic FC network used alone.
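Mean Absolute Error is simply the average, over all test images, of how far the predicted count is from the true count. A minimal sketch follows; the counts are made-up numbers for illustration, not results from TRANCOS or TraCount.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true counts."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up vehicle counts for four images.
predicted_counts = [12, 30, 7, 21]
true_counts = [10, 33, 7, 25]

mae = mean_absolute_error(predicted_counts, true_counts)
print(mae)  # 2.25
```

Here the system is off by 2.25 vehicles per image on average. Over-counts and under-counts do not cancel out, because each error is taken as an absolute value before averaging.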
On future work to improve TraCount, Mr. Surya mentions the use of attention modules. Attention modules help the CNN focus on the relevant parts of an image. “Attention can help a network to look at a region in the image and its context and disambiguate background that is misleading. This disambiguation can potentially arise from having networks that give feedback and use attention to increase or decrease context”, he says.
With TraCount paving the way for better handling of traffic in metropolitan areas, we can look forward to a less time-consuming commute to work and to breathing easy, quite literally.