Dhruv Batra seeks to remove ambiguity in computer visual recognition systems
August 11, 2014
When Dhruv Batra of the Virginia Tech College of Engineering travels in September to Zurich for the 2014 European Conference on Computer Vision, he will be a rising star in the growing field of vision and pattern recognition in computers.
The assistant professor in Virginia Tech's Bradley Department of Electrical and Computer Engineering previously co-led a tutorial in the research field at another industry conference in Ohio this past June. On his way to Zurich, Batra will give talks on the same subject (creating software programs that help computers "see" and understand photographs just as humans can) at software giant Microsoft's research lab at Cambridge University and then at a separate event at Oxford University, both in the United Kingdom.
The travel comes on the heels of Batra's spring acceptance of three major federal research grants worth more than a combined $1 million: a National Science Foundation CAREER Award, a U.S. Army Research Office Young Investigator Award, and a U.S. Office of Naval Research grant.
The awards, valued at $500,000 over five years for the CAREER Award, $150,000 over three years from the Army, and $360,000 over three years from the Navy, all focus on machine learning and computer vision: creating algorithms and techniques that will teach computers to better understand photographic images, and to do so quickly.
Software programs that perform facial recognition are used by scores of law enforcement and security offices, but finding a face can be tricky for computers. A patch of a photographic image may look like a face to a computer vision module, but it may simply be an incidental arrangement of tree branches and shadows, said Batra.
In other words, computers may be “hallucinating” faces floating in thin air. Batra wants to halt such errors with a new visual system that jointly reasons about multiple plausible hypotheses from different vision modules such as 3-D scene layout, object layout, and pose.
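The idea of jointly reasoning over hypotheses from several vision modules can be illustrated with a toy sketch. Below, a hypothetical face detector and a hypothetical scene-layout module each propose multiple scored hypotheses, and a compatibility table (all names and values invented for illustration, not Batra's actual system) rewards pairs that agree; a "face" inside a tree canopy scores poorly, which is how joint reasoning suppresses hallucinated faces.

```python
from itertools import product

# Hypothetical scores from two independent vision modules.
# Each module emits several plausible hypotheses instead of one answer.
face_hypotheses = {"face_at_window": 0.9, "shadow_pattern": 0.85, "no_face": 0.4}
scene_hypotheses = {"building_facade": 0.8, "tree_canopy": 0.75}

# Hypothetical compatibility scores: how well a face hypothesis fits a scene.
# A "face" inside a tree canopy is likely branches and shadows, so it scores low.
compatibility = {
    ("face_at_window", "building_facade"): 1.0,
    ("face_at_window", "tree_canopy"): 0.2,
    ("shadow_pattern", "building_facade"): 0.3,
    ("shadow_pattern", "tree_canopy"): 0.9,
    ("no_face", "building_facade"): 0.5,
    ("no_face", "tree_canopy"): 0.8,
}

def joint_best(faces, scenes, compat):
    """Pick the pair of hypotheses maximizing module scores plus agreement."""
    return max(
        product(faces, scenes),
        key=lambda pair: faces[pair[0]] + scenes[pair[1]] + compat[pair],
    )

best = joint_best(face_hypotheses, scene_hypotheses, compatibility)
print(best)  # the face-and-scene pair that is jointly most plausible
```

Here the joint score favors the face-at-a-window reading over a face "floating" in a tree canopy, even though the face detector alone scored both patches highly.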
“When we see an image, we see things that a computer won’t see. We see people, action, and the environment, the layout of space, and what is in front and behind. We interpret right away emotion, action, and place, the city or the rural country,” said Batra. “Computers cannot do that, it’s just a 2-D image, flat. Computers are not intuitive. We relate. Computers do not do so well with ambiguity.”
Batra said machine perception systems today are often accurate only in a narrow regime, for instance recognizing humans in images only when they are standing upright with limbs at their sides. A person can be mistaken for a tree, and a person hunched over a bicycle traveling down a street may be mistaken for a nearby street lamp.
Batra said improved recognition systems for computers can be used in a variety of ways outside of law enforcement, including in self-driving cars and by emergency rescue personnel looking for people who may be lost in a rural area or trapped in the rubble of a disaster site inside a city.
Once computers better understand the photographs or renderings they are analyzing, they will be able to form multiple hypotheses about each image and interpret it through visual cues. In other words, a computer will be able to tell users whether the person in an image is riding a bicycle along a street or walking along that same street, and whether the scene is a city street, a beach, or a rural forest.
Emotional interpretation is also possible; a computer could recognize whether or not a person is crying.
Once a computer forms its hypotheses about an image, it can present them to the end user for feedback. Batra said the same technology developed for understanding images can be used in self-driving cars to distinguish pedestrians entering the street from nearby objects.
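The hypothesis-and-feedback loop described above can be sketched in a few lines. This is a minimal illustration with invented labels and scores, not a description of Batra's software: the system shows its top-ranked hypotheses to the user, and a confirmation reweights the scores.

```python
# Hypothetical ranked hypotheses about one street scene (invented values).
hypotheses = [
    ("person riding a bicycle", 0.62),
    ("person walking beside a street lamp", 0.31),
    ("street lamp only", 0.07),
]

def top_k(hyps, k=2):
    """Return the k highest-scoring hypotheses to show the end user."""
    return sorted(hyps, key=lambda h: h[1], reverse=True)[:k]

def apply_feedback(hyps, confirmed):
    """Boost the user-confirmed hypothesis, damp the rest, renormalize."""
    boosted = [(label, score * (2.0 if label == confirmed else 0.5))
               for label, score in hyps]
    total = sum(score for _, score in boosted)
    return [(label, score / total) for label, score in boosted]

shown = top_k(hypotheses)
updated = apply_feedback(hypotheses, confirmed="person riding a bicycle")
```

The reweighting rule here is deliberately simplistic; the point is only the interaction pattern, where user feedback sharpens the machine's belief about an ambiguous scene.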
It also could be used in fields outside of Batra's research area, such as voice applications that differentiate accents in mobile software like Apple's Siri, which can understand different languages but has trouble with variations in pronunciation.
As part of his work for the awards, Batra is building a high-end compute cluster with 500 cores and 4 terabytes of RAM, including servers equipped with graphics processing units (GPUs), the latter alone worth $40,000. Silicon Valley-based technology firm NVIDIA is supporting Batra's work through an equipment donation of eight Tesla K40 GPUs.
“This is the period of building the computing infrastructure to support development and execution of the ideas we proposed,” said Batra. “The cluster I am building will have more computer power than all machines in the department put together. Aiming to mimic the human brain’s capabilities is no easy feat, but the future is exciting.”
Batra leads the Virginia Tech Machine Learning and Perception Group and is a member of the Virginia Center for Autonomous Systems and the university's Discovery Analytics Center. He received his bachelor's degree from Banaras Hindu University's Institute of Technology in 2005, and his master's and doctoral degrees from Carnegie Mellon University in 2007 and 2010, respectively, all in electrical and computer engineering.