Saw an interesting talk yesterday. The Visual Geometry Group at Oxford University claimed to have developed a technique for automatically tagging video content (both arbitrary objects and faces) so that it was easily searchable in real time. The faces could even be labelled as particular actors or characters using online resources. The goal was a video version of Google, where you can search by names, faces, logos, object images, anything. Although search is not an area I particularly follow, the work seemed to have important implications for machine vision, and I was amazed at how effective it appeared in the presentation.
The system works by breaking images down into small features ("words"). Hundreds or thousands of these combine to describe objects and scenes, and the description is meant to hold regardless of the object's size in the frame, its position, or even its orientation. This indexing is done off-line and is currently quite slow, though they hope eventually to speed it up to real time.
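To make the "visual words" idea a little more concrete, here is a minimal Python sketch of how a frame might be turned into words. It assumes a bag-of-visual-words design, which is my reading of the talk rather than the group's actual code: each local feature descriptor is assigned to its nearest entry in a vocabulary of cluster centres, and the frame is summarised by how often each word occurs. The vocabulary, the 2-D descriptors and the function names are all hypothetical stand-ins. Real descriptors are much higher-dimensional, and any invariance to size or orientation would have to come from how those descriptors are computed, not from this step.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary: each "visual word" is a cluster centre in
# descriptor space. Real systems would learn thousands of these from
# high-dimensional local descriptors; 2-D points keep the example small.
VOCABULARY = [
    (0.0, 0.0),  # word 0
    (1.0, 0.0),  # word 1
    (0.0, 1.0),  # word 2
    (1.0, 1.0),  # word 3
]


def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary entry closest to `descriptor`."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))


def bag_of_words(descriptors, vocabulary):
    """Quantise each local descriptor to a visual word and count occurrences.

    Only the counts are kept, not where each feature sat in the frame,
    so the result does not depend on the order or position of features.
    """
    return Counter(nearest_word(d, vocabulary) for d in descriptors)


# Hypothetical descriptors extracted from one frame.
frame = [(0.1, 0.05), (0.9, 0.1), (0.95, 0.0), (0.2, 0.9)]
print(bag_of_words(frame, VOCABULARY))
```

Because only the counts survive, shuffling the features or moving them around the frame leaves the histogram unchanged, which is the sense in which position stops mattering.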
Once the video (they’ve been working mainly with movies) has been indexed, we were told, you can select any object in any frame and use it to search the rest of the movie. For instance, you can select an object (in one example they chose a tie from the movie Groundhog Day) and the system returns other instances of that tie, or other ties. The same should work for other kinds of objects. They also have a system for identifying and tracking actors across frames, and then adding the character’s or actor’s name to the index using fansite transcripts of movies and TV shows.
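Searching by a selected object then looks a lot like text retrieval. The sketch below assumes a standard tf-idf scheme: an inverted index maps each visual word to the frames containing it, the selected region is treated as a query made of the visual words inside it, and candidate frames are ranked by cosine similarity. The frame ids, word ids and counts are invented for illustration, and I haven't checked that this matches the group's exact weighting.

```python
import math
from collections import Counter, defaultdict


def build_index(frames):
    """Inverted index: visual word -> set of frame ids that contain it."""
    index = defaultdict(set)
    for frame_id, words in frames.items():
        for w in words:
            index[w].add(frame_id)
    return index


def tfidf(words, index, n_frames):
    """Term frequency times idf = log(N / number of frames containing the word)."""
    total = sum(words.values())
    vec = {}
    for w, count in words.items():
        df = len(index.get(w, ()))
        if df:  # words that appear in no frame carry no information
            vec[w] = (count / total) * math.log(n_frames / df)
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(query_words, frames, index):
    """Rank frames sharing at least one visual word with the query."""
    n = len(frames)
    q = tfidf(query_words, index, n)
    # The inverted index means frames with no shared word are never scored.
    candidates = set().union(*(index.get(w, set()) for w in query_words))
    scores = {f: cosine(q, tfidf(frames[f], index, n)) for f in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical bag-of-words histograms for three frames.
frames = {
    "frame_a": Counter({7: 3, 2: 1}),
    "frame_b": Counter({7: 2, 5: 2}),
    "frame_c": Counter({1: 4}),
}
index = build_index(frames)
query = Counter({7: 1, 2: 1})  # words inside the user's selected region
for frame_id, score in search(query, frames, index):
    print(frame_id, round(score, 3))
```

One consequence of ranking purely on shared word statistics is that a frame can score well by sharing, say, texture words with the query without containing the same object at all, which would be consistent with the tie results I describe below.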
I was quite excited to write about this after watching the presentation, and had intended to read up on their papers to understand the work better before writing this post. But, fortunately or unfortunately, I tried their demo first and decided not to bother: the technology does not work nearly as well as they claimed… at least, not as far as I can see. I encourage you to try the system for yourself and let me know if you have more success. I searched their demo movie Charade for many objects: a pillar, a car, a phone, letters and numbers, a passport, a bottle… and more. I really wanted it to work, because I had been impressed by the talk. I even tried searching for a tie like the one in Groundhog Day.
What I found was that, instead of identifying objects, the system seemed to pick up minor geometric features such as textures and general shapes (long and thin, for example) and match those. In the case of the Groundhog Day tie, I suspect the system was finding the large cross-hatch pattern, not the tie itself. Maybe I was looking for the wrong things; I don’t know. And maybe the face recognition system (of which they have no online demo) really does work better.
But it felt to me as though the searches that work well are the exception rather than the rule, which calls the credibility of the research into question… and suggests that machine vision really is in as poor a state as we all suspected.