A superhero who could see two seconds into the future wouldn’t be invincible, but she’d have a leg up on mere mortals. On Monday, the Massachusetts Institute of Technology announced a new artificial intelligence that is a prototype of such a being. From a single photograph, it can predict what might happen next, then spit out a one-and-a-half-second video clip depicting that possible future. The breakthrough could lead to smarter autonomous cars or security systems.
MIT researchers trained the A.I. on more than two million videos using a two-pronged deep-learning system. One neural network learned to generate video; the other learned to tell real videos from fake ones. The two networks learn together through what’s called adversarial learning: They compete to outsmart each other. “One network (‘the generator’) tries to generate a synthetic video, and another network (‘the discriminator’) tries to discriminate synthetic versus real videos,” lead author Carl Vondrick writes. “The generator is trained to fool the discriminator.”
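To make the generator-versus-discriminator contest concrete, here is a minimal sketch of adversarial training on single numbers instead of videos. It is not MIT’s model, which works on video with deep neural networks. The toy setup is a hypothetical illustration: a generator that only learns to shift random noise by an offset `b`, a one-feature logistic-regression discriminator, and the made-up function name `train_toy_gan`. Each step, the discriminator gets a little better at telling “real” numbers from generated ones, and the generator then nudges its output toward whatever the discriminator currently considers real.

```python
import math
import random


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(real_mean=3.0, steps=2000, batch=64, lr=0.1, seed=0):
    """Hypothetical toy: adversarial training on 1-D numbers, not videos.

    Generator:     G(z) = z + b, with z ~ N(0, 1); it learns only the shift b.
    Discriminator: D(x) = sigmoid(w * x + c), its estimate that x is real.
    """
    rng = random.Random(seed)
    b = 0.0          # generator parameter, starts away from real_mean
    w, c = 0.0, 0.0  # discriminator parameters

    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]

        # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
        gw = gc = 0.0
        for x in real:
            g = -(1.0 - sigmoid(w * x + c))  # derivative of -log sigmoid(t)
            gw += g * x
            gc += g
        for x in fake:
            g = sigmoid(w * x + c)  # derivative of -log(1 - sigmoid(t))
            gw += g * x
            gc += g
        w -= lr * gw / batch
        c -= lr * gc / batch

        # Generator step: minimize -log D(G(z)), i.e. try to fool the
        # freshly updated discriminator. Chain rule with dx/db = 1.
        gb = 0.0
        for _ in range(batch):
            x = rng.gauss(0.0, 1.0) + b
            gb += -(1.0 - sigmoid(w * x + c)) * w
        b -= lr * gb / batch

    return b, w, c


if __name__ == "__main__":
    b, w, c = train_toy_gan()
    print(f"generator mean: {b:.2f} (real data mean: 3.0)")
    print(f"D at the real mean: {sigmoid(w * 3.0 + c):.2f}")
```

In this sketch the generator’s offset should drift toward the real data’s mean, at which point the discriminator’s output hovers near 0.5, a coin flip. The generator only learns a shift because a discriminator that looks at a single linear feature can mainly tell two such distributions apart by their means; real video models need far richer networks on both sides.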
The results are promising, even though the tiny, low-resolution squares sometimes look like something from a nightmare. It’s a bit like Google’s DeepDream, but for video, and for the future. Any researchers who could benefit from a glimpse into the future, even a speculative one, will be thrilled by MIT’s A.I. development.
For example, autonomous cars rely in part on computer vision: On-board cameras send images of their surroundings to the car’s A.I., which deciphers those images and reacts to them. (Tesla, arguably the leading autonomous car company, has lately begun to rely more on radar.) When human drivers see a pedestrian on the move, they can infer where that pedestrian will be next and plan accordingly. A current autonomous car that “sees” a pedestrian can’t make that inference. One equipped with MIT’s predictive video system might come close.
These are the early days for such predictive systems, and the outputs are low-resolution and not always plausible. To gauge how well the generator had learned to fool its A.I. discriminator, MIT had human observers serve as the final discriminators.
“The algorithm generated videos that human subjects deemed to be realistic 20 percent more often than a baseline model,” Adam Conner-Simons reports for MIT News.
In the future, the researchers expect that this system’s descendants will generate higher-resolution, longer predictive videos. Our clairvoyant superhero is only going to improve.