NVIDIA’s new AI turns videos of the real world into virtual landscapes


Attendees of this year’s NeurIPS AI conference in Montreal can spend a few moments driving through a virtual city, courtesy of NVIDIA. While that normally wouldn’t be much to get worked up over, the simulation is fascinating because of what made it possible. With the help of some clever machine learning techniques and a handy supercomputer, NVIDIA has cooked up a way for AI to chew on existing videos and use the objects and scenery found within them to build interactive environments.

NVIDIA’s research here isn’t just a significant technical achievement; it also stands to make it easier for artists and developers to craft lifelike virtual worlds. Instead of having to meticulously design objects and people to fill a space polygon by polygon, they can use existing machine learning tools to roughly define those entities and let NVIDIA’s neural network fill in the rest.

“Neural networks, specifically generative models, will change how graphics are created,” Bryan Catanzaro, NVIDIA’s vice president of applied deep learning, said in a statement. “This will enable developers, particularly in gaming and automotive, to create scenes at a fraction of the traditional cost.”

Here’s how it works. Catanzaro told reporters that researchers trained the fledgling neural model on dashcam videos from self-driving car trials in cities, a process that took about a week on one of the company’s DGX-1 supercomputers. (NVIDIA CEO Jensen Huang once called the DGX-1 the equivalent of “250 servers in a box,” so pulling off a similar feat at home seems all but impossible.)


Meanwhile, the research team used Unreal Engine 4 to create what they called a “semantic map” of a scene, which essentially assigns every pixel on-screen a label. Some pixels got lumped into the “car” bucket, others into “trees” or “buildings,” and so on. Those clumps of pixels were also given clearly defined edges, so Unreal Engine ultimately produced a sort of “sketch” of a scene that got fed to NVIDIA’s neural model. From there, the AI applied what it had learned a “car” looks like to the clump of pixels labeled “car” and repeated the same process for every other classified object in the scene. That might sound tedious, but the whole thing happened faster than you might think: Catanzaro said the car simulation ran at 25 frames per second and that the AI rendered everything in real time.
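To make the idea of a semantic map more concrete, here is a toy sketch in Python. It is not NVIDIA’s model or anything from Unreal Engine: the class names, the tiny 4x6 grid, the flat placeholder colors and every function name are assumptions made up for illustration. It only shows the shape of the data, with one label per pixel, edges where labels change, and a stand-in “renderer” that swaps each label for a color where the real system would use a trained neural network.

```python
# Illustrative only: a toy semantic map and a stand-in "renderer".
# All names, labels, and colors are hypothetical; this is not NVIDIA's
# neural model and does not use any Unreal Engine API.

LABELS = {0: "road", 1: "car", 2: "trees", 3: "buildings"}

# Flat placeholder colors (R, G, B) standing in for learned appearance.
PLACEHOLDER_COLOR = {
    "road": (90, 90, 90),
    "car": (200, 30, 30),
    "trees": (30, 140, 40),
    "buildings": (170, 160, 140),
}

# A 4x6 semantic map: every pixel holds exactly one class id.
SEMANTIC_MAP = [
    [3, 3, 3, 2, 2, 2],
    [3, 3, 3, 2, 2, 2],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def edge_map(labels):
    """Mark a pixel as an edge if its right or lower neighbor has a different label."""
    height, width = len(labels), len(labels[0])
    edges = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right_differs = x + 1 < width and labels[y][x + 1] != labels[y][x]
            below_differs = y + 1 < height and labels[y + 1][x] != labels[y][x]
            edges[y][x] = right_differs or below_differs
    return edges


def render(labels):
    """Swap each class id for its placeholder color (a trained generator's job in a real system)."""
    return [[PLACEHOLDER_COLOR[LABELS[class_id]] for class_id in row] for row in labels]


def frame_budget_ms(frames_per_second):
    """Time available to produce one frame at the given frame rate, in milliseconds."""
    return 1000.0 / frames_per_second


if __name__ == "__main__":
    print(render(SEMANTIC_MAP)[2][1])   # color of a pixel labeled "car"
    print(edge_map(SEMANTIC_MAP)[2])    # edge flags along the row containing the car
    print(frame_budget_ms(25))          # per-frame time budget at 25 frames per second
```

The last function is simple arithmetic on the figure quoted above: at 25 frames per second, a renderer has 40 milliseconds to produce each frame.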

NVIDIA’s team also used this new video-to-video synthesis technique to digitally coax a team member into dancing like PSY. Crafting this model took the same kind of work as the car simulation, only this time the AI was tasked with figuring out the dancer’s poses, turning them into rudimentary stick figures and rendering another person’s appearance on top of them.
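The stick-figure step can be pictured with another small Python sketch. Again, this is an assumption-laden illustration rather than NVIDIA’s method: the joint names, the skeleton layout, the sample pose and the text-grid drawing are all hypothetical. It covers only the intermediate representation, turning a handful of joint positions into line segments; rendering another person’s appearance on top is the part the neural network handles and is not attempted here.

```python
# Illustrative only: joint names, skeleton, and pose are hypothetical
# and are not taken from NVIDIA's research.

# Pairs of joints connected by a "bone" in the stick figure.
BONES = [
    ("head", "neck"),
    ("neck", "left_hand"),
    ("neck", "right_hand"),
    ("neck", "hip"),
    ("hip", "left_foot"),
    ("hip", "right_foot"),
]

# One pose: joint name -> (x, y) pixel position, with y growing downward.
POSE = {
    "head": (4, 0),
    "neck": (4, 2),
    "left_hand": (1, 3),
    "right_hand": (7, 3),
    "hip": (4, 5),
    "left_foot": (2, 8),
    "right_foot": (6, 8),
}


def stick_figure(pose):
    """Turn joint positions into line segments, skipping bones with a missing joint."""
    return [(pose[a], pose[b]) for a, b in BONES if a in pose and b in pose]


def rasterize(segments, width, height):
    """Draw segments on a character grid by sampling evenly spaced points along each one."""
    grid = [[" "] * width for _ in range(height)]
    for (x0, y0), (x1, y1) in segments:
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            x = round(x0 + (x1 - x0) * i / steps)
            y = round(y0 + (y1 - y0) * i / steps)
            if 0 <= x < width and 0 <= y < height:
                grid[y][x] = "#"
    return ["".join(row) for row in grid]


if __name__ == "__main__":
    for row in rasterize(stick_figure(POSE), 9, 9):
        print(row)
```

Running the script prints a crude text stick figure. In the pipeline described above, a drawing like this would play the same role the semantic map plays in the driving demo: a rough sketch the AI fills in.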