In December, Google Photos introduced cinematic photos, which the application can generate automatically. Cinematic photos turn a 2D photo into a 3D animation for a more immersive experience. Users can now start to see them in the recent photos section of the app.
Now the company has explained on its official blog how the technology behind this feature works, letting you relive memories with an added layer of motion.
A combination of algorithms and machine learning models
Cinematic photos require a depth map that provides information about the 3D structure of a scene. Depth-estimation techniques on smartphones typically rely on capturing several photos simultaneously from different viewpoints.
To produce this effect for photos that were not captured that way, a convolutional neural network with an encoder-decoder architecture was trained to predict a depth map from a single RGB image. Using just one view, the model learned to estimate depth from monocular cues such as the relative sizes of objects, linear perspective, and defocus blur.
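The encoder-decoder data flow described above can be sketched in a few lines: the encoder compresses the image to a low-resolution bottleneck, and the decoder expands it back so the network emits one depth value per input pixel. This is a minimal NumPy illustration of that shape flow only; a real model like Google's learns convolutional weights from data, which this toy deliberately omits.

```python
import numpy as np

def avg_pool2(x):
    """Encoder step: halve spatial resolution with 2x2 average pooling."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    """Decoder step: double spatial resolution with nearest-neighbour upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_depth_net(rgb):
    """Illustrative encoder-decoder: collapse RGB, downsample twice to a
    bottleneck, then upsample twice back to a dense per-pixel 'depth' map
    the same size as the input. (No learned weights -- structure only.)"""
    gray = rgb.mean(axis=2)              # collapse the three colour channels
    code = avg_pool2(avg_pool2(gray))    # compressed bottleneck representation
    depth = upsample2(upsample2(code))   # expand back to full resolution
    return depth

rgb = np.random.rand(8, 8, 3)
depth = toy_depth_net(rgb)
print(depth.shape)  # (8, 8): one predicted value per input pixel
```

The key property mirrored here is that the output resolution matches the input, which is what lets the depth map drive a per-pixel 3D reconstruction.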
The company created its own dataset to train the monocular depth model using photos captured on a custom 5-camera rig, as well as another dataset of portrait photos captured on Pixel 4.
Combining multiple datasets in this way exposes the model to a greater variety of camera hardware and scenes, with the goal of improving its predictions on photos taken in natural settings.
To mitigate errors in the depth map, a filtering step was applied to refine the edges, and a DeepLab segmentation model trained on the Open Images dataset was also used.
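One standard way to suppress isolated errors in a depth map while keeping edges reasonably sharp is a median filter; the sketch below uses one as an illustrative stand-in, since the post quoted here does not name the exact filter. The 3x3 window size is an assumption for the example.

```python
import numpy as np

def median_filter3(depth):
    """3x3 median filter: replace each pixel with the median of its
    neighbourhood. Isolated outliers vanish, and edges stay sharper
    than they would under a mean (blur) filter."""
    padded = np.pad(depth, 1, mode="edge")   # replicate borders so output size matches
    out = np.empty_like(depth)
    h, w = depth.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + 3, j:j + 3])
    return out

depth = np.ones((5, 5))
depth[2, 2] = 50.0           # a spurious spike in the predicted depth
smoothed = median_filter3(depth)
print(smoothed[2, 2])        # 1.0: the outlier is removed
```

Segmentation (as with DeepLab) complements this by telling the pipeline which pixels belong to the same object, so depth can be corrected consistently within an object's outline.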
One of the challenges in reconstructing a scene in 3D is producing an image that conveys the changes in depth while preserving plausible texture and avoiding noise. Artificial intelligence is used for that step as well.
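The reconstruction step starts by lifting the depth map into 3D. Under the standard pinhole camera model, each pixel (u, v) with depth Z maps to X = (u - cx)·Z/fx, Y = (v - cy)·Z/fy. The sketch below shows that back-projection; the intrinsics (fx, fy, cx, cy) are hypothetical values for illustration, not parameters from Google's pipeline.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map into a 3D point cloud via the pinhole model:
    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth[v, u]."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grids
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)             # shape (h, w, 3)

depth = np.full((4, 4), 2.0)   # a flat wall 2 units from the camera
points = backproject(depth, fx=4.0, fy=4.0, cx=2.0, cy=2.0)
print(points.shape)            # (4, 4, 3)
print(points[2, 2])            # [0. 0. 2.]: the point on the optical axis
```

Once the pixels live in 3D, a virtual camera can be moved through the point cloud (or a mesh built from it) to render the animated parallax effect; regions revealed by the motion are what the texture-synthesis step must fill.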
The last step is to frame the photo. "In general, the reprojected 3D scene does not fit neatly into a portrait-oriented rectangle, so it was also necessary to frame the output to the correct aspect ratio while preserving the key parts of the input image. To achieve this, we use a deep neural network that predicts per-pixel saliency for the full image. When framing the virtual camera in 3D, the model identifies and captures as many salient regions as possible, while ensuring that the rendered mesh fully occupies every frame of the output video," the blog notes.
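The idea of saliency-aware framing can be reduced to its simplest 2D form: score every candidate crop window by the total saliency it contains and keep the best one. This is a toy sketch of that principle only; Google's system frames a virtual camera in 3D with a learned saliency network, whereas here the saliency map and crop size are made-up inputs.

```python
import numpy as np

def best_crop(saliency, crop_h, crop_w):
    """Slide a crop window over a per-pixel saliency map and return the
    top-left corner of the window with the highest total saliency."""
    h, w = saliency.shape
    best, best_score = (0, 0), -np.inf
    for i in range(h - crop_h + 1):
        for j in range(w - crop_w + 1):
            score = saliency[i:i + crop_h, j:j + crop_w].sum()
            if score > best_score:
                best, best_score = (i, j), score
    return best

saliency = np.zeros((6, 8))
saliency[1:4, 5:8] = 1.0             # a salient subject on the right side
i, j = best_crop(saliency, crop_h=4, crop_w=4)
# the chosen 4x4 window shifts right to contain the subject
print((i, j))
```

The 3D version adds one constraint visible in the quote: besides maximizing captured saliency, the rendered mesh must cover every output frame, so no gaps appear as the virtual camera moves.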