OpenAI, the AI research company co-founded by Elon Musk, has unveiled its latest creation, Point-E, a system capable of producing 3D point clouds directly from text prompts. Unlike existing text-to-3D systems that can require hours of optimization and multiple GPUs to generate a single model, Point-E needs only one GPU and a couple of minutes. This speedup has significant implications for industries that rely on 3D modeling, including CGI effects in film, video games, VR and AR, and NASA's mapping missions. Despite efforts to automate object generation, and despite mobile apps that scan real-world objects into 3D models, creating photorealistic 3D assets remains a time-consuming and resource-intensive process.
Text-to-image systems like OpenAI’s DALL-E 2 have gained immense popularity in recent years, and text-to-3D is a natural extension of that research. Point-E sets itself apart from other systems by leveraging a large corpus of text–image pairs, enabling it to handle complex prompts. It first samples an image with a text-to-image model, then samples a 3D object conditioned on that image, producing a 3D object from a text prompt in a matter of minutes, without the expensive optimization procedures other methods require.
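The two-stage cascade described above can be sketched as follows. This is a toy illustration only: the stub functions stand in for Point-E's actual diffusion models (a GLIDE-style text-to-image model and an image-conditioned point-cloud model), which this code does not implement.

```python
import numpy as np


def text_to_image(prompt: str, size: int = 64) -> np.ndarray:
    """Stub for the text-to-image diffusion model.

    Returns a synthetic RGB view of the object described by the prompt.
    Here we just derive a deterministic dummy image from the prompt text.
    """
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((size, size, 3))


def image_to_point_cloud(image: np.ndarray, n_points: int = 1024) -> np.ndarray:
    """Stub for the image-conditioned point-cloud diffusion model.

    Returns an (n_points, 6) array: XYZ coordinates plus RGB color.
    The real model is conditioned on the rendered view; here the view
    merely seeds the random generator.
    """
    rng = np.random.default_rng(int(image.sum() * 1e6) % (2**32))
    xyz = rng.normal(size=(n_points, 3))
    rgb = rng.random((n_points, 3))
    return np.concatenate([xyz, rgb], axis=1)


def text_to_3d(prompt: str) -> np.ndarray:
    """Overall pipeline: text -> single rendered view -> colored point cloud."""
    view = text_to_image(prompt)
    return image_to_point_cloud(view)


cloud = text_to_3d("a red traffic cone")
print(cloud.shape)  # (1024, 6)
```

The key design point the sketch conveys is that the 3D model never sees the text directly: all conditioning flows through the intermediate rendered image, which is what lets Point-E inherit the prompt-handling ability of a large text-to-image model.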
To create a 3D point cloud from a text prompt, Point-E first generates a single synthetic rendered view of the object from the prompt. This image is then fed through a cascade of diffusion models to produce a 3D RGB point cloud: first a coarse 1,024-point cloud, then a finer 4,096-point one. Although evaluations show that Point-E's sample quality falls slightly short of state-of-the-art text-to-3D techniques, it dramatically reduces the time required to generate samples. OpenAI has released the open-source code for the project on GitHub for anyone interested in exploring and experimenting with it.
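The coarse-to-fine step can be illustrated with a minimal sketch. In Point-E the upsampler is itself a diffusion model conditioned on the coarse cloud; the toy function below only mimics the shape of that step, replicating and jittering coarse points rather than learning any detail.

```python
import numpy as np


def upsample_point_cloud(coarse: np.ndarray, target: int = 4096,
                         noise_scale: float = 0.02,
                         seed: int = 0) -> np.ndarray:
    """Toy stand-in for the upsampler stage (coarse 1,024 -> fine 4,096).

    Keeps every coarse point, then fills in the remaining points by
    sampling parents from the coarse cloud and jittering their XYZ
    coordinates, so the fine cloud stays close to the coarse geometry.
    """
    rng = np.random.default_rng(seed)
    n_coarse = coarse.shape[0]
    # Choose parent points (with replacement) for the extra samples.
    idx = rng.integers(0, n_coarse, size=target - n_coarse)
    jittered = coarse[idx].copy()
    jittered[:, :3] += rng.normal(scale=noise_scale, size=(len(idx), 3))
    return np.concatenate([coarse, jittered], axis=0)


# Each row: XYZ position followed by RGB color, as in the coarse stage.
coarse = np.random.default_rng(1).normal(size=(1024, 6))
fine = upsample_point_cloud(coarse)
print(fine.shape)  # (4096, 6)
```

Splitting generation into a cheap coarse stage plus an upsampler is a common trick in diffusion cascades: the expensive model only has to produce 1,024 points, and the refinement stage adds density conditioned on that result.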