Google DeepMind, the company’s AI research wing, first unveiled Project Astra at I/O this year. Now, more than six months later, the tech giant has announced new capabilities and improvements for the artificial intelligence (AI) agent. Drawing upon Gemini 2.0 AI models, it can now converse in multiple languages and access multiple Google platforms, and it has improved memory. The tool is still in the testing phase, but the Mountain View-based tech giant stated that it is working to bring Project Astra to the Gemini app, Gemini AI assistant, and even form factors like glasses.
Google Adds New Capabilities in Project Astra
Project Astra is a general-purpose AI agent that is similar in functionality to OpenAI’s vision mode or the Meta Ray-Ban smart glasses. It can integrate with camera hardware to see the user’s surroundings and process the visual data to answer questions about them. Additionally, the AI agent comes with limited memory that allows it to remember visual information even when it is not actively being shown via the camera.
Google DeepMind highlighted in a blog post that ever since the showcase in May, the team has been working on improving the AI agent. Now, with Gemini 2.0, Project Astra has received several upgrades. It can now converse in multiple languages, including a mix of languages. The company said that the agent now has a better understanding of accents and uncommon words.
The company has also introduced tool use in Project Astra. It can now draw upon Google Search, Lens, Maps, and Gemini to answer complex questions. For instance, users can show the agent a landmark and ask for directions to their home; it can then recognise the landmark and verbally direct the user home.
The memory function of the AI agent has also been upgraded. Back in May, Project Astra could only retain visual information from the last 45 seconds; this has now been extended to 10 minutes of in-session memory. It can also remember more past conversations to offer more personalised responses. Finally, Google claims that the agent can now understand language at the latency of human conversation, making interactions with the tool more human-like.