New On-Device AI Enables Real-Time Voice and Vision Conversations
A new multimodal AI system lets users hold real-time voice and vision conversations entirely on their own devices. Built on Gemma 4 E2B for speech and vision understanding and Kokoro for text-to-speech, it aims to give language learners a local, cost-effective practice partner.
Everything runs on-device, so conversations feel natural and require no server: Gemma 4 E2B interprets the user's speech and camera input, while Kokoro turns the replies into spoken audio.
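The division of labor described above (one model for understanding, a second for speech synthesis) can be sketched as a simple two-stage pipeline. The functions below are illustrative stand-ins, not the project's actual APIs; only the roles mirror the article.

```python
# Illustrative pipeline: multimodal understanding followed by TTS.
# These are stubs; the real project wires actual models into each stage.

def understand(audio, image):
    """Stand-in for the speech/vision model (Gemma in the article):
    turns user audio plus an optional camera frame into a text reply."""
    # A real model would transcribe the audio, inspect the image, and reason.
    return f"I heard {len(audio)} samples and I can see: {image}"

def synthesize(text):
    """Stand-in for the TTS model (Kokoro in the article):
    turns the reply text into audio samples."""
    # Dummy waveform whose length scales with the text length.
    return [0.0] * (len(text) * 100)

def converse(audio, image):
    """One conversational turn: understand the input, then speak the reply."""
    reply = understand(audio, image)
    return reply, synthesize(reply)

reply, speech = converse([0.0] * 16000, "a red apple")
print(reply)
```

Keeping the two stages separate is what allows each model to be swapped or updated independently.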
This early research project is designed to help users learn English: they can speak to the AI and show it objects, and it responds in real time.
The technology is currently compatible with macOS on Apple Silicon and Linux with a supported GPU, requiring approximately 3 GB of free RAM.
Users can start the application by cloning the GitHub repository and launching a local server, then talk to the AI through a web browser.
The system features voice activity detection, hands-free operation, and the ability to interrupt the AI during responses.
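Voice activity detection of the kind mentioned above is commonly implemented with a short-term energy threshold: a frame is treated as speech when its energy rises above the background level. The sketch below is a generic illustration of that idea; the function names and threshold value are assumptions, not the project's actual code.

```python
# Minimal energy-based voice activity detector (illustrative only;
# the real system may use a more sophisticated VAD method).

def frame_energy(samples):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in samples) / len(samples)

def is_speech(samples, threshold=0.01):
    """Flag a frame as speech when its energy exceeds the threshold.

    The threshold is an assumed tuning parameter; real systems
    calibrate it against measured background noise.
    """
    return frame_energy(samples) > threshold

# Example: a near-silent frame vs. a sustained loud frame.
quiet = [0.001] * 160   # ~10 ms of near-silence at 16 kHz
loud = [0.5] * 160      # sustained loud signal

print(is_speech(quiet))  # low energy  -> False
print(is_speech(loud))   # high energy -> True
```

The same energy signal can drive interruption handling: if speech is detected while the AI is talking, playback stops and the new utterance is processed.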
The developers view on-device conversation as a promising direction for language learning and hope to bring the system to mobile devices in the future.