Large language models (LLMs) have advanced remarkably, making it inevitable that we will soon interact with personal voice assistants that are knowledgeable about everything and able to carry out actions for us as well. They will be capable of handling almost any task a super smart helper could: vacation planning, tour guiding, education, entertainment, shopping and more!
Many of the current challenges with LLMs are rapidly being resolved. Newer models like GPT-4o can combine vision with speech for enhanced intelligence. The models are being updated more frequently, reducing outdated information, and the updates are happening faster by breaking large models into submodels that can be independently controlled. Hallucinations can be minimized by cross-checking outputs for accuracy and citing sources.
However, a significant problem remains unaddressed: these assistants can't be always on, always listening and watching, and constantly sending data to a cloud-based LLM. That would consume too much bandwidth, destroy privacy, and lead to a lot of unintended interruptions. More generally, the challenge is how we get an assistant's attention when we want it, without being interrupted when we don't, in a way that preserves privacy and conserves energy.
Full disclosure: I started an on-device neural net speech recognition company called Sensory many years ago. We specialize in high-accuracy on-device speech recognition, including wake words. Many credit Sensory with creating the first usable wake words. Sensory is likely the only company that has licensed wake word technology to major players like Google, Amazon, Microsoft, Cupertino, Samsung, Huawei, and Baidu, and we are certainly the only company that has licensed to all of these companies and hundreds of others. Sensory technology has shipped in over 3 billion products, and many of these used our wake words. If you've used speech recognition, you've probably used Sensory technology, most likely in a wake word. So, I have a bias for wake words!
Current Approaches to Talking to Voice Assistants
Many people want wake words to disappear, envisioning a smart assistant that “knows” when you’re talking to it. This might be ideal, but it’s unlikely. Instead, voice assistants will feature a hybrid approach that combines wake words with on-device vision, touch, and various other low-power sensors to detect people, noise, and environments.
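To make the idea concrete, here is a minimal sketch of how such a hybrid attention gate might work: an on-device wake word score is fused with cheap corroborating signals like gaze or touch, and audio is only streamed to the cloud once the gate fires. The names, fields, and thresholds below are hypothetical illustrations for this article, not any vendor's actual API.

```python
# Sketch of a hybrid "attention" gate: a low-power wake word detector runs
# continuously on-device, and other cheap signals (gaze, touch) raise or lower
# the decision threshold. Only after the gate fires would audio be streamed to
# the cloud LLM. All names and thresholds are hypothetical.

from dataclasses import dataclass

@dataclass
class SensorFrame:
    wake_word_score: float    # 0..1 score from the on-device wake word model
    user_facing_device: bool  # e.g. from a low-power camera or gaze model
    touch_pressed: bool       # physical or on-screen push-to-talk

def attention_triggered(frame: SensorFrame, base_threshold: float = 0.85) -> bool:
    """Return True when the assistant should start listening and streaming."""
    if frame.touch_pressed:           # explicit user intent always wins
        return True
    threshold = base_threshold
    if frame.user_facing_device:      # a corroborating cue lets us be less strict
        threshold -= 0.15
    return frame.wake_word_score >= threshold

# A marginal wake word score passes only when the user is facing the device.
print(attention_triggered(SensorFrame(0.75, True, False)))   # True
print(attention_triggered(SensorFrame(0.75, False, False)))  # False
```

The point of the sketch is the division of labor: everything above runs locally on cheap, always-on hardware, and the expensive, privacy-sensitive cloud connection is opened only after the gate decides you actually want the assistant's attention.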
Let’s examine some of the current approaches: