Large Language Models (LLMs) are rapidly improving, with fewer hallucinations, more thoughtfully researched answers, and easier integration. However, they are also growing exponentially in size, computational needs, energy consumption, and training costs.
In parallel with the rise in capabilities, consumption is also growing rapidly. The latest multimodal models with voice input are driving growth in interactive voice assistants across markets such as finance, healthcare, enterprise, consumer, automotive, hospitality, and retail/food service.
Most of today's LLM growth is in cloud services, and much of it is not yet fully monetized through SaaS models, monthly fees, or advertising. Keeping everything in the cloud is an expensive venture that users may not appreciate, whether because of cost or the intrusion of advertising. However, moving everything on-device is also impractical, since running LLMs locally demands expensive hardware.
There are three major platforms for LLMs: Cloud Training, Cloud Inference, and On-Device Inference. Training requires larger and larger systems and will remain a cloud-based offering. The inference market will grow to be much bigger than training and will likely rely on hybrid approaches combining client and cloud.
If hotwords are deployed, they will need to run on-device, essential for both bandwidth and privacy. It's likely that these voice-activated, low-power wakewords will evolve into intelligent on-device speech controllers that do not always require a hotword to be spoken, but instead intelligently adjust their detection parameters to better suit the user experience and optimize performance. The sketch below shows the basic always-on loop.
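To make the on-device requirement concrete, here is a minimal Python sketch of an always-on detection loop, assuming a hypothetical `detector_score` function in place of a real wakeword engine (this is not Sensory's API). The point it illustrates: no audio ever leaves the device unless the detector fires.

```python
import numpy as np

SAMPLE_RATE = 16_000            # 16 kHz mono is common for wakeword engines
FRAME_MS = 20                   # small frames keep latency and power low
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000

def detector_score(frame: np.ndarray) -> float:
    """Hypothetical stand-in for a tiny on-device wakeword model.

    A real engine returns a per-frame confidence; frame energy (RMS)
    is used here only so the loop runs end to end.
    """
    return float(np.sqrt(np.mean(frame ** 2)))

def listen(frames, threshold: float = 0.5):
    """Always-on loop that runs entirely on-device.

    An "intelligent speech controller" might adapt `threshold` per
    user and environment rather than keeping it fixed.
    """
    for frame in frames:
        if detector_score(frame) > threshold:
            yield "wakeword detected -> open mic / start STT"

if __name__ == "__main__":
    quiet = [np.random.randn(FRAME_LEN) * 0.01 for _ in range(5)]
    loud = [np.random.randn(FRAME_LEN)]      # synthetic "trigger" frame
    for event in listen(quiet + loud):
        print(event)
```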
After the hotword, I expect a growing number of use cases will opt for on-device speech-to-text (STT) transcription. This also helps privacy and bandwidth: sending text is not only cheaper but faster, since text can be roughly 1/1000th the size of the equivalent audio. The back-of-the-envelope arithmetic below shows why.
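A rough check of that ratio, under stated assumptions (16 kHz/16-bit mono PCM audio, a speaking rate of about 150 words per minute, and roughly 6 bytes per word of UTF-8 text):

```python
# Assumed, not measured: raw PCM audio vs. plain-text transcript.
audio_bytes_per_sec = 16_000 * 2          # 32,000 B/s of 16-bit PCM
words_per_sec = 150 / 60                  # ~2.5 spoken words per second
text_bytes_per_sec = words_per_sec * 6    # ~15 B/s of UTF-8 text

print(f"audio: {audio_bytes_per_sec:,} B/s")
print(f"text:  {text_bytes_per_sec:.0f} B/s")
print(f"ratio: ~1/{audio_bytes_per_sec / text_bytes_per_sec:,.0f}")
# Roughly 1/2000 for raw audio; compressed speech codecs narrow the
# gap, but text remains orders of magnitude smaller either way.
```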
Small LLMs, with under 7 billion parameters, have emerged. Quantization and other approaches can shrink these down to run efficiently on-device. However, Sensory's experimentation has found that this shrinking works well only for domain-specific applications. For general intelligence and open-ended conversational dialogue, bigger is way better, and those LLMs will likely run in the cloud for some time, or possibly on high-end products costing many thousands or tens of thousands of dollars. A quick sizing exercise, below, shows why quantization is what makes the on-device end feasible.
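A minimal sizing sketch, counting weight storage only (activations and KV cache excluded), illustrates how quantization moves a 7B-parameter model into on-device territory:

```python
# Weight memory for a 7B-parameter model at common precisions.
params = 7e9
bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: {params * nbytes / 2**30:5.1f} GiB")
# fp32: ~26 GiB, fp16: ~13 GiB, int8: ~6.5 GiB, int4: ~3.3 GiB.
# Only the 4-bit (and sometimes 8-bit) variants fit in the RAM of a
# phone or embedded SoC, which is why quantization matters on-device.
```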
Sensory is seeing increasing demand for hybrid approaches with a few distinct architectures:
- Wakeword on-device, LLM in the cloud
- Wakeword and STT on-device, LLM in the cloud
- Wakeword and STT on-device with a domain-specific tiny LLM on-device
- Wakeword and STT on-device with a domain-specific small LLM on-device AND cognitive arbitration to decide when to send a query to a cloud LLM (a minimal routing sketch follows this list)
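Here is a minimal, hypothetical sketch of that fourth architecture's arbitration step. The keyword gate and confidence floor are illustrative stand-ins for whatever signals a production arbiter would use (a real one would also weigh latency, connectivity, cost, and privacy policy), and `local_llm`/`cloud_llm` are stub callables, not real APIs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Arbiter:
    """Route each query to the on-device small LLM or to a cloud LLM."""
    local_llm: Callable[[str], tuple[str, float]]  # returns (answer, confidence)
    cloud_llm: Callable[[str], str]
    domain_keywords: frozenset[str]
    confidence_floor: float = 0.7

    def answer(self, transcript: str) -> str:
        in_domain = bool(set(transcript.lower().split()) & self.domain_keywords)
        if in_domain:
            reply, confidence = self.local_llm(transcript)
            if confidence >= self.confidence_floor:
                return reply               # stays on-device: fast, private, no cloud cost
        return self.cloud_llm(transcript)  # escalate only when needed

# Usage with stubs standing in for real engines:
arbiter = Arbiter(
    local_llm=lambda q: ("Preheating the oven to 350F.", 0.9),
    cloud_llm=lambda q: "[cloud LLM answer]",
    domain_keywords=frozenset({"oven", "preheat", "timer"}),
)
print(arbiter.answer("preheat the oven please"))     # handled on-device
print(arbiter.answer("who won the 1998 World Cup"))  # escalated to the cloud
```

The design goal is simply to keep as many turns as possible on-device and pay cloud costs only when the local model is out of its depth.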
We are also seeing interest in wakeword and STT on-device with an on-device small language model (SLM), but SLM quality does not seem to be there yet… it will probably get there as LLMs and processing capabilities continue to improve.