Amazon is starting to shift Alexa’s cloud AI to its own silicon

Video: Amazon engineers discuss migrating 80 percent of Alexa’s workload to Inferentia ASICs (three-minute clip).

On Thursday, an Amazon AWS blog post announced that the company has moved most of the cloud processing for its Alexa personal assistant from Nvidia GPUs to its own Inferentia application-specific integrated circuit (ASIC). Amazon developer Sebastien Stormacq describes Inferentia’s hardware design as follows:

AWS Inferentia is a custom chip built by AWS to accelerate machine learning inference workloads and optimize their costs. Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a powerful systolic matrix multiplication engine, which greatly accelerates typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps to reduce access to external memory, drastically reducing latency and increasing throughput.
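That emphasis on matrix multiplication matters because the operations Stormacq names can be lowered to matrix products. The toy NumPy sketch below is not Inferentia-specific and uses deliberately tiny shapes; it simply shows a 2-D convolution rewritten as a single matrix multiplication, the kind of operation a systolic array executes efficiently.

```python
import numpy as np

# Toy illustration: lower a small 2-D convolution to one matrix product
# ("im2col"). Shapes are tiny and purely illustrative.
H = W = 4          # input height/width
K = 3              # kernel size
image = np.arange(H * W, dtype=np.float32).reshape(H, W)
kernel = np.ones((K, K), dtype=np.float32)

# Unroll every KxK patch of the image into a row of a matrix...
patches = np.stack([
    image[i:i + K, j:j + K].ravel()
    for i in range(H - K + 1)
    for j in range(W - K + 1)
])                                           # shape: (4, 9)

# ...so the whole convolution becomes one matrix-vector product.
conv_as_matmul = patches @ kernel.ravel()    # shape: (4,)

# Cross-check against a direct sliding-window convolution.
direct = np.array([
    (image[i:i + K, j:j + K] * kernel).sum()
    for i in range(H - K + 1)
    for j in range(W - K + 1)
])
assert np.allclose(conv_as_matmul, direct)
print(conv_as_matmul)
```

Transformer layers are even more directly dominated by matrix multiplies, which is why a chip built around that single primitive can cover so much of a deep learning inference workload.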

When an Amazon customer — usually someone who owns an Echo or Echo Dot — talks to the Alexa assistant, very little of the processing happens on the device itself. The workload for a typical Alexa request looks something like this (a rough code sketch of the same pipeline follows the list):

  1. A human speaks to an Amazon Echo and says, “Alexa, what’s the special ingredient in Earl Grey tea?”
  2. The Echo detects the wake word – Alexa – using its own built-in processing
  3. The Echo streams the request to Amazon’s data centers
  4. Within the Amazon data center, the voice stream is converted to phonemes (Inference AI workload)
  5. Still in the data center, phonemes are converted into words (Inference AI workload)
  6. Words are merged into phrases (Inference AI workload)
  7. Phrases are distilled into intent (Inference AI workload)
  8. The intent is forwarded to an appropriate fulfillment service, which returns a response as a JSON document
  9. The JSON document is parsed, including the text of Alexa’s response
  10. The text form of Alexa’s response is converted to natural sounding speech (Inference AI workload)
  11. Natural speech audio is streamed back to the Echo device for playback – “It’s bergamot orange oil.”
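
To make the cloud-side flow concrete, here is a minimal Python sketch of the same pipeline. Every function name and body is a hypothetical placeholder, not one of Amazon’s actual services; the stubs just return canned values so the control flow runs end to end.

```python
import json

# Hypothetical stand-ins for the pipeline stages above. None of these names
# correspond to Amazon's actual internal services, and the bodies are canned
# placeholders rather than real models; the point is the shape of the flow.

def speech_to_phonemes(audio):         # inference AI workload (step 4)
    return ["W", "AH", "T", "S"]

def phonemes_to_words(phonemes):       # inference AI workload (step 5)
    return ["what's", "the", "special", "ingredient", "in", "earl", "grey", "tea"]

def words_to_phrases(words):           # inference AI workload (step 6)
    return ["special ingredient", "earl grey tea"]

def phrases_to_intent(phrases):        # inference AI workload (step 7)
    return {"intent": "IngredientQuery", "item": "earl grey tea"}

def call_fulfillment_service(intent):  # ordinary service call (step 8)
    return json.dumps({"responseText": "It's bergamot orange oil."})

def text_to_speech(text):              # inference AI workload (step 10)
    return text.encode("utf-8")        # placeholder for synthesized audio

def handle_alexa_request(audio_stream):
    """Cloud-side handling once the Echo has streamed a request (steps 4-11)."""
    phonemes = speech_to_phonemes(audio_stream)
    words = phonemes_to_words(phonemes)
    phrases = words_to_phrases(words)
    intent = phrases_to_intent(phrases)
    response = json.loads(call_fulfillment_service(intent))
    return text_to_speech(response["responseText"])

print(handle_alexa_request(b"raw audio from the Echo"))
```

Most of the cloud-side steps are inference calls, which is why the choice of accelerator behind them has such a direct effect on cost and latency.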

As you can see, almost all of the actual work done to fulfill an Alexa request happens in the cloud, not in an Echo or Echo Dot device itself. And the vast majority of that cloud work is done not by traditional if-then logic, but by inference, the answer-producing side of neural network processing (as opposed to training).

According to Stormacq, shifting this inference workload from Nvidia GPU hardware to Amazon’s own Inferentia chip resulted in a 30 percent reduction in cost and a 25 percent improvement in end-to-end latency on Alexa’s text-to-speech workloads. Alexa isn’t the only workload running on the Inferentia processor: the chip also powers Amazon’s AWS Inf1 instances, which are available to the general public and compete with Amazon’s GPU-powered G4 instances.

Amazon’s AWS Neuron software development kit allows machine learning developers to use Inferentia as a target for popular frameworks, including TensorFlow, PyTorch, and MXNet.
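
As a rough illustration of that workflow, the sketch below follows the pattern in AWS’s torch-neuron documentation: trace a stock PyTorch model into a Neuron-compiled artifact on a build machine, then run it on an Inf1 instance. The model choice and file name are arbitrary, and the exact package versions and APIs may have changed since this was written.

```python
import torch
import torch_neuron  # AWS Neuron SDK integration for PyTorch (pip package: torch-neuron)
from torchvision import models

# Load a stock model and put it in inference mode.
model = models.resnet50(pretrained=True)
model.eval()

# A dummy input with the shape the model expects; tracing records the
# operators actually executed on this example.
example = torch.zeros([1, 3, 224, 224], dtype=torch.float32)

# Compile for Inferentia. Operators Neuron cannot compile are left to run
# on the host CPU.
model_neuron = torch.neuron.trace(model, example_inputs=[example])

# Save the compiled TorchScript artifact. On an Inf1 instance it can be
# loaded with torch.jit.load() and called like any other module, with the
# compiled portions executing on the NeuronCores.
model_neuron.save("resnet50_neuron.pt")
```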

Listing image by Amazon
