Deep learning and Siri by Alex Acero @ Apple
AI and ML
- Artificial intelligence vs. human intelligence: the imitation game; the Eugene Goostman chatbot reportedly passed the Turing Test in 2014
- AlphaGo, DeepMind, 2015
Introduction to deep learning
- Machine learning: improve on task T, with respect to performance metric P, based on experience E
- Perceptron learning (one-layer NN): w(t) = w(t-1) + n * (target - output) * x, where n is the learning rate and x is the input (see the sketch after this list)
- 1974: multi-layer perceptron with backpropagation training
- Deep learning is old tricks; more computing power and more data make it possible and powerful
- Binary classification: Touch ID, speaker verification, face verification, email spam, motion detection, credit card fraud
- N-ary classification: MNIST (handwriting), speaker identification, word prediction (typing on iPhone)
- Deep learning for speech (Deng, 2010)
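A minimal sketch of that perceptron update rule, with my own variable names (the talk didn't show an implementation), trained on a toy linearly separable problem:

```python
import numpy as np

def train_perceptron(X, targets, lr=0.1, epochs=20):
    """Perceptron learning: w(t) = w(t-1) + lr * (target - output) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, targets):
            output = 1.0 if np.dot(w, x) + b > 0 else 0.0
            # The update from the notes: scale the input by the error.
            w += lr * (t - output) * x
            b += lr * (t - output)
    return w, b

# Toy binary classification: learn logical AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w, b = train_perceptron(X, y)
print([1.0 if np.dot(w, x) + b > 0 else 0.0 for x in X])  # [0.0, 0.0, 0.0, 1.0]
```

On linearly separable data like AND, the perceptron convergence theorem guarantees this loop reaches a correct separator in finitely many updates.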
Intro to Siri
- Pipeline: speech -> ASR -> NL (natural language) -> dialog -> action (what to do), with TTS (text to speech) for the spoken response (see the sketch after this list)
- Hands-free Siri: "Hey Siri"; a low-power audio processor makes it possible to listen all the time
- Dictation: voice-to-text messages, voicemail transcripts; 21 languages
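A toy end-to-end sketch of that pipeline. Every stage here is a placeholder stub of my own; the real components are large ML systems, and none of these function names come from Apple:

```python
# Toy sketch of the pipeline from the notes: speech -> ASR -> NL -> dialog ->
# action, with TTS producing the spoken reply. All stage bodies are placeholders.

def asr(audio: bytes) -> str:
    # Placeholder: pretend recognition always yields this transcript.
    return "set a timer for ten minutes"

def nl(transcript: str) -> dict:
    # Placeholder intent parser: keyword spotting instead of a real NL model.
    if "timer" in transcript:
        return {"intent": "set_timer", "minutes": 10}
    return {"intent": "unknown"}

def dialog(parse: dict) -> dict:
    # Placeholder dialog policy: choose an action and a reply to speak.
    if parse["intent"] == "set_timer":
        return {"action": "start_timer", "reply": "OK, ten minutes."}
    return {"action": "none", "reply": "Sorry, I didn't get that."}

def tts(reply: str) -> bytes:
    # Placeholder synthesis: a real system returns waveform audio.
    return reply.encode("utf-8")

def handle_utterance(audio: bytes) -> bytes:
    return tts(dialog(nl(asr(audio)))["reply"])

print(handle_utterance(b"\x00fake-audio"))  # b'OK, ten minutes.'
```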
Deep learning in TTS
- Concatenative TTS (vs. a parametric synthesizer)
- Pipeline: text -> text analysis -> prosody generation (text processing) -> unit selection -> waveform concatenation -> speech (signal processing), with a speech-unit database feeding unit selection
- The unit database is the key; unit selection also uses DNN learning: target cost + concatenation cost (see the sketch after this list)
- Acoustic features: prosodic, spectral; a parametric HMM synthesizer predicts the spectral targets, and the concatenative engine stitches the units together
- Linguistic features: phoneme, syllable, word, phrase, utterance
- Deltas/variance at the concatenation points
- Big differences from iOS 9 to iOS 10
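Unit selection is naturally a dynamic-programming search: pick one candidate unit per target position so the summed target cost plus concatenation cost is minimal. A sketch with stand-in cost functions (the notes say real systems score these costs with DNNs over the acoustic and linguistic features above; everything below is illustrative):

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """targets: list of target specs; candidates[i]: candidate units for position i."""
    # best[i][j] = (cost of best path ending at candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy example: units are numbers, costs are absolute differences;
# the concatenation cost prefers smooth +1.0 steps between adjacent units.
targets = [1.0, 2.0, 3.0]
cands = [[0.9, 1.5], [1.8, 2.4], [2.9, 3.5]]
tcost = lambda t, u: abs(t - u)
ccost = lambda a, b: abs((b - a) - 1.0)
print(select_units(targets, cands, tcost, ccost))  # [0.9, 1.8, 2.9]
```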
Automatic speech recognition (accents)
- en_IN for Indian English speakers
- "World English 1" for all varieties of English
- "World English 2": training data from the varieties mixed together
Siri: bringing conversational agents mainstream
Q&A
My question: does Siri run on a dedicated chip or on the general CPU? Answer: an ultra-low-power CPU core listens all the time, waiting for the "Hey Siri" trigger phrase; when it detects that signal, it powers up the main CPU for the rest of the work. From there, the input voice is recorded and compressed, then sent to online servers for speech recognition and everything downstream. (I guess the TTS part works the same way: the runtime part of the DNN runs on cloud servers, and the voice units are sent back to the phone.) (So both training and runtime are done on GPU farms.)
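A toy sketch of that two-stage flow: a cheap always-on check gates the expensive record-compress-upload path. The threshold, the scoring function, and both helpers are made up for illustration:

```python
# Illustrative two-stage wake-word flow from the answer above: a tiny always-on
# detector gates the expensive path. Nothing here reflects Apple's actual code.

import zlib

WAKE_THRESHOLD = 0.9  # hypothetical confidence cutoff

def wake_word_score(frame: bytes) -> float:
    # Stand-in for the low-power core's detector score for "Hey Siri".
    return 1.0 if frame.startswith(b"hey siri") else 0.0

def record_utterance() -> bytes:
    # Placeholder for capturing the user's speech after the trigger.
    return b"...captured audio after the trigger..."

def send_to_server(payload: bytes) -> None:
    # Placeholder for the upload to server-side ASR / NL / TTS.
    print(f"uploading {len(payload)} bytes for server-side recognition")

def handle_audio_stream(frames):
    for frame in frames:
        # Stage 1: cheap always-on check, conceptually on the low-power core.
        if wake_word_score(frame) < WAKE_THRESHOLD:
            continue
        # Stage 2: "power up the main CPU": record, compress, send to server.
        send_to_server(zlib.compress(record_utterance()))

handle_audio_stream([b"background noise", b"hey siri set a timer"])
```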