I watched the Microsoft video on Skype’s new Translator feature. According to Microsoft, it is a big leap in voice translation, thanks to the move to deep neural networks for speech recognition. The video shows a German speaker and an English speaker on a Skype call, with Skype translating the speech and synthesizing it in their respective native languages. All very cool and, as the video states, all very Star Trek.
It made me think about the speech recognition technologies I presently use. Anyone who has access to such services has probably at one point shared a laugh with a friend or coworker over the latest voice-to-text faux pas. Who hasn’t heard “Hey, look at what <insert service provider / technology brand name here> thinks you said”? These can be as hilarious as auto-correct mistakes on your cell phone.
I sure hope that the new Translator feature performs more reliably. Can you imagine watching the screen and seeing the reaction of the person you are talking to when Skype translates your “mountain biker and trails” to “mountain biker entrails”? In English you can easily see the mistake, but in German it is not so obvious: “mountainbiker und wanderwegen” becomes “mountainbiker eingeweide”.
What struck me the most in the video was the apparent lack of progress in speech synthesis. It sounds like they are still using the same voice engine that shipped with Windows XP. This alone makes me look slightly askance at the new feature. A sexy new thing like real-time voice translation really deserves a better speech engine than that, something that sounds a bit more lifelike. It really does not sound that much better than the phone-based technology we had in the 1980s. Anyone else remember trying to program a Heathkit HERO robot to talk? No, just me? Well, never mind then. Cool stuff in the ’80s, lame today.