A deep learning audiovisual model from Google could impact voice search, retail and creative production.
Announced on Google's Research Blog, the method can, according to Google, isolate individual voices in a video, distinguishing a speaker's words in the foreground from speech and noise in the background.
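In broad strokes, this is mask-based source separation: the model watches a speaker's face while listening to the soundtrack, then estimates, for each time-frequency bin, how much of the mixture belongs to that speaker. The sketch below illustrates only that general idea; `model` and its `predict_mask` method are hypothetical stand-ins for the trained audiovisual network, not Google's actual pipeline.

```python
# Illustrative sketch of mask-based audiovisual speech separation.
# `model` is a hypothetical trained network, not Google's released code.
import numpy as np
from scipy.signal import stft, istft

def separate_speaker(mixture, face_frames, model, fs=16000):
    """Pull one speaker's voice out of a mixed soundtrack."""
    # Time-frequency representation of the mixed audio.
    _, _, mix_spec = stft(mixture, fs=fs, nperseg=400)

    # The network sees the target speaker's face crops alongside the
    # mixture spectrogram and outputs a soft mask in [0, 1] per bin:
    # close to 1 where that speaker dominates, close to 0 elsewhere.
    mask = model.predict_mask(np.abs(mix_spec), face_frames)

    # Masking scales each bin, suppressing other voices and background
    # noise while keeping the mixture's phase.
    _, clean_audio = istft(mask * mix_spec, fs=fs, nperseg=400)
    return clean_audio
```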
Applied to YouTube, the model could potentially eliminate the need for creators to manually transcribe and caption their content, a common practice that serves both user enjoyment and search-engine optimisation.
Researchers behind the model believe it will have a range of applications, from speech enhancement and recognition in videos to voice search, videoconferencing and improved hearing aids.
"In the near term, this will streamline video production—especially valuable in mobile first-video where lesser speaker quality makes clean mixing critical for comprehension," said Patrick Givens, VP of VaynerSmart at VaynerMedia. "Looking into the future, as we see more consumer attention migrating to audio-first channels this will also ease the burden of audio production."
Advertisers and agencies scrambling to optimize for voice-based search also see promise.
"The tip of the iceberg in big data is the analysis, while data collection is below the surface," said Danish Ayub, CEO of MWM Studioz. "Similarly, with voice-search optimization, the part of the work you don't see is the hours of manpower that go into transcribing the video content to ensure searchability."
Ayub added that the technology could eliminate the need for both transcribers and paid software that converts audio into text.
Nate Shurilla, regional head of innovation at iProspect APAC, believes that the model has far-reaching implications for retail.
"Imagine walking into any fast food joint and just announcing what you would like into the air, sitting down, and having your order brought to you, all while dozens of other customers are doing the same and getting their respective orders," said Shurilla. "That’s a big boost in efficiency." He added that at the same time, the technology would effect surveillance. "I’ll just leave that one to your imagination,” he said.
Shaad Hamid, head of SEO for Southeast Asia at APD, believes that in the short term there will be more use cases for improving live-streaming of events, videoconferencing, hearing aids, virtual assistants and any other application where multiple, simultaneous speakers can compromise audio quality.
"From an advertiser’s perspective, using this technology, we can create videos that target multiple audiences with a single asset, saving time and reducing production costs while speeding up the campaign setup," he said.
For example, Hamid envisioned a property portal being able to tone down or dial up different audio within the same video depending on what the user is observed to be in the market for.
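Mechanically, that kind of per-viewer remix would amount to re-weighting separated audio stems before delivery. The toy illustration below uses made-up stem names and gains, not any real ad-serving API.

```python
import numpy as np

def remix_for_audience(stems, weights):
    """Blend separated audio stems with per-audience gains.

    `stems` maps a track name to a waveform array; `weights` maps the
    same names to gains in [0, 1]. Both are illustrative placeholders.
    """
    tracks = [weights.get(name, 0.0) * audio for name, audio in stems.items()]
    return np.sum(tracks, axis=0)

# E.g. a viewer flagged as apartment-hunting hears the narration about
# floor plans at full volume, with the financing pitch dialled down:
# mix = remix_for_audience({"floor_plans": a, "financing": b},
#                          {"floor_plans": 1.0, "financing": 0.3})
```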
On the other hand, Hamid offered a word of caution. "Since no one’s really seen or heard how this type of ad will look or sound, its actual effectiveness as a technique for advertisers is anybody’s guess," he concluded.