Talking Machines, Watching Machines

On this occasion, Pablo Castiella, one of the founders of Numens and Founding and Managing Partner of baobab soluciones, draws on his own professional experience to tell us about the evolution of voice (and image) recognition systems and how, in just a few years, we went from “answering machines” to “Alexa”, “Siri” or “OK Google”.

I started my career in a company that, at the beginning of the 21st century, was at the forefront of voice recognition systems. It was a time when a machine interpreting the answer to a question such as “What do you want?” seemed like magic. Almost everything on the market was referred to (and hated) by society as an “answering machine”. That company was one of the first in Spain to use natural language processing (NLP). Inside that software, programmed in C++ (other companies used Java), there were Markov chain algorithms, and, to improve the customer experience, it relied on “transparent operators”: people behind the algorithms who listened to the callers’ answers whenever the algorithm’s results were not of sufficient quality and fed a solution back to the system so that the interaction could continue.
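As a rough illustration of the kind of Markov-chain decoding involved, here is a minimal Viterbi sketch in Python. Everything in it (the two candidate words, the acoustic features, the probabilities) is invented for the example; real acoustic models were vastly larger.

```python
# Toy Markov-chain decoding: find the most likely word sequence behind a
# sequence of acoustic features. All states, observations and
# probabilities are invented for illustration.

states = ["yes", "no"]                       # words the caller might say
start_p = {"yes": 0.5, "no": 0.5}
trans_p = {"yes": {"yes": 0.7, "no": 0.3},
           "no":  {"yes": 0.3, "no": 0.7}}
# Probability of observing an acoustic feature given the word.
emit_p = {"yes": {"f1": 0.8, "f2": 0.2},
          "no":  {"f1": 0.1, "f2": 0.9}}

def viterbi(observations):
    """Return the most likely word sequence for a list of features."""
    # best[t][s] = (probability of best path ending in s, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states)
            row[s] = (prob, prev)
        best.append(row)
    # Trace the winning path backwards from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(["f1", "f1", "f2"]))   # -> ['yes', 'yes', 'no']
```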

The systems were prepared to understand answers like this: “I wanted to buy a ticket from Madrid to Salamanca for tomorrow at 8 o’clock”, to which the machine, after interpreting the origin, destination, date and time, answered: “Do you also want the return trip?”
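To make that interpretation step concrete, here is a toy slot-filling sketch in Python. The regular expressions and slot names are mine, purely for illustration; production systems of the era combined hand-written grammars with statistical models.

```python
import re

# Rule-based slot extraction over the example utterance above.
utterance = ("I wanted to buy a ticket from Madrid to Salamanca "
             "for tomorrow at 8 o'clock")

patterns = {
    "origin":      r"from ([A-Z]\w+)",          # capitalised place name
    "destination": r"to ([A-Z]\w+)",
    "date":        r"for (today|tomorrow|\d{1,2}/\d{1,2})",
    "time":        r"at (\d{1,2}(?::\d{2})?)",
}

slots = {name: m.group(1)
         for name, pattern in patterns.items()
         if (m := re.search(pattern, utterance))}
print(slots)
# {'origin': 'Madrid', 'destination': 'Salamanca',
#  'date': 'tomorrow', 'time': '8'}

# With every slot filled, the dialogue can move to the next question.
if all(s in slots for s in patterns):
    print("Do you also want the return trip?")
```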

These machine responses were a concatenation of voice synthesis and pre-recorded natural audio. The synthesis is built from recordings of phonemes, diphones and triphones made by a speaker in a studio, which can then be used to read any text (on YouTube you can find thousands of videos whose audio is synthetic). When the speaker of the synthesis and of the natural audio is the same person, the quality perceived by the user (what today we would all call UX) is very high, to such an extent that some people were not aware they were talking to an “artificial intelligence”, or, what amounts to the same thing, a machine from the beginning of the century.
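A toy sketch of that concatenation idea, assuming a small library of pre-recorded units (here faked as sine tones, since we have no studio recordings to hand; the unit names are illustrative):

```python
import numpy as np

SAMPLE_RATE = 16000

def fake_recording(freq_hz, seconds=0.15):
    """Stand-in for a studio recording: a short sine tone."""
    t = np.linspace(0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
    return 0.3 * np.sin(2 * np.pi * freq_hz * t)

# Unit library: a few "pre-recorded" fragments.
units = {
    "do_you_want": fake_recording(220),
    "the_return_trip": fake_recording(330),
    "question_tone": fake_recording(440),
}

def synthesize(unit_names):
    """Concatenate recorded units into one waveform, crossfading the joins."""
    fade = int(0.01 * SAMPLE_RATE)      # 10 ms crossfade softens each join
    out = units[unit_names[0]].copy()
    for name in unit_names[1:]:
        nxt = units[name]
        ramp = np.linspace(1, 0, fade)
        out[-fade:] = out[-fade:] * ramp + nxt[:fade] * ramp[::-1]
        out = np.concatenate([out, nxt[fade:]])
    return out

wave = synthesize(["do_you_want", "the_return_trip", "question_tone"])
print(f"{wave.size / SAMPLE_RATE:.2f} s of audio")   # -> 0.43 s
```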

Returning to the ingenious point of the transparent operators: they were very useful because the answers the operators gave the system could be used to train the algorithms themselves, improving their results over time as the amount of data grew (what we would nowadays call the training dataset). It was a very ingenious (and economical) way of keeping costs down from the moment the IVR (Interactive Voice Response, as these systems are called in the jargon) was deployed, without having to build a learning model beforehand. It was widely used in banking and telephone sales, environments where operations and interaction with the user are more limited than, for example, in customer service centres.
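The loop might be sketched like this; the threshold, function names and placeholder implementations are illustrative, not taken from any real system:

```python
# "Transparent operator" loop: when the recogniser's confidence is low,
# a human supplies the answer, and that answer is logged as a new
# labelled example for retraining.

CONFIDENCE_THRESHOLD = 0.80
training_examples = []   # grows into tomorrow's training dataset

def recognise(audio):
    """Placeholder for the automatic recogniser: (text, confidence)."""
    return "madrid", 0.55          # pretend the model is unsure

def ask_operator(audio):
    """Placeholder for the transparent operator listening in."""
    return "Madrid"

def handle_turn(audio):
    text, confidence = recognise(audio)
    if confidence < CONFIDENCE_THRESHOLD:
        # Fall back to the human, invisibly to the caller...
        text = ask_operator(audio)
        # ...and keep the corrected pair to retrain the model later.
        training_examples.append((audio, text))
    return text                    # the dialogue continues either way

print(handle_turn(b"...caller audio..."))          # -> Madrid
print(len(training_examples), "new labelled example(s)")
```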

In that first company, I met my first co-worker, who sat at the next desk. He had previously worked at a vehicle number plate recognition company, an industry that was much further advanced than voice recognition and where no human assistance was required: only enough light, and a number plate that was not too dirty or deteriorated. Today such systems are deployed in all the car parks we use and in the vehicles that patrol the streets of large cities to check whether we have paid for regulated parking.
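For a feel of how such plate reading works, here is a minimal sketch with OpenCV and Tesseract OCR (pip install opencv-python pytesseract). It assumes an image file “plate.jpg” already cropped to the plate; real ANPR systems first detect and rectify the plate region.

```python
import cv2
import pytesseract

image = cv2.imread("plate.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Otsu thresholding copes with uneven lighting, echoing the point above
# that these systems mainly need enough light and a clean plate.
_, binary = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# --psm 7 tells Tesseract to treat the image as a single line of text.
text = pytesseract.image_to_string(binary, config="--psm 7")
print(text.strip())
```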

Those early days of the 21st century foreshadowed what was to come, especially as smartphones began to be used on a massive scale and Google moved beyond being a search engine to include other types of applications within its services.

One of them, which partly made me rethink my career, was the voice recognition system included in Android, which was very accurate (to be fair, the audio quality captured by a smartphone microphone is far superior to that of audio travelling over a phone line) and required no human assistance. These systems have improved so much that nowadays it is perfectly possible to give voice instructions, or even dictate text, with very few errors.
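Today that capability is a few lines away for any developer. As an illustration (not Android’s internals), the SpeechRecognition package for Python wraps Google’s free web speech API:

```python
import speech_recognition as sr

# Requires: pip install SpeechRecognition pyaudio, plus a microphone.
recognizer = sr.Recognizer()

with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)   # calibrate to the room
    print("Speak now...")
    audio = recognizer.listen(source)

try:
    print("You said:",
          recognizer.recognize_google(audio, language="es-ES"))
except sr.UnknownValueError:
    print("Could not understand the audio")
```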

As for those cameras that recognise number plates, Google has once again perfected machine vision algorithms to recognise objects, people, text and almost anything else we need from our smartphones.
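A hedged sketch of that kind of object recognition, using a publicly available pretrained model rather than anything of Google’s (pip install torch torchvision pillow; “photo.jpg” stands in for any local image):

```python
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()

preprocess = weights.transforms()        # the resize/normalise pipeline
batch = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    probs = model(batch).softmax(dim=1)

# Print the three most likely ImageNet classes.
top = probs.topk(3)
for p, idx in zip(top.values[0], top.indices[0]):
    print(f"{weights.meta['categories'][int(idx)]}: {p:.1%}")
```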

Google is, of course, not the only company to make use of such techniques, but it is a good example of the evolution of artificial intelligence in the 21st century.

Companies such as those involved in the Numens association aim to help train people in these kinds of techniques (and others also related to advanced analytics) but, above all, in knowing how to apply them to real problems. One of the biggest problems associated with machine learning, and AI in general, is not that technicians do not know the tools, but that they do not know how to interpret the problems or, worse, the solutions found. The purpose of this training is to learn how to correctly approach both the questions and the results, and thus be able to focus on developing natural language processing or computer vision applications for specific sectors. For example, we can talk about applications that use images captured by drones flying over our forests to assess their health and identify species, leaf cover or even diseases, or the use of NLP to identify fake news about cancer. Of course, this is not done with off-the-shelf Google models; it requires custom development and training (and knowing how to interpret) models that give correct results.
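As a minimal sketch of what such a custom NLP model can look like, here is a TF-IDF plus logistic regression classifier. The handful of training sentences is invented; a real project would need thousands of expert-labelled examples and careful evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set, purely for illustration.
texts = [
    "Clinical trial shows modest improvement in survival rates",
    "Peer-reviewed study published in oncology journal",
    "Miracle fruit cures cancer in three days, doctors hate it",
    "Secret remedy suppressed by the pharmaceutical industry",
]
labels = ["reliable", "reliable", "fake", "fake"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression())
model.fit(texts, labels)

print(model.predict(["This miracle remedy cures cancer overnight"]))
# likely -> ['fake']
```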

Pablo Castiella
Managing Partner at baobab soluciones.

