What Is Automated Machine Learning (AutoML) and Is It About to Put Data Scientists out of Work?


By Oliver Tearle

Since Automated Machine Learning (AutoML) tools such as Google’s AutoML launched, experts have been exploring whether they are ready for full enterprise integration and application. AutoML tools claim to let anyone take on the role of a ‘citizen data scientist’, capable of producing production-ready machine learning models without the technical background traditionally required.

While it’s certainly true that automated machine learning is changing the way businesses perform data analysis tasks, the technology is not yet ready to put data scientists out of work. One of its central claims is that auto-generated models match the quality of equivalent models built manually by a team of data scientists, and are produced in a fraction of the time.

While AutoML models are quicker to produce, they are only effective when the problem they are solving is fixed and repetitive. Most AutoML models perform well and achieve consistent quality in this setting, but the more challenging the data problem, the more a data scientist has to intervene to take what the AutoML system has started and turn it into something usable. To understand some of these limitations, let’s look at the AutoML process in more detail.

AutoML tools streamline the data science process, doing the best they can with the information they have available. There are three main stages to the process:

The first phase involves information ‘mining’ to improve the performance of the generated models by creating more information for them to learn from. This is very time-consuming to do manually, as the data scientist needs to uncover relationships between the data elements and devise ways of turning that insight into additional data fields for the machine to pick up on during training.

This is an important phase, as this additional data very often means the difference between an unsuitable model and an excellent one. AutoML is programmed to try a limited range of data discovery techniques, usually in a way that caters to the ‘average’ data problem. This limits the eventual performance of the model, because AutoML cannot draw on the subject matter expert (SME) knowledge that can be essential to success, something a data scientist can use to their advantage.
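As a rough illustration of the kind of derived field described above, the sketch below adds one hand-crafted feature: each transaction amount relative to that customer's average spend. The field names (`customer`, `amount`, `amount_vs_avg`) and the choice of ratio are hypothetical, and the example assumes every customer has a positive average amount. It is not taken from any particular AutoML tool.

```python
from collections import defaultdict
from statistics import mean


def add_amount_ratio(transactions):
    """Return copies of the rows with an extra 'amount_vs_avg' field."""
    # Collect every amount seen for each customer.
    amounts_by_customer = defaultdict(list)
    for row in transactions:
        amounts_by_customer[row["customer"]].append(row["amount"])

    # Average spend per customer: a relationship across rows, not a raw column.
    avg_by_customer = {c: mean(a) for c, a in amounts_by_customer.items()}

    # Attach the derived feature to a copy of each row.
    return [
        {**row, "amount_vs_avg": row["amount"] / avg_by_customer[row["customer"]]}
        for row in transactions
    ]


# Hypothetical toy data.
transactions = [
    {"customer": "A", "amount": 20.0},
    {"customer": "A", "amount": 30.0},
    {"customer": "A", "amount": 250.0},
    {"customer": "B", "amount": 500.0},
    {"customer": "B", "amount": 700.0},
]

enriched = add_amount_ratio(transactions)
for row in enriched:
    print(row)
```

In this toy data, customer A's third payment stands out far more through the ratio than through the raw amount, which is smaller than either of customer B's payments. Choosing a ratio like this usually relies on domain knowledge about what counts as unusual.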

Many data science problems start with significant manual effort going into selecting the data to present to an algorithm. Throwing all the data you have at the system will result in a sub-par model, as there are usually many different, often conflicting signals in the data, which need to be targeted and modelled individually.

This is especially true of fraud, where different geographical regions, payment channels and so on have vastly differing fraud problems. The manual effort to discover these patterns and design appropriate data sets for accurate detection is still largely unautomated. A multipurpose automated approach to this problem is not currently possible because of the enormous complexity of such an undertaking.
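To make the point about conflicting signals concrete, here is a minimal sketch that compares the pooled fraud rate with the rate in each region and channel segment. The segment names, the `is_fraud` field and the toy data are all invented for illustration.

```python
from collections import defaultdict


def fraud_rate_by_segment(transactions, keys=("region", "channel")):
    """Map each segment (a tuple of key values) to its observed fraud rate."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for row in transactions:
        segment = tuple(row[k] for k in keys)
        totals[segment] += 1
        frauds[segment] += row["is_fraud"]
    return {segment: frauds[segment] / totals[segment] for segment in totals}


def make_rows(region, channel, labels):
    """Build toy transactions for one segment (hypothetical data)."""
    return [{"region": region, "channel": channel, "is_fraud": y} for y in labels]


transactions = (
    make_rows("EU", "online", [1, 1, 0, 0])
    + make_rows("EU", "in_store", [0, 0, 0, 0])
    + make_rows("US", "online", [1, 0, 0, 0])
)

rates = fraud_rate_by_segment(transactions)
pooled_rate = sum(row["is_fraud"] for row in transactions) / len(transactions)

print("pooled:", pooled_rate)
for segment, rate in sorted(rates.items()):
    print(segment, rate)
```

The pooled figure hides the fact that one segment has twice the average fraud rate and another has none, which is the kind of pattern that would suggest building separate data sets or models per segment.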

The next phase is model generation. Models with various configurations are created and trained using the data from the previous stage. This is critical as it is almost impossible to use a default configuration for every problem and get the best results.

AutoML has the edge over data scientists here, as it is capable of producing an enormous number of test models in a very short period of time. However, the majority of AutoML systems aim to be general purpose and only produce deep neural networks, which can be overkill for many problems. A simpler model, such as logistic regression or a decision tree, may be more suitable, and would still benefit from hyperparameter optimisation.
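The idea of generating many configurations of a simple model can be sketched as a small grid search. The example below is an assumption-laden toy, not how any specific AutoML product works: it fits a one-feature logistic regression by gradient descent for every combination of two hyperparameters (`learning_rate` and an L2 penalty `l2`, both chosen here for illustration) and keeps the configuration with the lowest validation log loss.

```python
import math
from itertools import product


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate, l2, epochs=200):
    """Fit a one-feature logistic regression by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + l2 * w
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def log_loss(model, xs, ys, eps=1e-12):
    """Mean negative log-likelihood; lower is better."""
    w, b = model
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(1.0 - eps, max(eps, sigmoid(w * x + b)))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def grid_search(train, valid, grid):
    """Train one model per configuration and return (best_loss, best_config, results)."""
    results = []
    for learning_rate, l2 in product(grid["learning_rate"], grid["l2"]):
        model = train_logistic(*train, learning_rate=learning_rate, l2=l2)
        config = {"learning_rate": learning_rate, "l2": l2}
        results.append((log_loss(model, *valid), config))
    best_loss, best_config = min(results, key=lambda r: r[0])
    return best_loss, best_config, results


# Hypothetical toy data: (features, labels) for training and validation.
train = ([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0], [0, 0, 0, 1, 0, 1, 1, 1])
valid = ([-1.8, -0.8, -0.3, 0.3, 0.8, 1.8], [0, 0, 1, 0, 1, 1])
grid = {"learning_rate": [0.01, 0.1, 1.0], "l2": [0.0, 0.1, 1.0]}

best_loss, best_config, results = grid_search(train, valid, grid)
print(best_config, round(best_loss, 4))
```

A real system would search far larger spaces and use smarter strategies than an exhaustive grid, but the loop has the same shape: propose a configuration, train, score on held-out data, keep the best.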

The final phase is bulk performance testing and selection of the best performer. It is at this stage that some manual input is required, not least because it is imperative that the user selects the right model for the task. It’s no use having a fraud risk model that detects 100% of fraud but challenges every authorisation.
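The trade-off between detection and challenged authorisations can be expressed as a selection rule. The sketch below assumes a model that outputs a fraud score per transaction and a business limit on the share of transactions that may be challenged. The scores, labels and the 30% cap are invented for illustration.

```python
def detection_and_challenge_rates(scores, labels, threshold):
    """Return (share of fraud caught, share of all transactions challenged)."""
    flagged = [score >= threshold for score in scores]
    fraud_total = sum(labels)
    caught = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    return caught / fraud_total, sum(flagged) / len(scores)


def pick_threshold(scores, labels, max_challenge_rate):
    """Pick the threshold with the highest detection rate within the challenge cap.

    Returns (threshold, detection_rate, challenge_rate), or None if no
    candidate threshold respects the cap.
    """
    best = None
    for threshold in sorted(set(scores)):
        detection, challenged = detection_and_challenge_rates(scores, labels, threshold)
        if challenged <= max_challenge_rate and (best is None or detection > best[1]):
            best = (threshold, detection, challenged)
    return best


# Hypothetical model scores and true labels (1 = fraud).
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]

# Challenging everything catches all fraud but is useless in practice.
print(detection_and_challenge_rates(scores, labels, threshold=0.0))  # (1.0, 1.0)

# With at most 30% of transactions challenged, a different operating point wins.
print(pick_threshold(scores, labels, max_challenge_rate=0.3))  # (0.8, 0.75, 0.3)
```

The cap itself is a business decision, which is why this stage still needs a person who understands the cost of a missed fraud against the cost of a challenged customer.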

In the current manual process, data scientists work with SMEs to understand the data and develop effective descriptive data features. This essential link between SME and data scientist is missing from general purpose AutoML. As described earlier, the tool attempts to generate models automatically from whatever it can discover in the data, which may not be appropriate and can lead to poorly performing models. Future AutoML systems should be designed around this, and other constraints, in order to produce models of the standard a data scientist would achieve.

The future of AutoML

AutoML continues to be developed, and there have been some large improvements driven by the main current AutoML providers: Google and Microsoft. Those developments have focused mainly on improving the speed of generating production-ready models, rather than exploring how the technology can be improved for more difficult problems (fraud and network intrusion detection, for example), where AutoML can only go so far before data scientist input is required.

As AutoML solutions continue to develop and expand, more complex manual processes will become possible to automate. Current AutoML systems work extremely well on image and speech processing because the SME knowledge needed for these tasks is already embedded within the tools. Future AutoML systems will allow business users to input their own knowledge to help the machine generate very accurate models automatically.

On top of this, complex data science pipelines will become increasingly streamlined, and adding a larger variety of algorithms to optimise will further expand the range of problems citizen data scientists are able to tackle.

Although many data science tasks will become automated, this will free data scientists to perform bespoke tasks for the business, further driving innovation and allowing the business to focus on more important revenue generation and growth activities.