Everything you need to know about AutoML!

AutoML — A survey of its evolution and State-of-the-Art

Arpithagurumurthy
Nov 26, 2021

Tired of figuring out which model to use for your use case? Tired of feature selection and hyperparameter tuning? Worry no more, and welcome to the world of AutoML! AutoML integrates automation and Machine Learning and covers the entire pipeline, from processing the raw dataset to spitting out the best-performing model for your data.

Fig 1: https://datatechvibe.com/ai/why-does-your-business-need-automl/

Deep Learning (DL) methods have proven to perform remarkably well on image classification, object detection, and language modeling tasks. So, what's stopping DL techniques from being more widely employed? The sheer development cost: building good models relies heavily on human expertise, and even experts need large amounts of time and compute to produce high-quality models. This brings us to AutoML, which removes the need for a human in the loop by automating the entire ML pipeline.

AutoML is the process of automating the construction of an ML pipeline on a limited budget, enabling even newcomers to the field to build ML applications.

A complete AutoML system is a beginner-friendly end-to-end ML pipeline. It consists of four main processes including data preparation, feature engineering, model generation, and model evaluation.

Fig 2: An overview of the AutoML pipeline

Model generation is further divided into search space and optimization methods.

  • The search space defines the design principles of ML models, which are again divided into traditional ML models and neural architectures.
  • The optimization methods are classified into hyperparameter optimization (HPO) which indicates training-related parameters (e.g., the learning rate and batch size), and architecture optimization (AO) which indicates the model-related parameters (e.g., the number of layers for neural architectures and the number of neighbors for KNN).

Another important concept in the AutoML pipeline is Neural Architecture Search (NAS), which aims to search for a robust neural architecture. It consists of three important components: the search space of neural architectures, AO methods, and model evaluation methods. We’ll discuss more on NAS later in the article.

This article provides an overview of state-of-the-art (SOTA) in AutoML. We will learn about the different AutoML processes including data preparation, feature engineering, hyperparameter optimization, and neural architecture search (NAS) as shown in Fig 2. We will also understand the NAS algorithms’ components and their performance on the CIFAR-10 dataset. The goal is to enable even beginners to get introduced to AutoML and learn how it can help in building a better future.

Stage 1: Data Preparation

Data Preparation is a building block of an ML pipeline. The three main aspects are data collection, data cleaning, and data augmentation.

Fig 3: Flowchart for data preparation

1. Data Collection

This step involves building a new dataset or enhancing an existing one. When ML was in its infancy, MNIST, a dataset of handwritten digits, was built. Later, bigger datasets like CIFAR-10, CIFAR-100, and ImageNet were developed. The most important part of building an ML system is having the right dataset, which is challenging because it requires significant resources. Data searching and data synthesis were introduced to address this problem.

  • Data Searching: The most common source of data is the web, and searching it is hard given how inexhaustible it is. Additionally, web results may not exactly match the search query, and web data may be unlabeled or incorrectly labeled. Active learning and semi-supervised learning methods can be employed to deal with incorrect labels. Dataset imbalance is another problem with web data; the Synthetic Minority Oversampling Technique (SMOTE) can help alleviate it by creating new minority-class samples (a minimal sketch follows this list).
  • Data Synthesis: Data simulators can be used to mimic real-world data during the research phase. For example, they are extremely beneficial for autonomous driving given the safety hazards of collecting data on real roads. A toolkit called OpenAI Gym is useful for developing and comparing different algorithms; with its help, developers can focus their energy on writing the algorithm instead of worrying about generating data. Generative Adversarial Networks (GANs) are another very useful technique for generating image, tabular, and text data.
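The snippet below is a minimal sketch of rebalancing an imbalanced dataset with SMOTE. It assumes the imbalanced-learn package, and the toy dataset is purely illustrative; it is not part of the original article.

```python
# Minimal sketch: balancing an imbalanced dataset with SMOTE.
# Assumes the imbalanced-learn package; the toy data is illustrative only.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: roughly 95% majority class, 5% minority class.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))

# SMOTE synthesizes new minority samples by interpolating between
# existing minority points and their nearest neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After: ", Counter(y_res))
```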

2. Data Cleaning

Data that is collected from the web or any other source is bound to have noise, and noise can negatively affect what the model learns. Hence data cleaning is a critical step in data preparation. Traditionally this process was carried out through crowdsourcing and required expert knowledge. This was very expensive as access to specialists was limited. Hence the shift from crowdsourcing to automation became necessary. Many methods like BoostClean and AlphaClean were introduced to automate data cleaning on static data. As for dynamic data, meta-learning techniques were applied.

3. Data Augmentation

This process can be considered an extension of data collection, as it involves generating new samples based on existing data. Data Augmentation (DA) also helps the model generalize and avoid overfitting, since it generates more data for training; this enhances robustness and, in turn, performance.

Fig 4: Classification of data augmentation techniques

Let us learn about DA techniques available for different modes of data.

For image data,

  • Affine transformations include rotation, scaling, random cropping, and reflection
  • Elastic transformations include contrast shift, brightness shift, blurring, and channel shuffle
  • Advanced transformations involve random erasing, image blending, cutout, and mix-up among others
  • Neural-based transformations include adversarial noise, neural style transfer, and GAN techniques

For textual data, augmentation can be performed using synonym insertion, paraphrasing/summarization, or translating the text into a foreign language and then back into the original language, to name a few techniques.

AutoAugment is an effort to automate the process of DA using Reinforcement Learning. Since it requires extensive GPU hours for augmentation searches, many optimized approaches have been proposed, such as gradient-descent-based, Bayesian-optimization-based, online hyperparameter learning, greedy-search, and random-search methods, among others.
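To make the image transformations above concrete, here is a minimal sketch of a hand-designed augmentation pipeline. It assumes torchvision (not mentioned in the article; any augmentation library would do), and the exact operations and magnitudes are arbitrary choices for illustration.

```python
# Minimal sketch of a hand-designed image augmentation pipeline (torchvision assumed).
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(32),                       # random cropping + scaling (affine)
    T.RandomHorizontalFlip(),                      # reflection
    T.RandomRotation(degrees=15),                  # rotation
    T.ColorJitter(brightness=0.2, contrast=0.2),   # brightness/contrast shift (elastic)
    T.ToTensor(),
    T.RandomErasing(p=0.25),                       # advanced: random erasing / cutout-style
])
```

AutoAugment-style methods search over the choice, order, and magnitude of such operations instead of fixing them by hand as above.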

Stage 2: Feature Engineering

It is a process of extracting features from raw data that the models can understand. The three main aspects of feature engineering include feature selection, feature extraction, and feature construction. Feature extraction and selection involve reducing the dimensionality/number of features, while feature construction deals with increasing this dimensionality. Automatic feature engineering involves the combination of these three processes.

1. Feature Selection

The main purpose of feature selection is to eliminate redundant and irrelevant features and thereby reduce feature dimensions. This helps in simplifying the model complexity while also increasing its robustness and performance. This process involves four main steps as shown in Fig 5.

Fig 5: Flowchart for feature selection

Generation (Search Strategy):

  • The three types of algorithms used in search strategy are complete search, heuristic search, and random search.
  • The heuristic search involves Sequential Forward Selection (SFS), Sequential Backward Selection (SBS), and Bidirectional search (BS).
  • As the names suggest, SFS and SBS add or remove features one at a time, respectively. BS performs both SFS and SBS until they result in the same subset (sketched in code after this list).
  • As for random search methods, the most commonly used are Simulated Annealing (SA) and Genetic Algorithms (GAs).
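Below is a minimal sketch of sequential forward and backward selection using scikit-learn's SequentialFeatureSelector. The dataset and the KNN estimator are placeholders chosen for illustration, not prescribed by the article.

```python
# Minimal sketch of SFS and SBS with scikit-learn (dataset and estimator are placeholders).
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)          # 13 candidate features
knn = KNeighborsClassifier(n_neighbors=5)

# SFS: start from an empty set and greedily add the most helpful feature each step.
sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="forward").fit(X, y)

# SBS: start from all features and greedily drop the least helpful feature each step.
sbs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="backward").fit(X, y)

print("SFS kept features:", sfs.get_support(indices=True))
print("SBS kept features:", sbs.get_support(indices=True))
```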

Subset Evaluation:

It consists of three categories explained as follows:

  • The first category is the filter method, which assigns a score to each feature based on its relevance to the target; variance, the correlation coefficient, the chi-square test, and mutual information are commonly used scoring methods. Features are then selected against a particular threshold (a scoring sketch follows this list).
  • The second is the wrapper method, which evaluates candidate subsets by the classification accuracy of a model trained on them.
  • The last category is the embedded method, where feature selection happens as part of model training. Regularization and decision trees are examples.
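The snippet below is a minimal sketch of the filter approach: score each feature against the target and keep the top-k. The dataset, k, and scoring functions are placeholder choices for illustration.

```python
# Minimal sketch of filter-based feature selection (placeholders throughout).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Chi-square test between each (non-negative) feature and the class label.
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Mutual information also captures non-linear dependence on the target.
X_mi = SelectKBest(score_func=mutual_info_classif, k=2).fit_transform(X, y)

print(X_chi2.shape, X_mi.shape)  # (150, 2) (150, 2)
```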

2. Feature Construction

This process involves creating new features from raw data, thereby expanding the feature space. This in turn helps increase model generalizability and performance. Preprocessing transformations such as standardization, normalization, and feature discretization are among the most used building blocks for feature construction (a minimal sketch follows the list below).

  • For Boolean features, transformation operations like conjunctions, disjunctions, and negation are used.
  • For numerical features, operations like minimum, maximum, addition, subtraction, and mean are used.
  • Cartesian product is commonly used for nominal features.
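Here is a minimal sketch of some of these transformations on a tiny numerical table. The data and column choices are illustrative, and PolynomialFeatures is used only as one convenient way to build simple numerical combinations.

```python
# Minimal sketch of preprocessing and simple numerical feature construction.
import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   KBinsDiscretizer, PolynomialFeatures)

X = np.array([[1.0, 200.0],
              [2.0, 150.0],
              [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)     # standardization (zero mean, unit variance)
X_norm = MinMaxScaler().fit_transform(X)      # normalization to [0, 1]
X_bins = KBinsDiscretizer(n_bins=3, encode="ordinal",
                          strategy="uniform").fit_transform(X)  # discretization

# Simple numerical constructions: squares and pairwise products of raw features.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X_poly.shape)  # (3, 5): x1, x2, x1^2, x1*x2, x2^2
```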

It is extremely tedious to manually figure out which operations are most suitable for different use cases, so many methods have been proposed to automate the search for operation combinations. Decision-tree-based methods, genetic algorithms, and annotation-based approaches are some of these automated methods. Feature selection techniques are then applied to evaluate the effectiveness of the newly constructed features.

3. Feature Extraction

The essence of feature extraction is dimensionality reduction. Unlike feature selection, original features are allowed to be altered during this process.

The most popular approaches for automating this process are Principal Component Analysis (PCA), Independent Component Analysis (ICA), Isomap, non-linear dimensionality reduction, Linear Discriminant Analysis (LDA), and feed-forward neural network-based approaches.
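As a quick illustration, the sketch below projects a 64-feature dataset onto 10 principal components with PCA; the dataset and component count are placeholders.

```python
# Minimal sketch of feature extraction with PCA (dataset is a placeholder).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 64 raw pixel features per image
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)         # 10 derived features

print(X.shape, "->", X_reduced.shape)    # (1797, 64) -> (1797, 10)
print("Explained variance:", pca.explained_variance_ratio_.sum())
```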

Stage 3: Model Generation

The two broad categories of models are traditional ML models and Deep Neural Networks (DNNs). Model generation is mainly divided into the search space and the optimization methods. The search space defines the structures of the models that can be designed, and the optimization methods tune two kinds of parameters: training hyperparameters, such as the learning rate, and model-design hyperparameters, such as the filter size and the number of layers of a DNN.

Fig 6: Overview of NAS pipeline

1. Search Space:

The search space essentially defines the design principles of the various neural network architectures. It is the space within which the AO methods can operate and explore. The commonly used search spaces are as follows:

Fig 7: Entire-structured search space

Entire-structured Search Space:

This is the most straightforward search space. Figure 7 shows the simplest structure on the left and a slightly more complex structure with skip connections on the right. Models are built by stacking a predetermined number of layers, each performing a specific operation. The problem with models in this search space is that they lack transferability: an architecture searched on a small dataset cannot easily be reused on a larger one.
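A minimal, pure-Python sketch of what "an architecture in an entire-structured search space" can mean: a fixed-length list of per-layer operation choices, optionally with skip connections. The operation set and depth below are illustrative assumptions, not the article's.

```python
# Minimal sketch: sampling from an entire-structured search space.
import random

OPS = ["conv3x3", "conv5x5", "maxpool3x3", "identity"]
NUM_LAYERS = 8

def sample_architecture():
    layers = [random.choice(OPS) for _ in range(NUM_LAYERS)]
    # For each layer, optionally add a skip connection from an earlier layer.
    skips = [random.choice([None] + list(range(i))) for i in range(NUM_LAYERS)]
    return {"layers": layers, "skips": skips}

print(sample_architecture())
```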

Cell-based Search Space:

Fig 8: Cell-based Search Space

To resolve the transferability issue of entire-structured architectures, the cell-based search space was proposed. These architectures are made up of a fixed set of repeating blocks, also referred to as cells or motifs. The problem of searching for a neural architecture is thereby reduced to searching for an ideal cell structure for the use case. Cells are further divided into normal and reduction cells: the output of a normal cell has the same dimensions as its input, while the output of a reduction cell has its height and width halved and its channel count doubled. Figure 8 shows an example of a cell-based architecture with three similar blocks, each with n normal cells and one reduction cell.

Fig 9: Hierarchical Search Space

Hierarchical Search Space:

While cell-based approaches provide transferability, they emphasize cell-level operations and ignore network-level features. The hierarchical search space addresses this by also searching over the network-level structure. Figure 9 shows a network-level search space, where the grey and orange points represent the entire search space and the arrows represent the selected network structure; "d" and "L" indicate the downsampling rate and the layer, respectively.

Fig 10: Morphism-based Search Space

Morphism-based Search Space:

It deals with designing new neural architectures from existing ones by inserting identity morphism (IdMorph) transformations between layers. These transformations preserve the network's function while allowing it to become deeper or wider. Figure 10 shows depth-IdMorph and width-IdMorph transformations applied to a model.

2. Architecture Optimization:

The second step in the model generation process is architecture optimization which defines the path to search for the best performing model architecture. Model architecture selection is a labor-intensive task as it requires extensive hyperparameter tuning, human expertise, and several resources. AO methods are extremely beneficial since they automate this entire process. Some of the commonly used AO methods are:

Evolutionary Algorithm (EA): This approach is inspired by the biological evolution process. It can solve complex problems efficiently since it performs a global optimization. It has four steps (a minimal sketch in code follows the list):

Fig 11: Flowchart for an evolutionary algorithm
  • Selection: This step focuses on selecting a subset of the generated architectures in the search space to perform crossover. The goal is to retain strong architectures while discarding the weak ones.
  • Crossover: From the subset of networks resulting from selection, pairs of networks are combined to form offspring networks. Each offspring inherits the strengths of its parent networks.
  • Mutation: Mutation refers to applying a set of predefined operations such as altering the learning rate and removing skip connections. This promotes exploration and offers diversity to the architecture.
  • Update: Now that we have many new and robust networks, the low-performing and old networks are removed periodically leaving behind only robust architectures.
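The sketch below walks through this loop in pure Python. The architecture encoding, the fitness function, and the population sizes are placeholders; in a real NAS run, fitness would come from training each candidate network and measuring validation accuracy.

```python
# Minimal sketch of an evolutionary search loop (placeholders throughout).
import random

OPS = ["conv3x3", "conv5x5", "maxpool3x3", "identity"]

def random_arch(n_layers=6):
    return [random.choice(OPS) for _ in range(n_layers)]

def fitness(arch):
    # Placeholder: stands in for validation accuracy after training.
    return sum(op != "identity" for op in arch) + random.random()

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(arch, rate=0.2):
    return [random.choice(OPS) if random.random() < rate else op for op in arch]

population = [random_arch() for _ in range(20)]
for generation in range(10):
    # Selection: keep the strongest half of the population.
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    # Crossover + mutation: breed offspring from random parent pairs.
    offspring = [mutate(crossover(*random.sample(parents, 2))) for _ in range(10)]
    # Update: the weakest architectures are dropped in favor of the offspring.
    population = parents + offspring

print(max(population, key=fitness))
```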

Reinforcement Learning (RL)

Typically in RL-based NAS, an RNN acts as the agent. The agent performs an action at each time step to explore a new neural architecture from the search space. It then receives a new observation and a reward to update its exploration strategy to find the most optimal architecture.
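As an illustration of the idea (not of any specific RL-based NAS system), the NumPy sketch below uses a very simple "controller": per-layer logits over candidate operations, updated with a REINFORCE-style policy gradient towards choices that earned a high reward. The reward function is a placeholder for trained-model validation accuracy, and real systems use an RNN controller instead of independent logits.

```python
# Minimal sketch of the RL-based NAS loop (REINFORCE-style, placeholders throughout).
import numpy as np

rng = np.random.default_rng(0)
OPS = ["conv3x3", "conv5x5", "maxpool3x3", "identity"]
NUM_LAYERS, LR = 4, 0.1
logits = np.zeros((NUM_LAYERS, len(OPS)))          # "controller" parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(choice):
    # Placeholder reward: pretend conv3x3 everywhere is the best architecture.
    return np.mean(choice == 0)

baseline = 0.0
for step in range(200):
    probs = np.array([softmax(row) for row in logits])
    choice = np.array([rng.choice(len(OPS), p=p) for p in probs])  # sample an architecture
    r = reward(choice)                                             # "train and evaluate" it
    baseline = 0.9 * baseline + 0.1 * r                            # moving-average baseline
    for i, a in enumerate(choice):                                 # policy-gradient update
        grad = -probs[i]
        grad[a] += 1.0
        logits[i] += LR * (r - baseline) * grad

print([OPS[i] for i in logits.argmax(axis=1)])  # most likely architecture after search
```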

Gradient Descent (GD)

The EA and RL approaches explore a discrete search space to find neural architectures. Gradient-descent-based approaches were the first to relax the search space into a continuous, differentiable one. DARTS pioneered this direction by converting the search into a joint optimization of the architecture parameters and the network weights.
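The PyTorch sketch below shows the core idea behind this continuous relaxation: each edge computes a softmax-weighted mixture of candidate operations, so the architecture parameters (the alphas) become continuous and can receive gradients alongside the ordinary weights. The shapes and operation set are illustrative assumptions, not DARTS's exact configuration.

```python
# Minimal sketch of a softmax-weighted "mixed operation" (DARTS-style relaxation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # Architecture parameters: one logit per candidate operation.
        self.alphas = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alphas, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

mixed = MixedOp(channels=16)
x = torch.randn(2, 16, 32, 32)
out = mixed(x)      # differentiable w.r.t. both the weights and the alphas
print(out.shape, F.softmax(mixed.alphas, dim=0))
```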

Surrogate Model-based Optimization (SMBO)

SMBO speeds up architecture search by first building a surrogate model of the objective function from architectures that have already been evaluated. It then uses this surrogate to predict which candidate architectures are most promising, which shortens the search time.
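A minimal sketch of that loop, assuming a random-forest surrogate and a toy objective that stands in for "train the architecture and return its validation accuracy":

```python
# Minimal sketch of surrogate model-based optimization (placeholders throughout).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def expensive_objective(config):
    # Placeholder: peak score when both encoded choices are near 0.5.
    return 1.0 - np.sum((config - 0.5) ** 2)

# A few configurations already evaluated "for real", each encoded as a numeric vector.
history_X = rng.random((8, 2))
history_y = np.array([expensive_objective(c) for c in history_X])

for _ in range(10):
    surrogate = RandomForestRegressor(n_estimators=100).fit(history_X, history_y)
    candidates = rng.random((200, 2))                 # cheap to generate
    best = candidates[surrogate.predict(candidates).argmax()]
    history_X = np.vstack([history_X, best])          # only the best guess is evaluated
    history_y = np.append(history_y, expensive_objective(best))

print("Best configuration found:", history_X[history_y.argmax()])
```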

Hybrid Optimization Method

All previously mentioned approaches have their strengths and weaknesses. Hybrid optimization approaches were introduced to combine different methods and capture their strengths, resulting in a better search strategy. Some examples include EA+RL, which integrates RL-based mutations into EA; EA+GD, which combines EA and GD-based methods; EA+SMBO, which uses a random forest as the surrogate; and GD+SMBO, which uses a variational autoencoder as the surrogate.

3. Hyperparameter Optimization

Once we have the best architecture from the AO stage, it becomes essential to fine-tune this network with different hyperparameter values and find the most suitable ones. Some methods are:

Grid and Random Search (GS and RS): GS divides the search space into equal intervals, evaluates every resulting combination, and selects the best-performing hyperparameter values. RS instead evaluates a subset of randomly drawn configurations and picks the best among them.
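A minimal sketch of both, using scikit-learn; the model, hyperparameter ranges, and dataset are placeholders for illustration.

```python
# Minimal sketch of grid search vs. random search (placeholders throughout).
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Grid search: exhaustively evaluate every combination on the grid.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2]}, cv=3).fit(X, y)

# Random search: evaluate a fixed budget of randomly drawn configurations.
rand = RandomizedSearchCV(SVC(), {"C": loguniform(1e-2, 1e2),
                                  "gamma": loguniform(1e-4, 1e-1)},
                          n_iter=10, cv=3, random_state=0).fit(X, y)

print("Grid best:  ", grid.best_params_)
print("Random best:", rand.best_params_)
```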

Bayesian Optimization (BO): BO follows the SMBO approach and builds a surrogate model that maps hyperparameter configurations to their performance on the validation set. It balances exploration and exploitation: exploration means trying a variety of hyperparameters, while exploitation means spending more of the budget on further evaluating the promising ones.
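One possible sketch of BO with a Gaussian-process surrogate uses scikit-optimize (an assumption; the article does not prescribe a library), with a placeholder objective standing in for "train with these hyperparameters and return the validation error":

```python
# Minimal sketch of Bayesian optimization over two hyperparameters (scikit-optimize assumed).
import math
from skopt import gp_minimize
from skopt.space import Real, Integer

def objective(params):
    lr, batch_exp = params
    # Placeholder validation error: lowest near lr = 1e-2 and batch size 2**6 = 64.
    return abs(math.log10(lr) + 2.0) + abs(batch_exp - 6)

result = gp_minimize(
    objective,
    [Real(1e-4, 1e-1, prior="log-uniform"),   # learning rate
     Integer(4, 8)],                          # log2(batch size)
    n_calls=20, random_state=0)

print("Best hyperparameters:", result.x, "error:", result.fun)
```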

Gradient-based Optimization (GO): GO methods use gradient information to finetune and evaluate the hyperparameters to find the most optimal values.

Stage 4: Model Evaluation

After the new model has been selected, we need to evaluate its performance. Below are some algorithms that can accelerate the process of model evaluation:

  • Low fidelity: Training time grows with the size of the dataset and the model, so evaluation can be accelerated by smartly shrinking either. For image data, we can reduce the number of images or their resolution during evaluation, or we can evaluate a down-scaled version of the model and compare its performance.
  • Weight Sharing: Weight sharing can tremendously reduce the time required for NAS. The core idea is to reuse the weights (knowledge) learned on previous candidates to speed up the training of the current candidate instead of training it from scratch.
  • Early stopping: Early stopping was first introduced to avoid overfitting during training. It also accelerates model evaluation by terminating evaluations whose performance on the validation data has stopped improving (a minimal sketch follows this list).
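A minimal, pure-Python sketch of patience-based early stopping during candidate evaluation; train_one_epoch and validate are placeholders for the real training and validation routines.

```python
# Minimal sketch of patience-based early stopping (placeholders throughout).
import random

def train_one_epoch(model):
    # Placeholder: one pass over the training data.
    model["epochs"] = model.get("epochs", 0) + 1

def validate(model):
    # Placeholder: validation accuracy that improves, then plateaus with noise.
    return min(0.9, 0.1 * model["epochs"]) + random.uniform(-0.01, 0.01)

def evaluate_candidate(model, patience=5, max_epochs=100):
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        score = validate(model)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break                     # no improvement for `patience` epochs: stop early
    return best_score

print(evaluate_candidate({}))
```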

NAS Performance Comparison

Different NAS studies have proposed different neural architectures for different scenarios. For a fair comparison, accuracy and algorithm efficiency are used as comparison indices. Efficiency is measured in GPU Days, defined as:

GPU Days = N × D,
where N represents the number of GPUs and D represents the actual number of days spent searching. For example, a search that occupies 4 GPUs for 7 days costs 28 GPU Days.

The table below summarizes the performance of different NAS approaches on the CIFAR-10 dataset (CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class).

Fig 12: Performance of different NAS algorithms on CIFAR-10

We can draw the following observations from the table:

  • The early research on EA- and RL-based NAS methods focused more on high performance rather than on resource consumption.
  • The later research on NAS attempted to improve search efficiency by not compromising on model performance.
  • For example, EENA utilized the mutation and crossover operations of EA which reused the learned information in the evolution process thereby improving the efficiency of EA-based NAS methods.
  • Similarly, ENAS was the first RL-based NAS approach to adopt a parameter-sharing strategy, reducing the GPU budget to a single GPU and the search time to less than a day.
  • Gradient descent-based approaches substantially reduced the resource consumption for searching and achieved SOTA results.

Open Problems and Future Directions

Let us now see the problems associated with existing AutoML methods and suggest directions for further research:

  • The existing search spaces discussed in this article have proven to be efficient but still require some degree of human intervention, which introduces bias. Designing a search space free from human bias could generate novel model architectures and would be a huge milestone for AutoML.
  • When it comes to NLP tasks, the NAS community is yet to produce models that can match up to human-designed ones. This is surely an area for improvement and requires more research.
  • Another important direction for further research is to learn how to interpret the results of AutoML mathematically.
  • Reproducibility is an ever-challenging problem in ML, and it carries over to AutoML. The amount of resources consumed by an AutoML run is very high, and more research is required to develop optimized strategies.
  • The datasets used in AutoML research (CIFAR-10 and ImageNet) are mostly labeled and well-structured, but real-world data is usually messy and often unlabeled, which can hurt AutoML performance. This highlights the need to build AutoML systems that can still produce robust architectures on such data.

We've walked through the stages of the AutoML pipeline in this article, from data preparation to model evaluation, and shed some light on the less explored areas that might need attention. With that said, it is worth noting that AutoML is a powerful invention that allows even beginners in the field of ML to build and deploy AI applications.

AutoML marks a huge milestone in taking the world towards self-improving AI.

Professor Vijay Eranti — Thank you so much for your guidance.

Also a big thank you to the authors of “AutoML: A Survey of the State-of-the-Art” for the details.

References

He, X., Zhao, K., and Chu, X. AutoML: A Survey of the State-of-the-Art. https://arxiv.org/pdf/1908.00709.pdf
