First there was machine learning (ML). As time passed, lazy folks like me decided to jump on the machine learning bandwagon, and someone felt the need for automatic machine learning, so that less time goes into learning the tools and more into analyzing the results.

Thus, H2O AutoML was born. What follows is an introduction to this awesome tool.

Excuse me? What is this H2O again?

A Gentle Introduction to H2O AutoML - Sefik Ilkin Serengil

As already explained, the lazy people jumped on the machine learning bandwagon. They are no experts in algorithm selection or feature selection, so what do they do?

They get a tool that is easy as pie to use and does the work of an expert. This is where H2O comes into play, with its easy-to-use interface and its availability in your language of choice. H2O wanted to be cross-platform, and what better language for that than Java? You guessed it right: it is based on the JVM. The picture below shows the architecture of the application.


H2O’s core is written in Java, which gives the whole framework multi-threading support. Although the core is written in Java, you get interfaces for R, Python, and the web.

In essence, H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that makes building machine learning models easy.

Installation process – R

Wait a minute, what is R? It's a language that we will discuss in future posts.

If you want to use H2O functionality in R, you can simply install the h2o package using the command install.packages("h2o").

library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
##     > h2o.init()
##
## For H2O package documentation, ask for help:
##     > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------

##
## Attaching package: 'h2o'

## The following objects are masked from 'package:stats':
##
##     cor, sd, var

## The following objects are masked from 'package:base':
##
##     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
h2o.init()
##  Connection successful!
##
R is connected to the H2O cluster: 
    H2O cluster uptime:         4 seconds 473 milliseconds 
    H2O cluster timezone:       Asia/Karachi 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.28.0.4 
    H2O cluster version age:    3 months and 7 days  
    H2O cluster name:           H2O_started_from_R_SALMAN_avd615 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.75 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 3.5.3 (2019-03-11) 

Initializing H2O might throw an error if Java is not installed on your system, since H2O runs on the JVM. Install the JDK to resolve the issue.

Installation process – Python

If you are using Python, the same approach applies. From the command line, run

pip install -U h2o

and h2o will be installed in your Python environment. Initializing h2o works the same way as in R.

import h2o
from h2o.automl import H2OAutoML

h2o.init()
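Since h2o.init() needs Java, it can help to check for it before initializing. The following is a minimal pre-flight sketch using only the Python standard library; the helper names find_java and java_version_banner are made up for this illustration and are not part of h2o.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the text printed by `java -version` (Java writes it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    java = find_java()
    if java is None:
        print("Java not found on PATH; install a JDK before calling h2o.init().")
    else:
        print(java_version_banner(java))
```

This only tells you whether a java executable is reachable from your shell. It does not check that the installed version is one your H2O release supports.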

The h2o.init() command is pretty smart and does a lot of work. It first looks for an active h2o instance and connects to it if one exists. Only when no instance is found does it start a new one.

It also takes arguments for allocating resources to the h2o instance. The most frequently used are:
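To get a feel for the "look first, then start" idea, here is a rough sketch that probes whether anything is listening on H2O's default port, 54321. This is not how h2o.init() is implemented; is_listening is a hypothetical helper, and an open port only suggests that an instance may be running, since any other program could be using that port.

```python
import socket


def is_listening(host="localhost", port=54321, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if is_listening():
        print("Port 54321 is in use; h2o.init() would likely reuse that instance.")
    else:
        print("Nothing on port 54321; h2o.init() would start a new instance.")
```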

  • nthreads: The default value is -1, which means the instance can use all the cores of the CPU. Pass a positive number to limit how many cores are used.
  • max_mem_size: Restricts the maximum memory allocated to the instance. It is a string, so pass ‘2g’ or ‘2G’ for 2 GB of memory, or use an m/M suffix to allocate in MB.
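The two arguments above can be put together as follows. This is a small sketch, assuming you want to express memory in whole gigabytes; init_options and start_h2o are hypothetical helper names, while nthreads and max_mem_size are the h2o.init() arguments described above.

```python
def init_options(cores=-1, memory_gb=None):
    """Build keyword arguments for h2o.init().

    cores=-1 keeps the default (all cores); memory_gb=None leaves the
    memory limit up to H2O.
    """
    if cores == 0 or cores < -1:
        raise ValueError("cores must be -1 (all cores) or a positive integer")
    options = {"nthreads": cores}
    if memory_gb is not None:
        if not isinstance(memory_gb, int) or memory_gb <= 0:
            raise ValueError("memory_gb must be a positive whole number")
        # max_mem_size is a string such as "2G".
        options["max_mem_size"] = f"{memory_gb}G"
    return options


def start_h2o(cores=-1, memory_gb=None):
    """Start (or connect to) H2O with the given resource limits."""
    import h2o  # imported here so init_options works without h2o installed

    h2o.init(**init_options(cores, memory_gb))
    return h2o
```

For example, start_h2o(cores=2, memory_gb=2) is equivalent to calling h2o.init(nthreads=2, max_mem_size="2G").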

Once the instance is running, you can access Flow by opening http://localhost:54321 in your browser. Flow is the web interface that ships with h2o and requires no extra installation. It is written in CoffeeScript (a JavaScript-like language). You can use it to do the following:

  • Upload data directly
  • View data uploaded by the client
  • Create models directly
  • View models created by you or your client
  • View predictions
  • Run predictions directly

The interface is quite useful and easy for non-experts to use. I recommend trying it and running some experiments of your own.

AutoML

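Now, on to AutoML. This H2O feature automates the training and tuning of many models within a user-specified time limit.

The current version of H2O provides the algorithms listed below. AutoML itself trains and cross-validates a subset of the supervised ones (GLM, DRF, GBM, XGBoost, and Deep Learning) and then builds Stacked Ensembles from them.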

Supervised

  • Cox Proportional Hazards (CoxPH)
  • Deep Learning (Neural Networks)
  • Distributed Random Forest (DRF)
  • Generalized Linear Model (GLM)
  • Generalized Additive Models (GAM)
  • Gradient Boosting Machine (GBM)
  • Naïve Bayes Classifier
  • Stacked Ensembles
  • Support Vector Machine (SVM)
  • XGBoost

Unsupervised

  • Aggregator
  • Generalized Low Rank Models (GLRM)
  • Isolation Forest
  • K-Means Clustering
  • Principal Component Analysis (PCA)

Miscellaneous

  • Word2vec

AutoML should cater to all aspects of the workflow: data preparation, model generation, and ensembles. It should also expose as few parameters as possible, so that users can get work done with much less confusion. H2O AutoML does this with ease and with minimal parameters passed by the user.
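To illustrate how few parameters are needed, here is a minimal sketch of an AutoML run in Python. It assumes a CSV file with a target column; run_automl and predictor_columns are hypothetical helper names, and the time limit and seed are arbitrary example values.

```python
def predictor_columns(columns, target):
    """Return every column name except the target."""
    return [c for c in columns if c != target]


def run_automl(csv_path, target, time_limit_secs=60, classification=True, seed=1):
    """Train AutoML on a CSV file and return the leaderboard of models."""
    import h2o  # imported here so predictor_columns works without h2o installed
    from h2o.automl import H2OAutoML

    h2o.init()
    frame = h2o.import_file(csv_path)
    if classification:
        # A categorical target tells H2O to treat this as classification.
        frame[target] = frame[target].asfactor()

    # The time limit is the main setting; the seed is for reproducibility.
    aml = H2OAutoML(max_runtime_secs=time_limit_secs, seed=seed)
    aml.train(
        x=predictor_columns(frame.columns, target),
        y=target,
        training_frame=frame,
    )
    return aml.leaderboard
```

A call such as run_automl("train.csv", "label") would return the leaderboard, with the best-ranked model first. The file name and column name here are placeholders for your own data.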

We will get into more details about its usage in future posts.