Lazy man’s guide to Automatic Machine Learning with H2O
First there was machine learning (ML). As time passed, lazy folks like me decided to jump on the machine learning bandwagon, and someone felt the need for automatic machine learning: spend less time learning the tools and more time analyzing the results.
Thus H2O AutoML was born. What follows is an introduction to this awesome tool.
Excuse me? What is this H2O again?
As already explained, the lazy people joined the machine learning bandwagon. They are no experts in algorithm selection or feature selection, so what do they do?
They get a tool that is as easy as pie to use and does the work of an expert. This is where H2O comes into play, with its easy-to-use interface and availability in your language of choice. H2O wanted to be cross-platform, and what better language for that than Java? You guessed it right: it is based on the JVM. The picture below shows the architecture of the application.
H2O’s core is written in Java, which gives the whole framework multi-threading. Although it is written in Java, you get interfaces for R, Python and the web.
In essence, H2O is an open-source, in-memory, distributed, fast and scalable machine learning and predictive analytics platform that makes building machine learning models easy.
Installation process – R
Wait a minute, what is R? It is a language that we will discuss in future posts.
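If you want to use H2O from R, you can simply install the h2o package with the command install.packages("h2o"). Loading it with library(h2o) and then calling h2o.init() prints output like the following: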
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
##     > h2o.init()
##
## For H2O package documentation, ask for help:
##     > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
##     cor, sd, var
## The following objects are masked from 'package:base':
##
##     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
## Connection successful!
## R is connected to the H2O cluster:
    H2O cluster uptime:         4 seconds 473 milliseconds
    H2O cluster timezone:       Asia/Karachi
    H2O data parsing timezone:  UTC
    H2O cluster version:        126.96.36.199
    H2O cluster version age:    3 months and 7 days
    H2O cluster name:           H2O_started_from_R_SALMAN_avd615
    H2O cluster total nodes:    1
    H2O cluster total memory:   1.75 GB
    H2O cluster total cores:    4
    H2O cluster allowed cores:  4
    H2O cluster healthy:        TRUE
    H2O Connection ip:          localhost
    H2O Connection port:        54321
    H2O Connection proxy:       NA
    H2O Internal Security:      FALSE
    H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
    R Version:                  R version 3.5.3 (2019-03-11)
Initializing H2O might throw an error on your system if you don’t have a JDK installed. In that case, install a JDK to resolve the issue.
Installation process – Python
If you are using Python, the same approach applies. From the command line, run
pip install -U h2o
and h2o will be installed in your Python environment. Initialization follows the same process:
import h2o
from h2o.automl import H2OAutoML
h2o.init()
The h2o.init() command is pretty smart and does a lot of work. It first looks for an active H2O instance and connects to it if one exists. Only when no instance is found does it start a new one.
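To make the connect-or-start behaviour concrete, here is a minimal sketch of that decision in plain Python. It is not H2O's actual implementation: the helper names instance_is_listening and connect_or_start are made up for illustration, and the only thing borrowed from H2O is the default port 54321 from the startup message.

```python
import socket


def instance_is_listening(host="localhost", port=54321, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Note: an open port only suggests a running instance; this sketch does
    not verify that the listener is really H2O.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def connect_or_start(host="localhost", port=54321):
    """Describe which branch of the connect-or-start decision applies."""
    if instance_is_listening(host, port):
        return "reuse the running instance"
    return "start a new instance"


print(connect_or_start())
```

In practice you never need this helper, because h2o.init() makes the decision for you. The sketch only shows why calling h2o.init() twice does not give you two instances.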
It also takes arguments that control the resources given to the H2O instance. The most frequently used are:
- nthreads: The default value is -1, which means the instance can use all the cores of the CPU. Pass a positive number to limit how many cores it uses.
- max_mem_size: Restricts the maximum memory allocated to the instance. It is a string: pass ‘2g’ or ‘2G’ for 2 GB of memory, and use the same pattern with ‘m’ or ‘M’ to allocate in MB.
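As a rough illustration of the two arguments above, the sketch below collects them into a dictionary and passes them to h2o.init(). The helper names build_init_kwargs and start_h2o are hypothetical, and the format check on max_mem_size is my own simplification, not H2O's parser. The h2o import is done lazily so the helper can be tried without H2O installed.

```python
import re


def build_init_kwargs(nthreads=-1, max_mem_size=None):
    """Collect resource settings for h2o.init() and sanity-check them."""
    if nthreads != -1 and nthreads < 1:
        raise ValueError("nthreads must be -1 (all cores) or a positive integer")
    kwargs = {"nthreads": nthreads}
    if max_mem_size is not None:
        # Simplified check: digits followed by m/M (megabytes) or g/G (gigabytes).
        if not re.fullmatch(r"\d+[mMgG]", max_mem_size):
            raise ValueError("max_mem_size should look like '2G' or '512m'")
        kwargs["max_mem_size"] = max_mem_size
    return kwargs


def start_h2o(**kwargs):
    """Start (or connect to) H2O; needs the h2o package and Java installed."""
    import h2o  # imported lazily so build_init_kwargs works without h2o

    h2o.init(**kwargs)
    return h2o


settings = build_init_kwargs(nthreads=2, max_mem_size="2G")
print(settings)
# start_h2o(**settings)  # uncomment on a machine with h2o and Java
```

The same thing can of course be written directly as h2o.init(nthreads=2, max_mem_size="2G"). The helper only makes the expected value shapes explicit.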
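H2O also ships with a web interface, available at http://localhost:54321 once the instance is running. From it you can:
- Upload data directly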
- View data uploaded by the client
- Create models directly
- View models created by you or your client
- View predictions
- Run predictions directly
The interface is quite useful and makes things easy for non-experts. I recommend trying it and running some experiments of your own.
Now let's talk about AutoML. This H2O feature automates the training and tuning of many models within a user-specified time limit.
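H2O as a whole implements the algorithms listed below. The current version of the AutoML function trains and cross-validates only a subset of them, including GLM, DRF, GBM and Deep Learning, and then combines the results in Stacked Ensembles.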
- Cox Proportional Hazards (CoxPH)
- Deep Learning (Neural Networks)
- Distributed Random Forest (DRF)
- Generalized Linear Model (GLM)
- Generalized Additive Models (GAM)
- Gradient Boosting Machine (GBM)
- Naïve Bayes Classifier
- Stacked Ensembles
- Support Vector Machine (SVM)
- Generalized Low Rank Models (GLRM)
- Isolation Forest
- K-Means Clustering
- Principal Component Analysis (PCA)
When we say AutoML, it should cover all aspects of data preparation, model generation and ensembling. It should also expose as few parameters as possible so that users can get work done with much less confusion. H2O AutoML does this with ease, needing only minimal parameters from the user.
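To give a feel for how few parameters are involved, here is a minimal sketch of an AutoML run from Python. The function names split_columns and run_automl, the file name train.csv and the column name label are placeholders of mine. The H2O calls used (h2o.import_file, H2OAutoML with max_runtime_secs and seed, train, leaderboard) are part of the Python package, but treat the overall flow as an assumption-laden outline rather than a recommended recipe.

```python
def split_columns(columns, target):
    """Return (predictors, response) in the shape H2OAutoML.train() expects."""
    if target not in columns:
        raise ValueError(f"target column {target!r} not found")
    predictors = [c for c in columns if c != target]
    return predictors, target


def run_automl(csv_path, target, max_runtime_secs=60, seed=1):
    """Train AutoML on a CSV file and return the fitted H2OAutoML object."""
    import h2o  # lazy imports: requires the h2o package and Java
    from h2o.automl import H2OAutoML

    h2o.init()
    frame = h2o.import_file(csv_path)
    # For classification, the response column should be categorical:
    # frame[target] = frame[target].asfactor()
    x, y = split_columns(frame.columns, target)

    # The time limit is the only tuning knob set here.
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=seed)
    aml.train(x=x, y=y, training_frame=frame)
    print(aml.leaderboard.head())
    return aml


# Example call with placeholder names:
# aml = run_automl("train.csv", "label", max_runtime_secs=120)
```

After training, aml.leader holds the best model on the leaderboard, and it can be used for predictions like any other H2O model.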
We will get into more details about its usage in future posts.