You must have noticed by now that I am not a very active man. To avoid doing most of the work myself and to spend more time analyzing, I have been looking for automatic tools, as in my previous article about AutoML. If you haven't had the chance to read it, you can check it out here. And now, here is my guide to Exploratory Data Analysis.
What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis is listed as an important step in most methodologies for data analysis (Biecek, 2019; Grolemund and Wickham, 2019). One of the most popular methodologies, CRISP-DM (Wirth, 2000), lists the following phases of a data mining project:
- Business understanding.
- Data understanding.
- Data preparation.
Exploratory Data Analysis, or EDA for short, is the practice of reviewing a dataset to get an initial idea of what it contains. It covers the second step in the list above. Its main purpose is to learn about the components of the data and to spot any patterns that might already exist in it.
Last year, I was working with a dataset that had around 700,000 records and some 50-odd columns. For me, it was quite a big dataset, though I know people have seen much bigger ones. So, to get some sense of what the data was, I had to do some initial investigation. One way of doing so is to go ahead and run basic checks on the data: calculate the mean, median and mode, draw some plots to visualize the frequencies and get some insights, create a word cloud (like the one above) to see the most used words, etc.
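The manual checks described above can be sketched in a few lines of base R. This is only an illustration on the built-in mtcars dataset, not the dataset from the story, and stat_mode is a hypothetical helper, since base R has no function for the statistical mode.

```r
# Illustration on the built-in mtcars data (not the 700,000-row dataset above).
data(mtcars)

# Hypothetical helper: base R has no function for the statistical mode.
# Returns the most frequent value (the first one if there are ties).
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

cyl_mean   <- mean(mtcars$cyl)
cyl_median <- median(mtcars$cyl)
cyl_mode   <- stat_mode(mtcars$cyl)

# Frequency table and a quick bar plot of the counts.
cyl_freq <- table(mtcars$cyl)
barplot(cyl_freq, xlab = "Cylinders", ylab = "Number of cars")

# Per-column overview: min, quartiles, median, mean, max.
summary(mtcars)
```

Even for one column this is several separate calls, and you have to know in advance which column is worth looking at. That is the part the packages below try to automate.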
All these steps are cool and have their very own commands and packages in R and Python, but I was clueless as to what to run, and why, because I knew nothing about the data. Being the lazy person I am, I thought I would outsource this task. My obvious choice was the packages that hardworking and intelligent people have been making for lazy people like me. 🙂
I did some research online and found a bunch of packages that can do just this job for me. I got excited and ended up checking all of them. So here is my list of packages and how to use them in R.
The arsenal package
The arsenal package is a set of four tools for data exploration:
- table of descriptive statistics and p-values of associated statistical tests, grouped by levels of a target variable (the so-called Table 1). Such a table can also be created for paired observations, for example longitudinal data (tableby and paired functions),
- comparison of two data frames that can detect shared variables (compare function),
- frequency tables for categorical variables (freqlist function),
- fitting and summarizing simple statistical models (linear regression, Cox model, etc) in tables of estimates, confidence intervals and p-values (modelsum function).
Results of each function can be saved to a short report using the write2 function. An example can be found in Figure 2.
A separate vignette is available for each of the functions. arsenal is the most statistically-oriented package among reviewed libraries. It borrows heavily from SAS-style procedures used by the authors at the Mayo Clinic.
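Here is a minimal sketch of three of these tools, assuming arsenal is installed. It uses mockstudy, an example dataset shipped with the package; the choice of columns is mine and only for illustration, and the report file name in the last comment is hypothetical.

```r
library(arsenal)

# mockstudy is an example dataset shipped with arsenal.
data(mockstudy)

# "Table 1": descriptive statistics and tests, grouped by treatment arm.
tab1 <- tableby(arm ~ sex + age, data = mockstudy)
summary(tab1, text = TRUE)

# Frequency table for the combinations of two categorical variables.
freq <- freqlist(table(mockstudy$arm, mockstudy$sex, dnn = c("arm", "sex")))
summary(freq)

# Simple linear models of age on each predictor, summarized in one table.
models <- modelsum(age ~ sex + arm, data = mockstudy, family = "gaussian")
summary(models, text = TRUE)

# Saving a table to a short HTML report needs rmarkdown and pandoc:
# write2html(tab1, "table1.html")
```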
The autoEDA package
This package is not available on CRAN and needs to be installed from GitHub.
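A GitHub install usually looks like the sketch below. It assumes the package lives in the XanderHorn/autoEDA repository and that you are fine with using devtools for the install; check the repository name before running it.

```r
# Assumption: the package is hosted at github.com/XanderHorn/autoEDA.
# devtools itself comes from CRAN.
install.packages("devtools")
devtools::install_github("XanderHorn/autoEDA")

library(autoEDA)
```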
Running the package gives you an analysis of the dataset. I was expecting some more features from this package, but anyway. The dataOverview function returns a data frame that describes each feature by its type, number of missing values, outliers and typical descriptive statistics. Values proposed for imputation are also included. Two outlier detection methods are available.
A PDF report can be created using the autoEDA function. It consists of the plots of distributions of predictors grouped by outcome variable or distribution of outcome by predictors.
The DataExplorer package
This was one of my instant favorites. DataExplorer provides a summary of the whole dataset (dimensions, types of variables, missing values, etc.) and it uses a lot of plots to explain the data: scatter plots, box plots, heat maps, you name it.
It also has a create_report function that generates a summary report with all these cool plots.
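A minimal sketch of a DataExplorer session, assuming the package is installed. The built-in airquality dataset is my choice for illustration because it has some missing values, and the report file name in the last comment is hypothetical.

```r
library(DataExplorer)

# airquality is a built-in dataset with some missing values.
data(airquality)

# One-row overview: rows, columns, column types, missing values, memory usage.
overview <- introduce(airquality)
print(overview)

plot_missing(airquality)                # share of missing values per column
plot_histogram(airquality)              # histograms of the continuous columns
plot_correlation(na.omit(airquality))   # correlation heat map

# Full HTML report with the plots above and more; needs rmarkdown and pandoc.
# create_report(airquality, output_file = "airquality_report.html")
```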
The dataMaid package
The dataMaid package has two central functions: the check function, which performs checks of data consistency and validity, and summarize, which summarizes each column. Another function, makeDataReport, automatically creates a report in PDF, DOCX or HTML format. The goal is to detect missing and unusual (outlying or incorrectly encoded) values. The report contains a summary of the whole dataset: variables and their types, number of missing values, and univariate summaries in the form of descriptive statistics, histograms/bar plots and an indication of
The dlookr package
The dlookr package provides tools for several kinds of analysis: data diagnosis (correctness, missing values, outlier detection), exploratory data analysis, and data transformation (imputation, dichotomization, and transformation of continuous features).
It can also produce a PDF report for all these analyses.
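A minimal sketch of the diagnosis side of dlookr, assuming the package is installed. The built-in airquality dataset is again my choice for illustration. I left the report functions out of the sketch because their names differ between dlookr versions.

```r
library(dlookr)

# airquality is a built-in dataset with some missing values.
data(airquality)

# Data diagnosis: type, missing count and percent, unique count and rate
# for every column.
diag <- diagnose(airquality)
print(diag)

# Outlier diagnosis for the numeric columns.
outliers <- diagnose_outlier(airquality)
print(outliers)
```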
Other packages for EDA are as follows
If you want to know more about these packages, please let me know in the comments below and I will do another post explaining them.