Seaborn Complete

This page consolidates all the tutorials posted on the blog about data visualization with seaborn.

Installing seaborn

To install the seaborn library, enter the following command in a terminal or in Windows PowerShell.

pip install seaborn

Using Jupyter notebooks

Once you have installed seaborn, you can import the library in a Jupyter notebook. To learn more about Jupyter notebooks, click here.

Set up the notebook

There are a few libraries that you need to load in your Jupyter notebook (hereinafter referred to as the notebook). Run the following code to load them. Notice that it returns Setup Complete as output.

import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")
Setup Complete

Load the data

In this tutorial, we’ll work with a dataset of historical FIFA rankings for six countries: Argentina (ARG), Brazil (BRA), Spain (ESP), France (FRA), Germany (GER), and Italy (ITA). The dataset is stored as a CSV file (short for comma-separated values file).

To load the data into the notebook, we use the pandas read_csv function in two steps:

  • begin by specifying the location (or filepath) where the dataset can be accessed, and then
  • use the filepath to load the contents of the dataset into the notebook.

# Path of the file to read
fifa_filepath = "../input/fifaratings.csv"

# Read the file into a variable fifa_data
fifa_data = pd.read_csv(fifa_filepath, index_col="Date", parse_dates=True)

Note that the code cell above has four different lines.

Lines beginning with #

Two of the lines are preceded by a pound sign (#) and contain text that appears faded and italicized.

Both of these lines are completely ignored by the computer when the code is run, and they only appear here so that any human who reads the code can quickly understand it. We refer to these two lines as comments, and it’s good practice to include them to make sure that your code is readily interpretable.

Executable code

The other two lines are executable code, or code that is run by the computer (in this case, to find and load the dataset).

The first line sets the value of fifa_filepath to the location where the dataset can be accessed. In this case, we’ve provided the filepath for you (in quotation marks). Note that the comment immediately above this line of executable code provides a quick description of what it does!

The second line sets the value of fifa_data to contain all of the information in the dataset. This is done with pd.read_csv (a detailed tutorial on how to do this can be found here). Specifically, we supply:

  • fifa_filepath – The filepath for the dataset always needs to be provided first.
  • index_col="Date" – When we load the dataset, we want each entry in the first column to denote a different row. To do this, we set the value of index_col to the name of the first column ("Date", found in cell A1 of the file when it’s opened in Excel).
  • parse_dates=True – This tells the notebook to understand each row label as a date (as opposed to a number or other text with a different meaning).
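The effect of index_col and parse_dates can be checked on a small scale. The sketch below is only an illustration: it inlines three rows copied from the fifa_data.head() output in this tutorial instead of reading the real file, and the variable names csv_text, sample, and plain are made up for the example.

```python
import io
import pandas as pd

# Three rows copied from the fifa_data.head() output, inlined so the
# example runs without the real CSV file (an assumption for illustration).
csv_text = """Date,ARG,BRA,ESP,FRA,GER,ITA
1993-08-08,5.0,8.0,13.0,12.0,1.0,2.0
1993-09-23,12.0,1.0,14.0,7.0,5.0,2.0
1993-10-22,9.0,1.0,7.0,14.0,4.0,3.0
"""

# With index_col and parse_dates, "Date" becomes a date-typed row index
sample = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

# Without them, "Date" stays an ordinary text column
plain = pd.read_csv(io.StringIO(csv_text))

print(type(sample.index).__name__)
print(list(plain.columns))
```

In the first call the dates label the rows, which is what lets a line chart place them on the horizontal axis. In the second call "Date" is just one more column.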

These details will make more sense soon, when you have a chance to load your own dataset.

By the way, you might have noticed that these lines of code don’t have any output (whereas the lines of code you ran earlier in the notebook returned Setup Complete as output). This is expected behavior — not all code will return output, and this code is a prime example!

Examine the data

Now, we’ll take a quick look at the dataset in fifa_data, to make sure that it loaded properly.

We will use the .head() command to return the first five rows of the dataset. To use it:

  • begin with the variable containing the dataset (in this case, fifa_data), and then
  • follow it with .head().

You can see this in the line of code below.

# Prints the first 5 rows of the data
fifa_data.head()

Out[3]:

Date         ARG   BRA   ESP   FRA   GER   ITA
1993-08-08   5.0   8.0  13.0  12.0   1.0   2.0
1993-09-23  12.0   1.0  14.0   7.0   5.0   2.0
1993-10-22   9.0   1.0   7.0  14.0   4.0   3.0
1993-11-19   9.0   4.0   7.0  15.0   3.0   1.0
1993-12-23   8.0   3.0   5.0  15.0   1.0   2.0

Check now that the first five rows agree with the image of the dataset (from when we saw what it would look like in Excel) above.

Plotting the data

In Python, making graphs and charts is referred to as plotting. You will see the phrase "plotting data" all the time; it simply means generating graphs of the data.

Check out the following simple code that defines our graph.

# Set the width and height of the figure
plt.figure(figsize=(16,6))

# Line chart showing how FIFA rankings evolved over time 
sns.lineplot(data=fifa_data)

Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7adde8add8>

Line plots

Select a dataset

The dataset for this tutorial comes from the Los Angeles Data Portal and tracks monthly visitors to each museum. You can filter the data down to the museums you are interested in.

Load the data

As you learned in the previous tutorial, we load the dataset using the pd.read_csv command.

# Path of the file to read
museum_filepath = "../input/museum_visitors.csv"

# read the file into a variable museum_data
museum_data = pd.read_csv(museum_filepath, index_col="Date", parse_dates=True)

The end result of running both lines of code above is that we can now access the dataset by using museum_data.

Examine the data

We can print the last five rows of the dataset by using the tail command, which works just like the head command that you learned about in the previous tutorial.

# Print the last 5 rows of the data
museum_data.tail()

Empty entries will appear as NaN, which is short for “Not a Number”.

We can also take a look at the first five rows of the data by making only one small change (where .tail() becomes .head()).
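The difference between the two commands can be seen on a small stand-in table. This is a sketch with made-up visitor counts: museum_demo and the "Hypothetical Museum" column are invented for illustration and are not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up monthly visitor counts (not the real Los Angeles data)
dates = pd.date_range("2014-01-01", periods=8, freq="MS")
museum_demo = pd.DataFrame(
    {
        "Avila Adobe": [10, 12, 11, 13, 15, 14, 16, 18],
        # The first entry is deliberately missing, so it shows up as NaN
        "Hypothetical Museum": [np.nan, 5, 6, 7, 8, 9, 10, 11],
    },
    index=dates,
)

first_rows = museum_demo.head()  # first five rows
last_rows = museum_demo.tail()   # last five rows

print(first_rows)
print(last_rows)
```

Both commands return five rows by default. Passing a number, as in museum_demo.head(3), changes how many rows are returned.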

Plot the data

Now that the dataset is loaded into the notebook, we need only a few lines of code to make a line chart!

# Set the width and height of the figure
plt.figure(figsize=(12,6)) 

# Line chart showing the number of visitors to each museum over time 
sns.lineplot(data=museum_data) 

# Add title 
plt.title("Monthly Visitors to Los Angeles City Museums")

As you can see above, the code is relatively short and has four main components:

  • plt.figure tells the notebook to set the size of the figure being generated. Here, figsize=(12,6) sets the figure to 12 inches (in width) by 6 inches (in height).
  • sns.lineplot tells the notebook that we want to create a line chart.
    • Note: Every charting command that you learn here will start with sns, which indicates that the command comes from the seaborn package. For instance, we use sns.lineplot to make line charts. Soon, you’ll learn that we use sns.barplot and sns.heatmap to make bar charts and heatmaps, respectively.
  • data=museum_data selects the data that will be used to create the chart.
  • plt.title tells the notebook to set the title of the plot to whatever is written inside the parentheses.

Note that you will always use this same format when you create a line chart, and the only thing that changes with a new dataset is the name of the dataset. So, if you were working with a different dataset named financial_data, for instance, the line of code would appear as follows:

sns.lineplot(data=financial_data)

Sometimes there are additional details we’d like to modify, like the size of the figure and the title of the chart. As in the code cell above, each of these options can be set with a single line of code.

Plotting a subset of the data

So far, you’ve learned how to plot a line for every column in the dataset. What if you need to plot only a subset of the data?

In the next code cell, we plot the line corresponding to a single column in the dataset, 'Avila Adobe'.

# Set the width and height of the figure
plt.figure(figsize=(12,6)) 
# Add title 
plt.title("Monthly Visitors to Avila Adobe") 
# Line chart showing the number of visitors to Avila Adobe over time 
sns.lineplot(data=museum_data['Avila Adobe']) 
# Add label for horizontal axis 
plt.xlabel("Date")

This code looks very similar to the code we used when we plotted every line in the dataset, but it has a few key differences:

  • Instead of setting data=museum_data, we set data=museum_data['Avila Adobe']. In general, to plot only a single column, we use this format: put the name of the column in single quotes and enclose it in square brackets. (To make sure that you correctly specify the name of the column, you can print the list of all column names with list(museum_data.columns).)
  • Optionally, you can also add label="Avila Adobe" inside sns.lineplot to make the line appear in the legend and set its corresponding label. The code above leaves it out because only one line is plotted.

The final line of code modifies the label for the horizontal axis (or x-axis), where the desired label is placed in quotation marks ("...").
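Selecting a column is plain pandas, so it can be tried without drawing anything. In this sketch, museum_demo and its numbers are made up, and "Hypothetical Museum" is an invented column name; only 'Avila Adobe' comes from the tutorial.

```python
import pandas as pd

# Made-up stand-in for museum_data
museum_demo = pd.DataFrame(
    {"Avila Adobe": [100, 120, 110], "Hypothetical Museum": [40, 45, 50]},
    index=pd.date_range("2014-01-01", periods=3, freq="MS"),
)

# List every column name, to check the exact spelling before selecting one
column_names = list(museum_demo.columns)
print(column_names)

# Square brackets with a quoted name select a single column
one_column = museum_demo["Avila Adobe"]
print(one_column)
```

The selected column keeps the date index, which is why passing it to data= still produces a line over time.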

Heatmaps & Bar plots

Select a dataset

In this tutorial, we’ll work with a dataset from IGN that contains game ratings on a scale from 1 to 10. The dataset can be downloaded here: IGN datasets.

Load the data

As before, we load the dataset using the pd.read_csv command.

# Path of the file to read
ign_filepath = "../input/ign_reviews.csv"

# Read the file into a variable ign_data
ign_data = pd.read_csv(ign_filepath, index_col="Platform")

You may notice that the code is slightly shorter than what we used in the previous tutorial. In this case, the row labels (from the 'Platform' column) are not dates, so we don’t add parse_dates=True. We keep the first two pieces of information:

  • the filepath for the dataset (in this case, ign_filepath), and
  • the name of the column that will be used to index the rows (in this case, index_col="Platform").

Examine the data

Since the dataset is small, we can easily print all of its contents. This is done by writing a single line of code with just the name of the dataset.

# Print the data
ign_data

Bar chart

Say we’d like to create a bar chart showing the average score for racing games, for each platform using seaborn.

# Set the width and height of the figure 
plt.figure(figsize=(8, 6)) 

# Bar chart showing average score for racing games by platform 
sns.barplot(x=ign_data['Racing'], y=ign_data.index) 

# Clear the label for the horizontal axis
plt.xlabel("")

# Add title
plt.title("Average Score for Racing Games, by Platform")

The commands for customizing the text (title and horizontal axis label) and size of the figure are familiar from the previous tutorial. The code that creates the bar chart is new:

# Bar chart showing average score for racing games by platform
sns.barplot(x=ign_data['Racing'], y=ign_data.index)

It has three main components:

  • sns.barplot – This tells the notebook that we want to create a bar chart.
    • Remember that sns refers to the seaborn package, and all of the commands that you use to create charts in this series of tutorial will start with this prefix.
  • x=ign_data['Racing'] – This sets the values plotted along the horizontal axis. In this case, we select the 'Racing' column, so the length of each bar shows the average score for racing games.
  • y=ign_data.index – This sets the categories shown on the vertical axis. In this case, we use the index (the platforms), so the chart has one horizontal bar per platform.

Important Note: You must select the indexing column with ign_data.index, and it is not possible to use ign_data['Platform'] (which will return an error). This is because when we loaded the dataset, the "Platform" column was used to index the rows. We always have to use this special notation to select the indexing column.
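The behavior described in the note can be reproduced with a tiny table. This is a sketch with made-up scores: ign_demo, the platform names, and the 'Puzzle' column are invented for illustration.

```python
import io
import pandas as pd

# Made-up scores standing in for ign_reviews.csv
csv_text = """Platform,Racing,Puzzle
Platform A,6.5,7.0
Platform B,7.2,6.1
"""
ign_demo = pd.read_csv(io.StringIO(csv_text), index_col="Platform")

# The platform names now live in the index, not in the columns
print(list(ign_demo.index))
print(list(ign_demo.columns))

# Asking for the column by name therefore fails with a KeyError
try:
    ign_demo["Platform"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)
```

Once a column has been used as the index, ign_demo.index is the way to reach its values.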

Heatmap

Now we will learn about another chart type: heatmaps.

In the code cell below, we create a heatmap to quickly visualize patterns in ign_data. Each cell is color-coded according to its corresponding value.

# Set the width and height of the figure
plt.figure(figsize=(10,10)) 

# Heatmap showing average game score by platform and genre 
sns.heatmap(ign_data, annot=True) 

# Add label for horizontal axis 
plt.xlabel("Genre") 

# Add title
plt.title("Average Game Score, by Platform and Genre")

The relevant code to create the heatmap is as follows:

# Heatmap showing average game score by platform and genre
sns.heatmap(ign_data, annot=True)

This code has three main components:

  • sns.heatmap – This tells the notebook that we want to create a heatmap.
  • ign_data – This tells the notebook to use all of the entries in ign_data to create the heatmap. It is passed as the first argument, which is the same as writing data=ign_data.
  • annot=True – This ensures that the values for each cell appear on the chart. (Leaving this out removes the numbers from each of the cells!)
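The effect of annot=True can be checked by counting the text labels that the heatmap draws. The sketch below uses a made-up 2x2 table; scores, the platform names, and the 'Puzzle' column are invented, and the Agg backend line is only there so the example also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up average scores, indexed by platform
scores = pd.DataFrame(
    {"Racing": [6.5, 7.2], "Puzzle": [7.0, 6.1]},
    index=pd.Index(["Platform A", "Platform B"], name="Platform"),
)

# With annot=True, every cell gets a number written on it
plt.figure(figsize=(4, 3))
ax_annotated = sns.heatmap(data=scores, annot=True)

# Without it, the cells are colored but carry no text
plt.figure(figsize=(4, 3))
ax_plain = sns.heatmap(data=scores)

print(len(ax_annotated.texts), len(ax_plain.texts))
```

For larger tables, the annotated version tends to need a bigger figsize so that the numbers stay readable.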

How to read the Heatmap?

What patterns can you detect in the table? Each cell’s color reflects the average score for that genre and platform, so cells with similar colors have similar average scores. Use the color bar beside the chart to see which colors correspond to higher and lower scores.

Scatter Plots, Histograms & Density plots

Load and examine the data

We’ll work with a dataset about candies. You can download the dataset here: candy.csv.

If you like, you can read more about the dataset here.

# Path of the file to read
candy_filepath = "../input/candy.csv"

# Read the file into a variable candy_data
candy_data = pd.read_csv(candy_filepath, index_col="id")

As always, we check that the dataset loaded properly by printing the first five rows.

candy_data.head()

The dataset contains 83 rows, where each corresponds to a different candy bar. There are 13 columns:

  • 'competitorname' contains the name of the candy bar.
  • the next 9 columns (from 'chocolate' to 'pluribus') describe the candy. For instance, rows with chocolate candies have "Yes" in the 'chocolate' column (and candies without chocolate have "No" in the same column).
  • 'sugarpercent' provides some indication of the amount of sugar, where higher values signify higher sugar content.
  • 'pricepercent' shows the price per unit, relative to the other candies in the dataset.
  • 'winpercent' is calculated from the survey results; higher values indicate that the candy was more popular with survey respondents.

Scatter plots

To create a simple scatter plot, we use the sns.scatterplot command and specify the values for:

  • the horizontal x-axis (x=candy_data['sugarpercent']), and
  • the vertical y-axis (y=candy_data['winpercent']).

# Scatter plot showing the relationship between 'sugarpercent' & 'winpercent'
sns.scatterplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])

At first glance, the scatter plot shows no obvious correlation between sugar percentage and the popularity of the candy.

To double-check this relationship, you might like to add a regression line, or the line that best fits the data. We do this by changing the command to sns.regplot.

# Scatter plot w/ regression line showing the relationship between 'sugarpercent' and 'winpercent'
sns.regplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])

Since the regression line has a slightly positive slope, this tells us that there is a slightly positive correlation between 'winpercent' and 'sugarpercent'. Thus, people have a slight preference for candies containing relatively more sugar.
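The link between the slope of the fitted line and the correlation can also be checked numerically. For a least-squares line, the slope equals the correlation multiplied by the ratio of the two standard deviations, so the two always share the same sign. The sketch below uses made-up values (not taken from candy.csv) and NumPy rather than seaborn.

```python
import numpy as np

# Made-up values for illustration only (not taken from candy.csv)
sugar = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
win = np.array([40.0, 55.0, 45.0, 60.0, 50.0])

# Least-squares line: win is approximately slope * sugar + intercept
slope, intercept = np.polyfit(sugar, win, deg=1)

# Pearson correlation between the two variables
r = np.corrcoef(sugar, win)[0, 1]

print(f"slope = {slope:.2f}, correlation = {r:.2f}")
```

For these numbers the slope works out to 12.5 and the correlation to 0.5: the points are scattered, but the trend is upward.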

Color-coded scatter plots

We can use scatter plots to display the relationships between (not two, but…) three variables! One way of doing this is by color-coding the points.

For instance, create a scatter plot to show the relationship between 'pricepercent' (on the horizontal x-axis) and 'winpercent' (on the vertical y-axis). Use the 'chocolate' column to color-code the points.

# Scatter plot showing the relationship between 'pricepercent', 'winpercent', and 'chocolate'
sns.scatterplot(x=candy_data['pricepercent'], y=candy_data['winpercent'], hue=candy_data['chocolate'])

To make the difference between the two groups easier to see, we can use the sns.lmplot command to add two regression lines, one for candies with chocolate and one for candies without.

# Color-coded scatter plot w/ regression lines 

sns.lmplot(x="pricepercent", y="winpercent", hue="chocolate", data=candy_data)

The sns.lmplot command above works slightly differently than the commands you have learned about so far:

  • Instead of setting x=candy_data['pricepercent'] to select the 'pricepercent' column in candy_data, we set x="pricepercent" to specify the name of the column only.
  • Similarly, y="winpercent" and hue="chocolate" also contain the names of columns.
  • We specify the dataset with data=candy_data.
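The column-name style is not unique to sns.lmplot. As far as I know, sns.scatterplot accepts both styles, which makes it a convenient place to compare them side by side. The sketch below uses a made-up four-row table; candy_demo and its values are invented, and the Agg backend line is only there so the example also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows standing in for candy_data
candy_demo = pd.DataFrame(
    {
        "pricepercent": [0.2, 0.5, 0.8, 0.9],
        "winpercent": [40.0, 52.0, 66.0, 70.0],
        "chocolate": ["No", "No", "Yes", "Yes"],
    }
)

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Style 1: pass the columns themselves
sns.scatterplot(
    x=candy_demo["pricepercent"],
    y=candy_demo["winpercent"],
    hue=candy_demo["chocolate"],
    ax=ax_left,
)

# Style 2: pass column names plus data= (the style sns.lmplot uses)
sns.scatterplot(
    data=candy_demo, x="pricepercent", y="winpercent", hue="chocolate", ax=ax_right
)
```

Both panels show the same points. The second style keeps the call shorter when several columns of the same table are involved.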

Finally, there’s one more plot that you’ll learn about, that might look slightly different from how you’re used to seeing scatter plots. Usually, we use scatter plots to highlight the relationship between two continuous variables (like "pricepercent" and "winpercent"). However, we can adapt the design of the scatter plot to feature a categorical variable (like "chocolate") on one of the main axes. We’ll refer to this plot type as a categorical scatter plot, and we build it with the sns.swarmplot command.

# Scatter plot showing the relationship between 'chocolate' and 'winpercent'
sns.swarmplot(x=candy_data['chocolate'], y=candy_data['winpercent'])

For the next segment, we will use a pair of new datasets, which you can download here: cancer_m and cancer_b.

You’ll work with a real-world dataset containing information collected from microscopic images of breast cancer tumors.

Each tumor has been labeled as either benign (noncancerous) or malignant (cancerous).

Load and examine the new data

# Load the datasets
cancer_b_filepath = "../input/cancer_b.csv"
cancer_m_filepath = "../input/cancer_m.csv"
cancer_b_data = pd.read_csv(cancer_b_filepath, index_col="Id")
cancer_m_data = pd.read_csv(cancer_m_filepath, index_col="Id")

# Print the first five rows of the (benign) data
cancer_b_data.head()

# Print the first five rows of the (malignant) data
cancer_m_data.head()

Histograms

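Say we would like to create a histogram to compare the two types of tumors. We can do this with the sns.distplot command. Note that sns.distplot is deprecated in recent seaborn releases in favor of sns.histplot and sns.displot, so the code below may emit a warning.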

# Histograms for benign and malignant tumors
sns.distplot(a=cancer_b_data['Area (mean)'], label="Benign", kde=False) 
sns.distplot(a=cancer_m_data['Area (mean)'], label="Malignant", kde=False) 
plt.legend()

We customize the behavior of the command with two additional pieces of information:

  • a= chooses the column we’d like to plot (in this case, we chose 'Area (mean)').
  • kde=False is something we’ll always provide when creating a histogram, as leaving it out will create a slightly different plot.

Malignant tumors have higher values for 'Area (mean)', on average. Malignant tumors have a larger range of potential values.

Density plots

The next type of plot is a kernel density estimate (KDE) plot. In case you’re not familiar with KDE plots, you can think of it as a smoothed histogram.

To make a KDE plot, we use the sns.kdeplot command. Setting shade=True colors the area below the curve (and data= plays the same role that a= played when we made the histogram above). In recent seaborn releases, shade= is deprecated in favor of fill=.

# KDE plots for benign and malignant tumors 
sns.kdeplot(data=cancer_b_data['Radius (worst)'], shade=True, label="Benign")
sns.kdeplot(data=cancer_m_data['Radius (worst)'], shade=True, label="Malignant")
