Scatter plots and whatnot in seaborn

As the title suggests, today we will be looking at scatter plots.

Set up the notebook

You should be comfortable with this step by now; the following code is for your reference if you are just joining us.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

Load and examine the data

We’ll work with a dataset about candies. You can access the dataset here.

If you like, you can read more about the dataset here.

# Path of the file to read
candy_filepath = "../input/candy.csv"

# Read the file into a variable candy_data
candy_data = pd.read_csv(candy_filepath, index_col="id")

As always, we check that the dataset loaded properly by printing the first five rows.
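The check itself can be sketched as follows. This is only an illustration: the three rows are made up and read from an in-memory string rather than from candy.csv, and the name candy_sample is hypothetical. With the real file, the equivalent call would be candy_data.head().

```python
import io
import pandas as pd

# Made-up stand-in for candy.csv (illustrative values, not the real data)
csv_text = """id,competitorname,chocolate,sugarpercent,pricepercent,winpercent
0,Candy A,Yes,0.70,0.80,65.0
1,Candy B,No,0.30,0.20,40.0
2,Candy C,Yes,0.50,0.60,55.0
"""

# index_col="id" uses the 'id' column as the row labels, as in the cell above
candy_sample = pd.read_csv(io.StringIO(csv_text), index_col="id")

# head() returns the first five rows (or all rows, if there are fewer than five)
first_rows = candy_sample.head()
print(first_rows)
```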


The dataset contains 83 rows, where each corresponds to a different candy bar. There are 13 columns:

  • 'competitorname' contains the name of the candy bar.
  • the next 9 columns (from 'chocolate' to 'pluribus') describe the candy. For instance, rows with chocolate candies have "Yes" in the 'chocolate' column (and candies without chocolate have "No" in the same column).
  • 'sugarpercent' provides some indication of the amount of sugar, where higher values signify higher sugar content.
  • 'pricepercent' shows the price per unit, relative to the other candies in the dataset.
  • 'winpercent' is calculated from the survey results; higher values indicate that the candy was more popular with survey respondents.

Scatter plots

To create a simple scatter plot, we use the sns.scatterplot command and specify the values for:

  • the horizontal x-axis (x=candy_data['sugarpercent']), and
  • the vertical y-axis (y=candy_data['winpercent']).
# Scatter plot showing the relationship between 'sugarpercent' & 'winpercent' 

sns.scatterplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])

The scatter plot does not show a strong correlation between sugar percentage and the popularity of the candy.

To double-check this relationship, you might like to add a regression line, or the line that best fits the data. We do this by changing the command to sns.regplot.

# Scatter plot w/ regression line showing the relationship between 'sugarpercent' and 'winpercent' 

sns.regplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])

Since the regression line has a slightly positive slope, this tells us that there is a slightly positive correlation between 'winpercent' and 'sugarpercent'. Thus, people have a slight preference for candies containing relatively more sugar.
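If you would rather read the slope as a number than judge it by eye, one possible approach is to fit a straight line with np.polyfit and compute the Pearson correlation with pandas. The sketch below uses made-up values, not the real candy data, and the name toy is hypothetical. For a least-squares line, the slope and the correlation coefficient always share the same sign.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the two columns (not the real candy data)
toy = pd.DataFrame({
    "sugarpercent": [0.10, 0.25, 0.40, 0.55, 0.70, 0.85],
    "winpercent": [45.0, 52.0, 41.0, 55.0, 48.0, 58.0],
})

# Slope and intercept of the least-squares straight line through the points
slope, intercept = np.polyfit(toy["sugarpercent"], toy["winpercent"], deg=1)

# Pearson correlation coefficient between the two columns
r = toy["sugarpercent"].corr(toy["winpercent"])

print(f"slope = {slope:.2f}, r = {r:.2f}")
```

A correlation close to 0 means the relationship is weak, even if the slope is visibly positive.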

Color-coded scatter plots

We can use scatter plots to display the relationships between (not two, but…) three variables! One way of doing this is by color-coding the points.

For instance, create a scatter plot to show the relationship between 'pricepercent' (on the horizontal x-axis) and 'winpercent' (on the vertical y-axis). Use the 'chocolate' column to color-code the points.

# Scatter plot showing the relationship between 'pricepercent', 'winpercent', and 'chocolate' 

sns.scatterplot(x=candy_data['pricepercent'], y=candy_data['winpercent'], hue=candy_data['chocolate'])

To examine this relationship more closely, we can use the sns.lmplot command to add two regression lines, one for chocolate candies and one for non-chocolate candies.

# Color-coded scatter plot w/ regression lines 

sns.lmplot(x="pricepercent", y="winpercent", hue="chocolate", data=candy_data)

The sns.lmplot command above works slightly differently than the commands you have learned about so far:

  • Instead of setting x=candy_data['pricepercent'] to select the 'pricepercent' column in candy_data, we set x="pricepercent" to specify the name of the column only.
  • Similarly, y="winpercent" and hue="chocolate" also contain the names of columns.
  • We specify the dataset with data=candy_data.
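The column-name style is not limited to sns.lmplot: sns.scatterplot also accepts a data= argument together with column names. The sketch below shows this on a small made-up table (toy is a hypothetical name and the values are illustrative only).

```python
import pandas as pd
import seaborn as sns

# Made-up rows standing in for candy_data (not the real values)
toy = pd.DataFrame({
    "pricepercent": [0.2, 0.5, 0.8, 0.3, 0.9, 0.6],
    "winpercent": [40.0, 55.0, 70.0, 45.0, 66.0, 50.0],
    "chocolate": ["No", "Yes", "Yes", "No", "Yes", "No"],
})

# Pass the table once with data=, then refer to columns by name
ax = sns.scatterplot(data=toy, x="pricepercent", y="winpercent", hue="chocolate")
```

Written this way, the axis labels are taken from the column names.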

Finally, there’s one more plot that you’ll learn about, that might look slightly different from how you’re used to seeing scatter plots. Usually, we use scatter plots to highlight the relationship between two continuous variables (like "pricepercent" and "winpercent"). However, we can adapt the design of the scatter plot to feature a categorical variable (like "chocolate") on one of the main axes. We’ll refer to this plot type as a categorical scatter plot, and we build it with the sns.swarmplot command.

# Scatter plot showing the relationship between 'chocolate' and 'winpercent' 

sns.swarmplot(x=candy_data['chocolate'], y=candy_data['winpercent'])

For our next segment, we will use a pair of new datasets, loaded below.

You’ll work with a real-world dataset containing information collected from microscopic images of breast cancer tumors, similar to the image below.


Each tumor has been labeled as either benign (noncancerous) or malignant (cancerous).

Load and examine the new data

# Load the datasets
cancer_b_filepath = "../input/cancer_b.csv"
cancer_m_filepath = "../input/cancer_m.csv"
cancer_b_data = pd.read_csv(cancer_b_filepath, index_col="Id")
cancer_m_data = pd.read_csv(cancer_m_filepath, index_col="Id")

# Print the first five rows of the (benign) data
cancer_b_data.head()

# Print the first five rows of the (malignant) data
cancer_m_data.head()


Say we would like to create a histogram to compare the two types of tumors. We can do this with the sns.distplot command.

# Histograms for benign and malignant tumors
sns.distplot(a=cancer_b_data['Area (mean)'], label="Benign", kde=False) 
sns.distplot(a=cancer_m_data['Area (mean)'], label="Malignant", kde=False) 

We customize the behavior of the command with two additional pieces of information:

  • a= chooses the column we’d like to plot (in this case, we chose 'Area (mean)').
  • kde=False is something we’ll always provide when creating a histogram, as leaving it out will create a slightly different plot.

The histograms show that malignant tumors have higher values for 'Area (mean)', on average, and a larger range of potential values.

Density plots

The next type of plot is a kernel density estimate (KDE) plot. In case you’re not familiar with KDE plots, you can think of it as a smoothed histogram.

To make a KDE plot, we use the sns.kdeplot command. Setting shade=True colors the area below the curve, and data= chooses the column to plot, playing the same role that a= played in the histogram above.

# KDE plots for benign and malignant tumors 
sns.kdeplot(data=cancer_b_data['Radius (worst)'], shade=True, label="Benign")
sns.kdeplot(data=cancer_m_data['Radius (worst)'], shade=True, label="Malignant")