As the title suggests, today we will be looking into scatter plots.
Set up the notebook
You should be comfortable with this step by now; the following code is for reference if you are just joining us.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load and examine the data
We’ll work with a dataset about candies. You can access the dataset here.
If you like, you can read more about the dataset here.
# Path of the file to read
candy_filepath = "../input/candy.csv"
# Read the file into a variable candy_data
candy_data = pd.read_csv(candy_filepath)
As always, we check that the dataset loaded properly by printing the first five rows.
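A minimal sketch of that check, using a tiny made-up stand-in for candy_data (the rows and values are invented for illustration, not taken from candy.csv) so the snippet runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for candy_data; every value here is made up.
candy_data = pd.DataFrame({
    "competitorname": ["Candy A", "Candy B", "Candy C", "Candy D", "Candy E", "Candy F"],
    "chocolate": ["Yes", "No", "Yes", "No", "No", "Yes"],
    "sugarpercent": [0.73, 0.60, 0.01, 0.46, 0.90, 0.31],
    "pricepercent": [0.86, 0.51, 0.12, 0.33, 0.77, 0.65],
    "winpercent": [67.0, 46.1, 32.3, 50.3, 41.9, 72.5],
})

# head() returns the first five rows by default
first_rows = candy_data.head()
print(first_rows)
```

With the real file, the same candy_data.head() call applies directly after pd.read_csv.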
The dataset contains 83 rows, where each corresponds to a different candy bar. There are 13 columns:
- 'competitorname' contains the name of the candy bar.
- The next 9 columns (from 'chocolate' to 'pluribus') describe the candy. For instance, rows with chocolate candies have "Yes" in the 'chocolate' column (and candies without chocolate have "No" in the same column).
- 'sugarpercent' provides some indication of the amount of sugar, where higher values signify higher sugar content.
- 'pricepercent' shows the price per unit, relative to the other candies in the dataset.
- 'winpercent' is calculated from the survey results; higher values indicate that the candy was more popular with survey respondents.
To create a simple scatter plot, we use the sns.scatterplot command and specify the values for:
- the horizontal x-axis (x=candy_data['sugarpercent']), and
- the vertical y-axis (y=candy_data['winpercent']).
# Scatter plot showing the relationship between 'sugarpercent' & 'winpercent'
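A minimal sketch of this call, using a small made-up stand-in for candy_data (the values are invented for illustration, not taken from candy.csv) so that it runs on its own:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for candy_data; every value here is made up.
candy_data = pd.DataFrame({
    "sugarpercent": [0.73, 0.60, 0.01, 0.46, 0.90, 0.31, 0.22, 0.55],
    "winpercent": [67.0, 46.1, 32.3, 50.3, 41.9, 72.5, 38.0, 55.4],
})

# Sugar content on the horizontal axis, popularity on the vertical axis
ax = sns.scatterplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])
```

Because the columns are passed as pandas Series, seaborn labels the axes with the column names.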
The scatter plot shows no strong correlation between the sugar percentage and the popularity of the candy.
To double-check this relationship, you might like to add a regression line, or the line that best fits the data. We do this by changing the command to sns.regplot.
# Scatter plot w/ regression line showing the relationship between 'sugarpercent' and 'winpercent'
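A minimal sketch of a scatter plot with a fitted line, assuming sns.regplot as the plotting call and a small made-up stand-in for candy_data. The correlation coefficient is added here only as a numeric companion to the line:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for candy_data; every value here is made up.
candy_data = pd.DataFrame({
    "sugarpercent": [0.73, 0.60, 0.01, 0.46, 0.90, 0.31, 0.22, 0.55],
    "winpercent": [67.0, 46.1, 32.3, 50.3, 41.9, 72.5, 38.0, 55.4],
})

# Scatter plot plus the least-squares line through the points
ax = sns.regplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])

# Pearson correlation between the two columns (always between -1 and 1)
corr = candy_data['sugarpercent'].corr(candy_data['winpercent'])
print(corr)
```

A slope close to flat, together with a correlation near zero, would suggest that sugar content tells us little about popularity.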
Color-coded scatter plots
We can use scatter plots to display the relationships between (not two, but…) three variables! One way of doing this is by color-coding the points.
For instance, create a scatter plot to show the relationship between
'pricepercent' (on the horizontal x-axis) and
'winpercent' (on the vertical y-axis). Use the
'chocolate' column to color-code the points.
# Scatter plot showing the relationship between 'pricepercent', 'winpercent', and 'chocolate'
sns.scatterplot(x=candy_data['pricepercent'], y=candy_data['winpercent'], hue=candy_data['chocolate'])
To compare the two groups more directly, we can use the
sns.lmplot command to add two regression lines, one for chocolate candies and one for non-chocolate candies.
# Color-coded scatter plot w/ regression lines
sns.lmplot(x="pricepercent", y="winpercent", hue="chocolate", data=candy_data)
The sns.lmplot command above works slightly differently from the commands you have learned about so far:
- Instead of setting x=candy_data['pricepercent'] to select the 'pricepercent' column in candy_data, we set x="pricepercent" to specify the name of the column only.
- Similarly, y="winpercent" and hue="chocolate" also contain the names of columns.
- We specify the dataset with data=candy_data.
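The two ways of pointing at a column can be compared side by side. The sketch below uses sns.scatterplot, which accepts either style, with a small made-up stand-in for candy_data, and checks that both calls plot the same points:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for candy_data; every value here is made up.
candy_data = pd.DataFrame({
    "pricepercent": [0.86, 0.51, 0.12, 0.33, 0.77, 0.65, 0.28, 0.92],
    "winpercent": [67.0, 46.1, 32.3, 50.3, 41.9, 72.5, 38.0, 55.4],
    "chocolate": ["Yes", "No", "No", "No", "No", "Yes", "No", "Yes"],
})

# Style 1: pass the columns themselves
fig1, ax1 = plt.subplots()
sns.scatterplot(x=candy_data['pricepercent'], y=candy_data['winpercent'],
                hue=candy_data['chocolate'], ax=ax1)

# Style 2: pass column names and name the dataset with data=
fig2, ax2 = plt.subplots()
sns.scatterplot(x="pricepercent", y="winpercent", hue="chocolate",
                data=candy_data, ax=ax2)

# Compare the plotted point coordinates from both figures
points1 = np.asarray(ax1.collections[0].get_offsets())
points2 = np.asarray(ax2.collections[0].get_offsets())
same_points = np.allclose(points1, points2)
print(same_points)
```

The column-name style keeps calls shorter when several arguments refer to the same dataset.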
Finally, there’s one more plot that you’ll learn about, that might look slightly different from how you’re used to seeing scatter plots. Usually, we use scatter plots to highlight the relationship between two continuous variables (like "pricepercent" and
"winpercent"). However, we can adapt the design of the scatter plot to feature a categorical variable (like
"chocolate") on one of the main axes. We’ll refer to this plot type as a categorical scatter plot, and we build it with the sns.swarmplot command.
# Scatter plot showing the relationship between 'chocolate' and 'winpercent'
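A minimal sketch of a categorical scatter plot, assuming sns.swarmplot as the plotting call and a small made-up stand-in for candy_data (values invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for candy_data; every value here is made up.
candy_data = pd.DataFrame({
    "chocolate": ["Yes", "No", "No", "No", "No", "Yes", "No", "Yes"],
    "winpercent": [67.0, 46.1, 32.3, 50.3, 41.9, 72.5, 38.0, 55.4],
})

# Categorical variable on the x-axis, continuous variable on the y-axis
ax = sns.swarmplot(x=candy_data['chocolate'], y=candy_data['winpercent'])
```

Each candy is drawn as one point, and points within a category are spread sideways so they do not overlap.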
We will use a pair of new datasets, given below, for our next segment.
You’ll work with a real-world dataset containing information collected from microscopic images of breast cancer tumors, similar to the image below.
Each tumor has been labeled as either benign (noncancerous) or malignant (cancerous).
Load and examine the new data
# Load the datasets
cancer_b_filepath = "../input/cancer_b.csv"
cancer_m_filepath = "../input/cancer_m.csv"
cancer_b_data = pd.read_csv(cancer_b_filepath, index_col="Id")
cancer_m_data = pd.read_csv(cancer_m_filepath, index_col="Id")
# Print the first five rows of the (benign) data
# Print the first five rows of the (malignant) data
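To see what index_col="Id" does when loading, here is a minimal sketch that reads a tiny in-memory CSV instead of the real files. The two rows, their values, and the 'Diagnosis' column are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for cancer_b.csv; values are made up.
csv_text = """Id,Diagnosis,Radius (worst),Area (mean)
101,B,13.5,460.5
102,B,12.9,430.1
"""

# index_col="Id" uses the Id column as row labels instead of a regular column
cancer_b_data = pd.read_csv(io.StringIO(csv_text), index_col="Id")

# Print the first five rows (here, only two exist)
print(cancer_b_data.head())
```

After loading this way, "Id" no longer appears among the columns, and rows can be looked up by their Id with cancer_b_data.loc[...].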
Say we would like to create a histogram to compare the two types of tumors. We can do this with the sns.distplot command.
# Histograms for benign and malignant tumors
sns.distplot(a=cancer_b_data['Area (mean)'], label="Benign", kde=False)
sns.distplot(a=cancer_m_data['Area (mean)'], label="Malignant", kde=False)
We customize the behavior of the command with two additional pieces of information:
- a= chooses the column we’d like to plot (in this case, we chose 'Area (mean)').
- kde=False is something we’ll always provide when creating a histogram, as leaving it out will create a slightly different plot.
On average, malignant tumors have higher values for 'Area (mean)' than benign tumors. Malignant tumors also have a larger range of potential values.
The next type of plot is a kernel density estimate (KDE) plot. In case you’re not familiar with KDE plots, you can think of one as a smoothed histogram.
To make a KDE plot, we use the
sns.kdeplot command. Setting
shade=True colors the area below the curve (and
data= plays the same role that a= played when we made the histogram above).
# KDE plots for benign and malignant tumors
sns.kdeplot(data=cancer_b_data['Radius (worst)'], shade=True, label="Benign")
sns.kdeplot(data=cancer_m_data['Radius (worst)'], shade=True, label="Malignant")