
After you have installed Python, you might want to use it for a variety of tasks. In data science or finance, you often need to deal with tabular data. The format most of us are familiar with is the spreadsheet, the kind of file we work with in Microsoft Excel. Finance professionals spend most of their time working in Excel and try to get any sort of data into spreadsheets so that they can do their data analysis there.

I even know some people who write letters in Excel and use it as a word processor because they are so comfortable with it. But like any tool, spreadsheets have their limitations. To overcome them, people write VBA macros to define custom functions. So why learn data handling with Python?

With Python, the possibilities are endless: Python code is generally much faster to execute than VBA and is portable across operating systems. Having said that, let me introduce pandas.

What is Pandas?

pandas is a Python library for working with tabular data. In pandas, a table is called a DataFrame: nothing more than data arranged in rows and columns. pandas was first developed over a decade ago and is open source, free for anyone to use.

With pandas, you get the ability to read various data formats such as CSV, JSON, Excel, Pickle, and more. It lets you represent your data in a row-and-column tabular fashion, which makes the data readable and presentable.

pandas represents the data as a DataFrame and provides extensive functionality for data analysis and manipulation. Once you start making sense of the data using the various pandas functions, you can use it for analysis, forecasting, classification, and much more!

pandas has an input/output API with a set of top-level reader functions and corresponding writer methods. Readers are top-level functions such as pandas.read_json(), which return a pandas object; writers are methods on that object, such as DataFrame.to_json().
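For instance, here is a minimal sketch of this read/write pattern, assuming a placeholder file called data.csv exists locally (both filenames are only for illustration):

import pandas as pd

# Reader: a top-level pandas function that returns a DataFrame
df = pd.read_csv('data.csv')

# Writer: a method on the DataFrame object itself
df.to_json('data.json')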

pandas provides the following functionality:

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time-series functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
  • Highly optimized for performance, with critical code paths written in Cython or C.
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

Installing Pandas

You can install pandas using pip as follows:

pip install pandas

Loading Pandas

In a Python script or a Jupyter notebook, pandas is conventionally imported as follows:

import pandas as pd
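As a quick, optional sanity check that the import worked, you can print the installed version and build a tiny DataFrame:

print(pd.__version__)                    # prints the installed pandas version
print(pd.DataFrame({'a': [1, 2, 3]}))    # a minimal one-column DataFrame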

Importing data

Importing data is one of the most essential and very first steps in any data-related problem. The ability to import data correctly is a must-have skill for every aspiring data scientist.

Data exists in many different forms, and not only should you know how to import various data formats but also how to analyze and manipulate the data to gain useful insights.

pandas provides reader and writer functions: the readers let you load different data formats, while the writers let you save data in a particular format.

pandas supports a wide range of data formats, including CSV, JSON, Excel, HTML, SQL, Pickle, and HDF5. If your data comes in any of these forms, you can use pandas both to load it and to write it back out in that format.

So, to load data for further analysis, you use the corresponding reader function; for example, for JSON data:

Loading JSON Data

JSON, short for JavaScript Object Notation, is a text-based data-interchange format. JSON is easy to read and write. It is based on a subset of the JavaScript programming language, but its conventions are familiar to programmers of many languages, including Python.

JSON is mostly used to store semi-structured data, the kind that does not fit neatly into the rows and columns of a traditional SQL database.

This is mainly built on two structures:

  • A collection of key/value pairs. In Python, a collection of key/value pairs is called a dictionary; keys must be unique, whereas values need not be.
  • An ordered list of values. The ordered list can also sometimes be a list of lists. A list in Python is an ordered collection of values, which can be strings, integers, and so on. (A short example of both structures follows this list.)
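As a small illustration of how these two structures map onto pandas (the values below are made up), a dictionary of keys to equal-length lists becomes a DataFrame with one column per key:

import pandas as pd

# A collection of key/value pairs, where each value is an ordered list
data = {
    "integer": [1, 2, 3],
    "category": ["a", "b", "a"],
}

df = pd.DataFrame(data)   # keys become columns, list entries become rows
print(df)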

So let’s now see how you can load the JSON data in multiple ways.

The first JSON dataset is loaded directly from a URL (shown in the code below). The data is in a key-value dictionary format with a total of three keys: integer, datetime, and category.

  • First, you will import the pandas library and then pass the URL to pd.read_json(), which will return a dataframe. The columns of the dataframe represent the keys, and the rows are the values of the JSON.
import pandas as pd

json = pd.read_json('https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json')

Let’s quickly print the last few rows of the JSON that you read using the .tail() function.

json.tail(6)
    integer             datetime  category
94        5  2015-01-01 00:01:34         0
95        9  2015-01-01 00:01:35         0
96        8  2015-01-01 00:01:36         0
97        6  2015-01-01 00:01:37         0
98        8  2015-01-01 00:01:38         0
99        1  2015-01-01 00:01:39         0
json.shape
(100, 3)

From the above output, you can see that there are three columns in total: integer, datetime, and category. Also, there are 100 samples in the dataset, as verified by the .shape attribute, which returned a 100 x 3 output.

  • Writing a JSON

Writing JSON data is as simple as reading it and takes one line of code. Instead of read_json(), you use to_json() with a filename, and that's all!

json.to_json('dataframe.json')
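If you need control over the layout of the file, to_json also accepts an orient argument; for example, the following sketch writes one JSON object per row to a second, hypothetical file:

# 'records' writes one JSON object per row instead of one object per column
json.to_json('dataframe_records.json', orient='records')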
  • Parsing Nested JSON as a String

Next, you will use another type of JSON dataset, which is not as simple: a nested JSON structure. A nested JSON structure means that the value of a key can itself contain further keys.
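pandas can flatten such structures with pandas.json_normalize (available in recent pandas versions). The following is a minimal sketch on a small, made-up nested record rather than a real dataset:

import pandas as pd

# A made-up nested record: 'user' is a nested dictionary, 'orders' is a list
nested = [
    {"id": 1, "user": {"name": "Ann", "city": "Oslo"}, "orders": [{"item": "book", "qty": 2}]},
    {"id": 2, "user": {"name": "Bob", "city": "Lima"}, "orders": [{"item": "pen", "qty": 5}]},
]

# Flatten the nested 'user' dictionary into dotted column names
flat = pd.json_normalize(nested)
print(flat.columns.tolist())   # ['id', 'orders', 'user.name', 'user.city']

# Expand the 'orders' list into one row per order, carrying the id along
orders = pd.json_normalize(nested, record_path="orders", meta="id")
print(orders)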

Loading HTML Data

HTML, the Hypertext Markup Language, is mainly used for creating web pages and applications. It describes the structure of a web page semantically. The web browser receives an HTML document from a web server and renders it as a multimedia web page.

For web applications, HTML is used together with Cascading Style Sheets (CSS), while on the server side it is generated by web frameworks such as Flask, Django, etc.

HTML uses tags to define each block of content, like a <p></p> tag for the start and end of a paragraph, or an <img> tag for adding an image to the page, and many more tags that together make up an HTML web page.

To read an HTML document, pandas looks for <table></table> tags, which define a table in HTML (inside a table, rows are marked with <tr> and individual cells with <td>).

pandas uses read_html() to read the HTML document.

So, whenever you pass an HTML to pandas and expect it to output a nice looking dataframe, make sure the HTML page has a table in it!
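As a minimal illustration (assuming a parser such as lxml or html5lib is installed), read_html can even parse a hand-written HTML string:

from io import StringIO
import pandas as pd

# A tiny hand-written HTML table, just for illustration
html = """
<table>
  <tr><th>Name</th><th>Ticker</th></tr>
  <tr><td>bitcoin</td><td>BTC</td></tr>
  <tr><td>ethereum</td><td>ETH</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
print(len(tables))   # 1
print(tables[0])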

  • About the Data: You will be using a cryptocurrency website as an HTML dataset. It lists various crypto coins along with several details about each coin, such as:
    • The last price of the coin (Last price)
    • Whether the coin price has increased or decreased, in percent (%)
    • The 24-hour trading volume (24 volume)
    • The total number of coins (# Coins)

So, let’s get started!

You will first import requests library which will help you in sending a request to the URL from where you want to fetch the HTML content.

import requests
url = 'https://www.worldcoinindex.com/'
crypto_url = requests.get(url)
crypto_url
<Response [200]>

So far, you have defined the URL and then, using requests.get(), sent a request to that URL and received a [200] OK response as acknowledgement, which means you were able to connect to that web server.
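If you would rather have the script fail loudly on a bad response instead of silently continuing, requests provides a helper for that:

# Raises an HTTPError for 4xx/5xx responses; does nothing on success
crypto_url.raise_for_status()
print(crypto_url.status_code)   # 200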

Now, to read the content of the HTML web page, all you need to do is call crypto_url.text which will give you the HTML code of that cryptocurrency web page.

Feel free to run the command crypto_url.text to see the output.

Finally, you will pass crypto_url.text to the pd.read_html() function, which will return a list of dataframes, one for each table the cryptocurrency webpage contains.

crypto_data = pd.read_html(crypto_url.text)

Let’s print the length and the type of the result; the type should be a list.

len(crypto_data), type(crypto_data)
(1, list)

From the above output, it is clear that the result is a list containing a single table (dataframe).

crypto_data = crypto_data[0]

Let’s remove the first two columns, since they do not contain any useful information, and keep all the rows.

crypto_final = crypto_data.iloc[:,2:]
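If you prefer label-based code over positional slicing, the same selection can be written with drop (equivalent here, under the assumption that the unwanted columns are the first two):

# Drop the first two columns by label instead of slicing by position
crypto_final = crypto_data.drop(columns=crypto_data.columns[:2])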

Finally, it’s time to print the cryptocurrency dataframe!

crypto_final.head()
          Name Ticker   Last price        %      24 high       24 low  Price Charts 7d  24 volume  # Coins  Market cap
0      bitcoin    BTC  $ 8,008.027   +1.83%  $ 8,056.630  $ 7,812.784              NaN   $ 12.04B   17.71M   $ 141.89B
1     ethereum    ETH  $ 251.72335   +2.68%  $ 253.84721  $ 242.95687              NaN    $ 6.93B  106.20M    $ 26.73B
2     litecoin    LTC  $ 98.633851  +11.08%  $ 99.946324  $ 88.618815              NaN    $ 3.46B   61.91M     $ 6.10B
3  bitcoincash    BCH  $ 411.94075   +2.31%  $ 418.60850  $ 394.18927              NaN    $ 2.10B   17.79M     $ 7.33B
4          eos    EOS  $ 6.4230717   +6.05%  $ 6.4765695  $ 6.0264079              NaN    $ 2.06B    1.01B     $ 6.50B

From the above table, you can observe that Bitcoin has the largest market cap.

Dropping NaN (Not a Number)

There are some values in the dataframe that are not real values (NaN), so let’s quickly remove them from the table.

However, first, let’s completely drop (delete) the Price Charts 7d column since it is entirely NaN and has zero information in it.

del crypto_final['Price Charts 7d']
crypto_final.head()
          Name Ticker   Last price        %      24 high       24 low  24 volume  # Coins  Market cap
0      bitcoin    BTC  $ 8,008.027   +1.83%  $ 8,056.630  $ 7,812.784   $ 12.04B   17.71M   $ 141.89B
1     ethereum    ETH  $ 251.72335   +2.68%  $ 253.84721  $ 242.95687    $ 6.93B  106.20M    $ 26.73B
2     litecoin    LTC  $ 98.633851  +11.08%  $ 99.946324  $ 88.618815    $ 3.46B   61.91M     $ 6.10B
3  bitcoincash    BCH  $ 411.94075   +2.31%  $ 418.60850  $ 394.18927    $ 2.10B   17.79M     $ 7.33B
4          eos    EOS  $ 6.4230717   +6.05%  $ 6.4765695  $ 6.0264079    $ 2.06B    1.01B     $ 6.50B

Now let’s remove the NaN's.

crypto_final = crypto_final.dropna()
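dropna has a few useful options beyond the default; it returns a new dataframe rather than modifying the original, so the variations below do not change crypto_final:

# Default: drop every row that contains at least one NaN
rows_complete = crypto_final.dropna()

# Drop columns that are entirely NaN (how='all'), instead of rows
no_empty_cols = crypto_final.dropna(axis=1, how='all')

# Only consider NaNs in specific columns when deciding which rows to drop
by_market_cap = crypto_final.dropna(subset=['Market cap'])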

Visualizing the Crypto Coins (Ticker) Vs. Percentage Increase/Decrease in Price (%)

Let’s first import matplotlib for plotting the graph.

import matplotlib.pyplot as plt
%matplotlib inline

The % column has type string, and you need to convert it to float. First, you will remove the % sign from the column and then convert the column's type to float.

crypto_final['%'] = crypto_final['%'].apply(lambda x: x.strip('%'))
crypto_final['%'] = crypto_final['%'].astype('float')
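The same cleanup can also be written with pandas' vectorised string methods; here is a sketch on a small, made-up Series of raw percentage strings:

import pandas as pd

raw = pd.Series(['+1.83%', '-0.97%', '+11.08%'])
as_float = raw.str.rstrip('%').astype(float)
print(as_float.tolist())   # [1.83, -0.97, 11.08]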

Let’s now plot the data.

plt.figure(figsize=(16, 10))

# Plot the first 20 coins: tickers on the x-axis, % change on the y-axis
x = crypto_final.iloc[:20]['Ticker']
y = crypto_final.iloc[:20]['%']
plt.plot(x, y, label='% Increase/Decrease in Price')

plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Ticker', fontsize=20)
plt.ylabel('Percentage Increase/Decrease in Price', fontsize=20)
plt.legend(loc='lower left', prop={'size': 15})
plt.show()

From the above graph, you can observe that Litecoin (LTC) and Huobitoken (HT) have the largest increases in price, while Maticnetwork (MATIC) and Waves have the largest decreases among these crypto coins.

Loading Pickle Data

Pickle is a Python-specific binary serialization format which, unlike JSON, is not human-readable. Pickle is used for serializing and deserializing Python object structures: it converts an object such as a DataFrame, list, or dictionary into a byte stream that can be saved to disk.

The best part about Pickle is that it can store various kinds of Python data types.

Pickle is widely used for storing trained machine learning models. Like the JSON tooling you saw earlier, pickle has handy functions: pickle.load() for loading a pickled file and pickle.dump() for saving a Python object in Pickle format.

Another important advantage of using Pickle is that saving a dataframe as a Pickle file typically requires less disk space than a text format and keeps the data types intact when reloaded.

So, let’s quickly pickle the cryptocurrency dataframe you constructed earlier, and then you will read that pickled object using pandas.

import pickle

You will use pickle.dump to save the dataframe as a Pickle object, with protocol=pickle.HIGHEST_PROTOCOL, which selects the most recent (and generally most efficient) protocol available in your Python version; note that files written this way may not be readable by older Python versions.

with open('crypto_final.pickle', 'wb') as sub_data:
    pickle.dump(crypto_final, sub_data, protocol=pickle.HIGHEST_PROTOCOL)
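For completeness, pandas also ships a one-line shortcut that wraps the same idea, if you prefer to stay inside pandas:

# Equivalent one-liner using pandas' built-in pickle writer
crypto_final.to_pickle('crypto_final.pickle')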

Finally, you will use the pandas read_pickle function to convert the pickle object into a dataframe.

crypto_final = pd.read_pickle('crypto_final.pickle')
crypto_final.head()
          Name Ticker   Last price       %      24 high       24 low  24 volume  # Coins  Market cap
0      bitcoin    BTC  $ 7,776.567  +1.81%  $ 7,870.205  $ 7,506.518   $ 12.70B   17.71M   $ 137.78B
1     ethereum    ETH  $ 241.81372  -0.97%  $ 245.57293  $ 232.62523    $ 7.47B  106.18M    $ 25.67B
2     litecoin    LTC  $ 88.062811  +0.33%  $ 88.946501  $ 85.248641    $ 2.59B   61.90M     $ 5.45B
3  bitcoincash    BCH  $ 388.76089  +0.06%  $ 398.27697  $ 370.29831    $ 2.33B   17.79M     $ 6.91B
4          eos    EOS  $ 5.9327948  -0.41%  $ 6.0163442  $ 5.7789154    $ 1.99B    1.01B     $ 6.00B

Wow! Isn’t that amazing? And how straightforward it was. Moreover, with Pickle you do not have to worry about datatype conversions when reloading, because pickling preserves the exact data types you saved.

Please feel free to ask any questions related to this tutorial in the comments section below.