If you are beginning to learn data science, you will encounter Python along the way. In this tutorial, I will show you how to set up a Python environment for data science.
It helps to start with the proper tools. When I first began my journey, I ran into many issues installing and setting up an environment for daily use. So I vowed not to let that happen to anyone else, and wrote this handy guide.
Python is available for nearly all operating systems. In this tutorial, the code is run on Ubuntu and can be replicated on a wide variety of Unix/Linux distributions and on macOS. For Windows systems there are slight differences, which I have tried to highlight wherever necessary.
Python the Language
Like most programming languages, Python needs a program that translates your code into something the machine can run. For Python, that program is an interpreter. There are many options available, and historically two major versions of the official language to choose from.
It is vital to choose the correct interpreter and version to begin with.
Python, Anaconda, Miniconda, PyCharm, IronPython… and many more
Because Python is used in so many ways, there are several popular interpreters, distributions, and tools for working with the language. Each has its own strengths and is tailored to a specific industry or use case. The official version can be obtained from Python.org, the official site of the Python Software Foundation, which is responsible for maintaining the official releases of the language.
Choosing the right version
As with any language, libraries play an important part in Python's usability. Python 3 differs from Python 2 in syntax in several major ways, and for a long time many libraries were available only for Python 2. As a result, Python 2 remained in use for about two decades while a lot of work went into migrating libraries to Python 3 and bridging the gap.
Support for Python 2 officially ended on January 1, 2020, and the Python Software Foundation no longer maintains it. Some developers, data scientists, and enthusiasts still use it, and will continue to do so until the libraries of their choice are ported to Python 3.
Installing Python on your Computer
As explained earlier, the Python interpreter is available for all major operating systems and comes preinstalled on most Linux distributions. Older Ubuntu desktop releases shipped Python 2 and Python 3 side by side, while newer releases include only Python 3 by default.
Since Python 2 has reached its end of life, Python 3 is the only version to consider. My recommendation is to begin with the latest version of the language. To check which version is currently installed, try the following command in the terminal (use Windows PowerShell on Windows). On systems where the python command is missing or still points to Python 2, use python3 instead.
$ python --version
Python 3.6.2
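The same check can be made from inside Python, which is handy at the top of a script or notebook. The sketch below is only an illustration: the helper name is_supported and the minimum version (3, 6) are assumptions picked for the example, not requirements of any particular library.

```python
import sys

# Hypothetical minimum version, chosen only for illustration.
MINIMUM_VERSION = (3, 6)


def is_supported(minimum=MINIMUM_VERSION):
    """Return True if the running interpreter is at least `minimum`."""
    # The first two fields of sys.version_info are (major, minor),
    # and tuples compare element by element.
    return sys.version_info[:2] >= tuple(minimum)


if __name__ == "__main__":
    print("Running Python", sys.version.split()[0])
    print("Meets minimum:", is_supported())
```

Comparing tuples rather than version strings avoids mistakes such as "3.10" sorting before "3.9".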
On Windows, installing the Python interpreter is straightforward. The official downloads page provides an installer that sets up the interpreter in no time.
Installing the libraries with pip
Having the official version of the language installed is usually enough to begin with. As you move along, you will need functionality that the standard installation does not offer, and you will have to extend it. To do this, you will use the libraries discussed above.
So how do you get these libraries? The default installation includes a handy tool called pip. It is your go-to tool for installing libraries from PyPI (the Python Package Index), the official repository for hosting Python libraries, or from local repositories.
Since our main focus is data science, we will first install Jupyter Notebook. It lets you create reproducible code that can be shared online, and it is very similar to the notebooks you use on Kaggle.
$ pip install jupyter
In addition to installing new libraries, pip offers a host of commands for maintaining them. They are given below for your reference.
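|Command|Description|
|pip install <package>|Install a package|
|pip install --upgrade <package>|Update a package|
|pip uninstall <package>|Uninstall a package|
|pip download <package>|Download a package|
|pip list|List installed packages|
|pip install -r requirements.txt|Install packages listed in the given requirements.txt file|
|pip show <package>|Show information about the package|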
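When you have more than one Python installed, it is easy to run a pip that belongs to a different interpreter than the one you intend. One common way around this is to call pip as a module of a specific interpreter. The sketch below only builds such a command and does not run it. The helper name pip_command is an assumption made for this example.

```python
import sys


def pip_command(*args):
    """Build an argument list that runs pip with the current interpreter."""
    # "python -m pip" targets the interpreter that is running this script,
    # so packages end up where that interpreter will look for them.
    return [sys.executable, "-m", "pip", *args]


if __name__ == "__main__":
    # Printing the commands instead of running them keeps the sketch
    # free of side effects.
    print(pip_command("list"))
    print(pip_command("install", "-r", "requirements.txt"))
```

To actually execute one of these commands, the list can be passed to subprocess.run(command, check=True).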
It is worth noting that pip also takes care of the dependencies of the libraries being installed, so you don’t need to worry about that part.
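If you are curious which dependencies an installed package declares, the standard library can report them. This is a minimal sketch that assumes Python 3.8 or later, where importlib.metadata is available. The helper name declared_dependencies is hypothetical.

```python
from importlib import metadata


def declared_dependencies(package):
    """Return the requirement strings a package declares, or [] if none."""
    try:
        # requires() returns None when the package declares no dependencies.
        return metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        return []


if __name__ == "__main__":
    for requirement in declared_dependencies("jupyter"):
        print(requirement)
```

The function only reads metadata of packages that are already installed. It does not contact PyPI.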
After you have installed Jupyter, the notebook can be opened with the following command in the terminal or Windows PowerShell:
$ jupyter notebook
Virtual environments are a handy way to isolate the libraries you install in a self-contained directory, so that you can try out experimental features without putting your system in jeopardy. They come in handy when a new package depends on a different version of a library that your current project already uses. Without isolation, you are left with two options:
- Install the new package without its dependencies. The library will probably malfunction unless you make some changes to its code.
- Install the new package with its dependencies. Your current project will be affected, forcing you to upgrade the project configuration and make some changes so that it works as expected.
These scenarios can arise when working on just two projects at the same time. Now think of more complex situations, where you work on more than two projects, or where a set of Python-based applications depends on a library that must stay installed without modifications. That can be a real headache. Fortunately, Python provides a mechanism for isolating a project's configuration from other projects and applications: the venv module, which is part of the standard library from Python 3.3 onward. On older interpreters, the third-party virtualenv library offers the same functionality and can be installed using pip as follows:
$ pip install -U virtualenv
On Ubuntu, the venv module is packaged separately as the system package python3-venv, which can be installed using apt-get as follows:
$ sudo apt-get install python3-venv
Creating a new virtual environment is easy. From the command line, type the following command to create a new virtual environment inside the directory new_venv.
$ python -m venv new_venv
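The same environment can also be created from Python with the standard library venv module, which is useful in setup scripts. The following is a minimal sketch. The helper name create_environment is an assumption, and pip is left out by default so that the example does not depend on ensurepip being available.

```python
import os
import tempfile
import venv
from pathlib import Path


def create_environment(directory, with_pip=False):
    """Create a virtual environment in `directory` and return its path."""
    target = Path(directory)
    # Use symlinks everywhere except Windows, mirroring the default
    # behaviour of the "python -m venv" command line.
    venv.create(target, with_pip=with_pip, symlinks=(os.name != "nt"))
    return target


if __name__ == "__main__":
    # A temporary directory keeps the demonstration from leaving files behind.
    with tempfile.TemporaryDirectory() as scratch:
        env_path = create_environment(Path(scratch) / "new_venv")
        # pyvenv.cfg records which base interpreter the environment uses.
        print((env_path / "pyvenv.cfg").read_text())
```

Passing with_pip=True would also install pip into the new environment, as the command line tool does by default.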
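To load the environment on Linux or macOS, source the activate script inside the bin directory (source new_venv/bin/activate). Windows users should instead execute the activate script installed inside the Scripts directory.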
new_venv/Scripts/activate
(new_venv) $ ...
Notice that once the virtual environment has been loaded, the shell prompt changes to show the virtual environment name in parentheses. At a glance, the virtual environment contains the following elements:
- A Python interpreter. It is the same one that was used to create the virtual environment. Notice that unless you change the python alias, you will have to explicitly use python3, or even python3.6 if you have more than one Python 3 installed.
- The pip package manager. It installs all libraries inside the lib/pythonX.Y/site-packages directory, which the Python interpreter uses for resolving dependencies.
- Activation/deactivation scripts. They help load the project environment and return to the default configuration once it is no longer needed.
With the virtual environment loaded successfully, you can install any library as follows:
$ pip install requests
This command installs the requests library inside the virtual environment, where it can be loaded by the Python interpreter stored in the bin folder of the virtual environment directory. You can check the installation with the pip show command (pip show requests), which provides details about the installed library. Finally, once you stop working on a project, you can deactivate the virtual environment by executing the deactivate command.
(new_venv) $ deactivate
$
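While an environment is still active, you can also confirm from Python itself where a library such as requests would be loaded from. The sketch below uses the standard library importlib.util.find_spec. The helper name module_location is an assumption made for this example.

```python
import importlib.util


def module_location(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    # find_spec returns None when the module cannot be found.
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Inside an activated environment, this path should point into the
    # environment's own site-packages directory if requests is installed.
    print(module_location("requests"))
```

Running it once inside the environment and once after deactivating shows whether the library is visible only to the isolated interpreter.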
So simple, so powerful, so helpful. Virtual environments are here to help! 😀
TIP: The location where packages are installed can be checked through the sys.prefix variable.
Try printing it in an activated environment, and also in the default configuration.
>>> import sys
>>> print(sys.prefix)
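Building on the tip, a script can tell whether it is running inside a virtual environment by comparing sys.prefix with sys.base_prefix. This is a minimal sketch that assumes an environment created with the venv module on Python 3.3 or later. The helper name in_virtual_environment is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("prefix:     ", sys.prefix)
    print("base prefix:", sys.base_prefix)
    print("in a virtual environment:", in_virtual_environment())
```

Outside any environment the two values are identical, so the function returns False.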