Happy New Year 2022!
This is the first edition of EverythingPython of the year and I hope to continue publishing more editions of the newsletter with some gusto this year.
This time, I’m going to be talking about an amazing package called pandas-profiling. I use it a lot, I’ve contributed at least one PR to it, and I really like what they’ve done with it so far. So when I saw a lot of GitHub issues popping up of late on the official repository about errors during installation and getting started, I thought it would be useful to walk through the setup in a methodical manner, with (hopefully) almost no pitfalls.
But before I go into its installation, for those of you new to the package -
What is pandas-profiling?
At a high level, it’s a very useful package that summarizes pandas dataframes and describes them in more detail than the describe() function does. Going a little bit deeper, pandas-profiling can provide a lot of insights about a dataframe -
Missing values
Duplicate Rows
Unique values
Interactions of columns with each other
Correlations - Spearman's ρ, Pearson's r, Kendall's τ and Phik (φk) in particular
And a sample of what the dataframe contains.
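To make the list above a bit more concrete, here is a rough sketch of how a few of these numbers could be computed by hand with plain pandas. This is not how pandas-profiling computes them internally; the toy dataframe and the helper name `quick_summary` are made up purely for illustration, and only the Pearson and Spearman correlations are shown.

```python
import numpy as np
import pandas as pd


def quick_summary(df: pd.DataFrame) -> dict:
    """Hand-rolled versions of a few statistics from the list above (illustrative only)."""
    numeric = df.select_dtypes(include="number")
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "unique_per_column": df.nunique().to_dict(),
        "pearson": numeric.corr(method="pearson"),
        "spearman": numeric.corr(method="spearman"),
        "sample": df.head(3),
    }


# Hypothetical toy data: one missing value and one fully duplicated row.
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 2.0, 4.0, np.nan],
        "b": [10.0, 20.0, 20.0, 40.0, 50.0],
        "c": ["x", "y", "y", "z", "x"],
    }
)

summary = quick_summary(toy)
print(summary["missing_per_column"])
print(summary["duplicate_rows"])
print(summary["unique_per_column"])
print(summary["pearson"])
```

Writing this by hand for every dataset gets tedious quickly, which is where a single profile report starts to pay off.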
When this much information is available to us, it seems a shame to not use it in projects that involve examining datasets and subsequent analysis of them.
So, let’s see how to set it up, which is the scope of this blogpost. Later blogposts will dive deeper into the usage of pandas-profiling.
The Setup
The overall setup and check basically involves 4 steps -
Install Anaconda
Create a virtual environment
Install pandas-profiling and JupyterLab in the new virtual environment
Open up a JupyterLab notebook and try to generate a Profile Report
While this seems simple on the face of it, the number of people who attempt to set it up directly in their base Python environments seems painfully high. And that’s a recipe for disaster. This setup assumes a Windows installation. If you want to set it up on WSL or Ubuntu, I’ll add links to that below.
So
Step 1 - Download and install Anaconda for your operating system from the official website - https://www.anaconda.com/products/individual . The installer offers an option to add conda to your PATH. Select it even though the installer marks it as not recommended.
Step 2 - From the Anaconda prompt application, create a new virtual environment like so -
conda create --name <nameOfEnvironment>
Activate the created environment -
conda activate <nameOfEnvironment>
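Since installing into the base environment is the pitfall this post is trying to avoid, it can help to double-check which environment is actually active before installing anything. Below is a minimal sketch, assuming conda sets the `CONDA_DEFAULT_ENV` variable on activation; the helper names `active_conda_env` and `is_safe_to_install` are hypothetical. You can paste it into a `python` session started from the same prompt.

```python
import os
import sys
from typing import Mapping, Optional


def active_conda_env(environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the active conda environment, or None if there is none."""
    environ = os.environ if environ is None else environ
    # conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return environ.get("CONDA_DEFAULT_ENV") or None


def is_safe_to_install(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True when a conda environment other than 'base' is active."""
    name = active_conda_env(environ)
    return name is not None and name != "base"


print("Python executable:", sys.executable)
print("Active conda environment:", active_conda_env())
print("OK to install here:", is_safe_to_install())
```

If the last line prints `False`, activate the new environment again before moving on to the next step.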
Step 3 - Install pandas-profiling and jupyter-lab
conda install -c conda-forge pandas-profiling
pip install jupyterlab
Now JupyterLab might later complain that a Profile Report cannot be generated, so you want to install `ipywidgets` as well -
pip install ipywidgets
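Before opening a notebook, you may want to confirm that all three packages landed in the active environment. Here is a small sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is made up for this post, and it assumes the packages expose their usual distribution names.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


for name in ("pandas-profiling", "jupyterlab", "ipywidgets"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

Anything reported as `NOT INSTALLED` most likely went into a different environment, so check that the right one is activated and run the install command again.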
Step 4 - Launch JupyterLab with `jupyter lab`, open a new notebook, and verify that your installation was successful:
In the JupyterLab notebook, try the example from the official documentation -
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
profile = ProfileReport(df, title="Pandas Profiling Report")
profile
If everything was fine with the installation, you should see a report being generated, culminating in a full report with details about your dataframe.
And voila! You’re now ready to begin using pandas-profiling.
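If the inline report does not render in your notebook, or you want to share the result with someone, the report can also be written to a standalone HTML file. Here is a minimal sketch, assuming the same `pandas_profiling` import as above; `make_demo_frame` and `export_report` are hypothetical helper names, and the seeded random data is only there so that reruns produce the same dataframe.

```python
import numpy as np
import pandas as pd


def make_demo_frame(rows: int = 100, seed: int = 0) -> pd.DataFrame:
    """Same shape as the example above, but seeded so reruns give the same data."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.random((rows, 5)), columns=["a", "b", "c", "d", "e"])


def export_report(df: pd.DataFrame, path: str = "report.html") -> str:
    """Write a profile report for df to an HTML file and return the path."""
    # Imported lazily so this helper can be defined before pandas-profiling is installed.
    from pandas_profiling import ProfileReport

    profile = ProfileReport(df, title="Pandas Profiling Report")
    profile.to_file(path)
    return path


demo = make_demo_frame()
print(demo.shape)
```

In a notebook cell, calling `export_report(demo)` should write `report.html` next to the notebook, which you can then open in any browser.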
If learning from a YouTube video is more your thing, I’ve decided to start publishing a YouTube version of these posts as well (in addition to other Python-ic information). Please say hello to the first video of EverythingPython and consider Liking and Sharing the video and Subscribing to the channel ;)
Resources -
Installing Anaconda on Windows - https://www.anaconda.com/products/individual
Handy conda command list for managing environments - https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
Installing jupyter-lab - https://jupyter.org/install
Installing Anaconda on WSL - https://www.emilykauffman.com/teaching/install-anaconda-on-wsl
Installing Anaconda on Ubuntu - https://www.digitalocean.com/community/tutorials/how-to-install-the-anaconda-python-distribution-on-ubuntu-20-04