12

My data is not perfectly clean, but I can use it with pandas without issue. pandas-profiling provides many extremely useful functions for EDA.

But when I run profiling on a large dataset, i.e. 100 million records with 10 columns read from a database table, it does not complete and my laptop runs out of memory. The data is around 6 GB as CSV, my RAM is 14 GB, and idle usage is roughly 3-4 GB.

import pandas as pd
import pandas_profiling

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="myoutput.html")

I have also tried the check_recoded = False option, but it does not let the profiling complete either. Is there any way to read the data in chunks and still generate the summary report as a whole, or any other way to use this function with a large dataset?

Giorgos Myrianthous
Viv
  • You can read the data in chunks and use the server-side cursor technique to stream the data while keeping a fixed chunk size. For more details, see https://pythonspeed.com/articles/pandas-sql-chunking/ – Himanshu Singhal Nov 03 '21 at 08:41
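
A minimal sketch of the chunked, server-side-cursor read described in the comment above, assuming a SQLAlchemy engine (the connection URL, chunk size, and per-chunk sampling fraction are illustrative assumptions, not the asker's actual setup):

import pandas as pd
from sqlalchemy import create_engine
from pandas_profiling import ProfileReport

# Hypothetical connection URL; replace with your own database settings.
engine = create_engine("postgresql://user:password@host:5432/mydb")

# stream_results=True asks the driver for a server-side cursor, so rows
# are fetched in batches rather than all 100 million at once.
conn = engine.connect().execution_options(stream_results=True)

sampled_chunks = []
# chunksize makes read_sql_query yield DataFrames of 100,000 rows each.
for chunk in pd.read_sql_query("select * from table", conn, chunksize=100_000):
    # pandas-profiling cannot merge per-chunk reports, so keep only a small
    # random sample of each chunk; the concatenated sample then fits in RAM.
    sampled_chunks.append(chunk.sample(frac=0.01))

df_sample = pd.concat(sampled_chunks, ignore_index=True)
profile = ProfileReport(df_sample, minimal=True)  # minimal=True needs v2.4+ (see answer below)
profile.to_file(output_file="myoutput.html")

Note that this profiles a sample rather than the full 100 million rows; as far as I know, pandas-profiling has no built-in way to combine reports from separate chunks.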

5 Answers

11

pandas-profiling v2.4 introduced the minimal mode, which disables expensive computations (such as correlations and dynamic binning):

from pandas_profiling import ProfileReport


profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="output.html")
Giorgos Myrianthous
  • Although `minimal=True` helps avoid expensive computations, I think the author of the question is asking how to read the data in chunks and then generate the summary report as a whole. I am also stuck on this, needing to profile multiple chunked pieces of the data frame and produce a final report. Any help would be appreciated. – Himanshu Singhal Nov 03 '21 at 08:38
5

The syntax for disabling the calculation of correlations (thereby heavily reducing the amount of computation) has changed considerably between pandas-profiling==1.4 and the current (beta) version pandas-profiling==2.0; it is now the following:

profile = df.profile_report(correlations={
    "pearson": False,
    "spearman": False,
    "kendall": False,
    "phi_k": False,
    "cramers": False,
    "recoded": False,
})

Additionally, you can reduce the amount of computation by disabling the bin calculation used for plotting histograms.

profile = df.profile_report(plot={'histogram': {'bins': None}})
cptnJ
1

Did you try the option below? Running correlation analysis on large free-text fields with pandas-profiling can cause this issue.

import pandas as pd
import pandas_profiling

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df, check_correlation=False)

Please refer to the GitHub issue below for more details: https://github.com/pandas-profiling/pandas-profiling/issues/84

Ashutosh Kumar
  • Correlation is one of the features I want to keep. A chunk-and-read option would have been great, but I see that's not available yet. – Viv Jun 11 '19 at 09:34
  • Yes, this option is currently not available; hopefully it will appear in the next version of pandas-profiling. For now it is more of a "report" for a given dataset/dataframe. – Ashutosh Kumar Jun 11 '19 at 10:03
0

Another option is to reduce the data.

One way to do this is to take a random sample with sample:

df.sample(number)

More details are in the pandas documentation.
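
For instance, a rough sketch of sampling before profiling (the 1% fraction, minimal=True, and the output file name are illustrative assumptions):

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_sql_query("select * from table", conn_params)

# Profile a 1% random sample instead of all 100 million rows;
# random_state makes the sample reproducible.
sample_df = df.sample(frac=0.01, random_state=42)

profile = ProfileReport(sample_df, minimal=True)
profile.to_file(output_file="sample_output.html")

A random sample usually preserves the column distributions well enough for an exploratory report, though very rare categories may not appear.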

-2

The ability to disable the correlation check was added with the implementation of issue #43, which is not part of the latest version of pandas-profiling (1.4) available on PyPI. It was implemented afterwards and will, I guess, be available in the next release. In the meantime, if you really need it, you can download the current version from GitHub and use it, for example, by adding it to your PYTHONPATH.

#!/bin/sh

PROF_DIR="$HOME/Git/pandas-profiling/"

export PYTHONPATH="$PYTHONPATH:$PROF_DIR"

jupyter notebook

Community