12

My data is not perfectly clean, but I can use it with pandas without issue. pandas-profiling provides many extremely useful functions for EDA.

But when I run profiling on a large dataset, i.e. 100 million records with 10 columns read from a database table, it does not complete and my laptop runs out of memory. The data is around 6 GB as CSV, my RAM is 14 GB, and idle usage is roughly 3-4 GB.

import pandas as pd
import pandas_profiling

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="myoutput.html")

I have also tried the check_recoded = False option, but it does not let the profiling complete either. Is there any way to read the data in chunks and still generate the summary report as a whole, or any other way to use this function with a large dataset?

Giorgos Myrianthous
Viv
  • You can read the data in chunks and use the server-side cursor technique to stream the data while keeping a fixed chunk size. For more details, see https://pythonspeed.com/articles/pandas-sql-chunking/ – Himanshu Singhal Nov 03 '21 at 08:41
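
A minimal sketch of the chunked, server-side-cursor read described in the comment above, assuming a SQLAlchemy engine (the connection URL, chunk size, and per-chunk sampling fraction are illustrative assumptions, not the asker's actual setup):

import pandas as pd
from sqlalchemy import create_engine
from pandas_profiling import ProfileReport

# Hypothetical connection URL; replace with your own database settings.
engine = create_engine("postgresql://user:password@host:5432/mydb")

# stream_results=True asks the driver for a server-side cursor, so rows
# are fetched in batches rather than all 100 million at once.
conn = engine.connect().execution_options(stream_results=True)

sampled_chunks = []
# chunksize makes read_sql_query yield DataFrames of 100,000 rows each.
for chunk in pd.read_sql_query("select * from table", conn, chunksize=100_000):
    # pandas-profiling cannot merge per-chunk reports, so keep only a small
    # random sample of each chunk; the concatenated sample then fits in RAM.
    sampled_chunks.append(chunk.sample(frac=0.01))

df_sample = pd.concat(sampled_chunks, ignore_index=True)
profile = ProfileReport(df_sample, minimal=True)  # minimal=True needs v2.4+ (see answer below)
profile.to_file(output_file="myoutput.html")

Note that this profiles a sample rather than the full 100 million rows; as far as I know, pandas-profiling has no built-in way to combine reports from separate chunks.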

5 Answers

11

pandas-profiling v2.4 introduced the minimal mode, which disables expensive computations (such as correlations and dynamic binning):

from pandas_profiling import ProfileReport


profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="output.html")
Giorgos Myrianthous
  • Although `minimal=True` helps avoid expensive computations, I think the author of the question is asking how to read the data in chunks and then generate the summary report as a whole. I am also stuck on this, needing to profile multiple chunked pieces of the data frame and produce a final report. Any help would be appreciated. – Himanshu Singhal Nov 03 '21 at 08:38
5

The syntax for disabling the calculation of correlations (thereby heavily reducing the amount of computation) has changed considerably between pandas-profiling==1.4 and the current (beta) version pandas-profiling==2.0; it is now the following:

profile = df.profile_report(correlations={
    "pearson": False,
    "spearman": False,
    "kendall": False,
    "phi_k": False,
    "cramers": False,
    "recoded": False,
})

Additionally, you can reduce the amount of computation by disabling the bin calculation used for plotting histograms.

profile = df.profile_report(plot={'histogram': {'bins': None}})
cptnJ
1

Did you try the option below? Running correlation analysis on large free-text fields with pandas-profiling can cause this issue.

import pandas as pd
import pandas_profiling

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df, check_correlation=False)

Please refer to the GitHub issue below for more details: https://github.com/pandas-profiling/pandas-profiling/issues/84

Ashutosh Kumar
  • Correlation is one of the features I want to keep. A chunk-and-read option would have been great, but I see that's not available yet. – Viv Jun 11 '19 at 09:34
  • Yes, this option is currently not available; hopefully it will appear in the next version of pandas-profiling. For now it is more of a "report" for a given dataset/dataframe. – Ashutosh Kumar Jun 11 '19 at 10:03
0

Another option is to reduce the data.

One way to do this is to take a random sample with sample:

df.sample(number)

More details are in the pandas documentation.
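
For instance, a rough sketch of sampling before profiling (the 1% fraction, minimal=True, and the output file name are illustrative assumptions):

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_sql_query("select * from table", conn_params)

# Profile a 1% random sample instead of all 100 million rows;
# random_state makes the sample reproducible.
sample_df = df.sample(frac=0.01, random_state=42)

profile = ProfileReport(sample_df, minimal=True)
profile.to_file(output_file="sample_output.html")

A random sample usually preserves the column distributions well enough for an exploratory report, though very rare categories may not appear.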

-2

The ability to disable the correlation check was added with the implementation of issue #43, which is not part of the latest version of pandas-profiling (1.4) available on PyPI. It was implemented afterwards and will, I guess, be available in the next release. In the meantime, if you really need it, you can download the current version from GitHub and use it, for example, by adding it to your PYTHONPATH.

#!/bin/sh

PROF_DIR="$HOME/Git/pandas-profiling/"

export PYTHONPATH="$PYTHONPATH:$PROF_DIR"

jupyter notebook

Community