I can think of a few approaches to do this.
The first one is the most applicable to this use case, but I'm adding three more for related situations.
1. Python (matplotlib + numpy numeric arrays)
If you want to stick with Python, you can follow hitzg's answer. But there are a few critical details to take into consideration. Once you generate the first boxplot, you don't need that data anymore, so ensure you free up that memory. Adapting the other answer, the code looks like this:
import matplotlib.cbook as cbook
import matplotlib.pyplot as plt
import numpy as np
x = np.random.rand(10,10)
y = np.random.rand(10,10)
z = np.random.rand(10,10)
fig, ax = plt.subplots(1,1)
bxpstats = list()
for dataset, label in zip([x, y, z], ['X', 'Y', 'Z']):
bxpstats.extend(cbook.boxplot_stats(np.ravel(dataset), labels=[label]))
# free up the memory
del dataset
ax.bxp(bxpstats)
plt.show()
If you are using numeric numpy arrays, using del
will release the memory. However, this won't work if using numpy objects or pandas data frames (explanation), see next options for alternatives.
2. Python (matplotlib + pandas data frames)
If you're using pandas data frames. Then using del data_frame
, won't release the memory. However, you can compute the boxplot statistics and store them (e.g. in a JSON file) and then kill the process, to ensure the memory is released. You can compute the statistics with matplotlib.cbook.boxplot_stats
, store in JSON, load the JSON files in a new process, and use bxp
to plot. Something like this:
python boxplot-stats.py --path some_data.csv
python boxplot-stats.py --path more_data.csv
python plot.py --path some_data.csv --path more_data.csv
(of course, you'd need to write the command-line interface to make it work)
3. Python (JupySQL) - easiest option if data is in CSV or parquet format
If your data is in .csv
, or .parquet
format (or you can convert it), you can use JupySQL; which has a plotting module that leverages SQL engines to efficiently compute statistics for plotting boxplots and histograms (example here, and here). Under the hood, it can use DuckDB to compute the statistics, and then passes them to matplotlib for plotting (without having to load all your data into memory!).
Code looks like this:
from sqlalchemy import create_engine
from sql import plot
conn = create_engine('duckdb:///')
plot.boxplot('path/to/data.parquet', 'column_to_plot', conn)
Note that you need these packages:
pip install jupysql duckdb duckdb-engine pyarrow
4. DuckDB + Python
Finally, you can use DuckDB directly, this will give you more flexibility, but you'll have to implement quite a few things. For a basic boxplot, all you need are quantiles, which you can quickly compute from DuckDB; here's a template you can use (just substitute the {{placeholders}}
):
SELECT
percentile_disc(0.25) WITHIN GROUP (ORDER BY "{{column}}") AS q1,
percentile_disc(0.50) WITHIN GROUP (ORDER BY "{{column}}") AS med,
percentile_disc(0.75) WITHIN GROUP (ORDER BY "{{column}}") AS q3,
AVG("{{column}}") AS mean,
COUNT(*) AS N
FROM "{{path/to/data.parquet}}"
To create a complete boxplot, you need a few more statistics. To know exactly which ones and how to compute them, you can use matplotlib's boxplot_stats
as reference, then compute the aggregations with DuckDB and the rest in Python, then pass that to matplotlib's bxp
function. This is actually how JupySQL works, you can use the implementation as reference.