
I am running into memory and performance issues when trying to draw a density plot with Python's plotnine.

Consider the dataset below, with 3 variables and 50,000 observations. This is not a large dataset. With n = 50,000, the plotnine code below took 15 minutes to run. In contrast, the equivalent R code ran in 0.22 seconds.

With n = 100000, I get the following error in plotnine:

MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64
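
As a quick sanity check on that figure, the size in the message matches a dense n × n matrix of float64 values, which suggests the density is being evaluated over every pair of observations. That reading is an inference from the array shape in the error, not something confirmed from the plotnine source. A minimal sketch of the arithmetic:

```python
# Memory needed for a dense n x n float64 array, as reported in the error.
n = 100_000
bytes_per_float64 = 8

bytes_needed = n * n * bytes_per_float64
gib = bytes_needed / 2**30  # 1 GiB = 2**30 bytes

print(f"{gib:.1f} GiB")  # 74.5 GiB, matching the MemoryError above
```

Because the requirement grows with n squared, halving n to 50,000 still needs roughly a quarter of that amount.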

Again, R executed the equivalent code in about 0.2 seconds.

Am I mis-specifying the plotnine code, or is this a known problem that will be fixed?

plotnine code:

import numpy as np
import pandas as pd
from plotnine import ggplot, aes, theme_light, geom_density

n = 100000

df = pd.DataFrame({
    'age': np.random.choice(range(20,66),n),
    'gender': np.random.choice(range(1,3),n),
    'variable': np.random.lognormal(0,0.5,n),
})

p = (ggplot(df, aes('variable'))
  + theme_light(7)
  + geom_density(alpha=0.5, size=0.35)
)
p

R code:

library(ggplot2)

n = 100000

df = data.frame(
        age = sample(20:66, n, replace=TRUE),  # seq(20:66) would sample from 1:47, not 20:66
        gender = sample(1:2, n, replace=TRUE),
        variable = rlnorm(n, meanlog=0, sdlog=0.5)
)

p = ggplot(df, aes(variable)) + 
      theme_light(7) +
      geom_density(alpha=0.5, size=0.35)
p
brb
  • This is a bug and will be fixed. For the type of density (Gaussian) being computed, it should be able to use a fast and space-efficient computation route (using FFT). – has2k1 Apr 23 '21 at 15:01
  • Thank you has2k1, appreciate you confirming. Can I also just say how much I appreciate all your hard work and effort - I love the package, it is tremendous!! – brb Apr 24 '21 at 12:07
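
Until a fix lands, one possible workaround is to compute the density outside plotnine and plot the result as a line or area. The sketch below illustrates the general idea behind the FFT route mentioned in the first comment: bin the data onto a fixed grid, then convolve the bin counts with a Gaussian kernel via FFT, so memory scales with the grid size rather than with n squared. It is not plotnine's implementation. The function name `fft_gaussian_kde`, the `gridsize` and `cut` parameters, and the choice of Silverman's rule-of-thumb bandwidth are all assumptions made for illustration.

```python
import numpy as np


def fft_gaussian_kde(x, gridsize=1024, cut=3.0):
    """Approximate Gaussian KDE via binning + FFT convolution (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = x.size

    # Silverman's rule-of-thumb bandwidth (an assumed choice for this sketch).
    sd = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    bw = 0.9 * min(sd, (q75 - q25) / 1.34) * n ** (-0.2)

    # Bin the data on a regular grid padded by `cut` bandwidths on each side.
    lo, hi = x.min() - cut * bw, x.max() + cut * bw
    counts, edges = np.histogram(x, bins=gridsize, range=(lo, hi))
    step = edges[1] - edges[0]
    centers = edges[:-1] + step / 2

    # Gaussian kernel sampled at grid offsets centred on zero
    # (truncated at half the grid width).
    offsets = (np.arange(gridsize) - gridsize // 2) * step
    kernel = np.exp(-0.5 * (offsets / bw) ** 2) / (bw * np.sqrt(2 * np.pi))

    # Zero-padded FFT gives a linear (non-circular) convolution.
    size = 2 * gridsize
    conv = np.fft.irfft(np.fft.rfft(counts, size) * np.fft.rfft(kernel, size), size)

    # Undo the kernel's centring shift and normalise by the sample size.
    density = conv[gridsize // 2 : gridsize // 2 + gridsize] / n
    return centers, density


rng = np.random.default_rng(0)
variable = rng.lognormal(0, 0.5, 100_000)
grid, dens = fft_gaussian_kde(variable)
```

The resulting `grid` and `dens` arrays can be placed in a small DataFrame and drawn with `geom_line` or `geom_area` instead of `geom_density`. Because of the binning, the curve is an approximation of the exact kernel estimate, so it may differ slightly from what R draws.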
