
I've got access to some data that's hundreds of millions of rows for any given month. It has three features: a string representing a date, a string representing a type, and a numeric value representing an amount.

Having access to Python and Impala (SQL), what's the best approach to calculate the median over these millions of rows for each type in each month?

If I use a simple GROUP BY on the type and a substring of the date to get the month, e.g. substring(date,1,4), and use the APPX_MEDIAN function for the median, the Impala query just runs out of memory.
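For reference, this is roughly what that attempt looks like when issued from Python: a minimal sketch, assuming the impyla package and placeholder table/column names (my_table, event_date, event_type, amount); the substring(..., 1, 7) range assumes 'YYYY-MM-DD' date strings and pulls out the year-month.

    # Minimal sketch of the GROUP BY + APPX_MEDIAN attempt, issued from Python.
    # Table, column and connection details are placeholders.
    from impala.dbapi import connect

    QUERY = """
        SELECT substring(event_date, 1, 7) AS month,   -- year-month, assuming 'YYYY-MM-DD' strings
               event_type,
               APPX_MEDIAN(amount) AS approx_median_amount
        FROM my_table
        GROUP BY substring(event_date, 1, 7), event_type
    """

    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()
    cur.execute(QUERY)   # this is the statement that runs out of memory on the full table
    for month, event_type, approx_median in cur.fetchall():
        print(month, event_type, approx_median)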

If I try to pull the raw data down as a CSV (say with DBeaver), it's huge: GBs in size, too big to fit into the memory of the VM I have access to if I try to load it into a Python pandas DataFrame.
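For comparison, the pandas path looks roughly like this (again a sketch with placeholder file and column names); pandas.read_csv materialises the whole frame in memory, which is exactly where a multi-GB export falls over on a small VM.

    # Sketch of the in-memory pandas approach; file and column names are placeholders.
    # read_csv loads the entire file into RAM, which is where a multi-GB export fails.
    import pandas as pd

    df = pd.read_csv("export.csv", usecols=["event_date", "event_type", "amount"])
    df["month"] = df["event_date"].str[:7]          # assumes 'YYYY-MM-DD' strings
    medians = df.groupby(["month", "event_type"])["amount"].median()
    print(medians)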

I'm not familiar with patterns that deal with big data, so any tips at all would be very much appreciated. I'm struggling to perform a simple calculation due to the sheer size of the data.

Rob

1 Answer


You can try to increase the amount of memory Impala is allowed to use when executing the query by specifying SET MEM_LIMIT=Xg, where X is the memory limit in GB per Impala daemon. See https://impala.apache.org/docs/build/html/topics/impala_mem_limit.html for additional details.
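A minimal sketch of what that looks like from Python via impyla (the 10g value and connection details are placeholders, and the query is the same GROUP BY / APPX_MEDIAN aggregation from the question):

    # Sketch: raise the per-daemon memory limit for this session, then rerun the
    # aggregation. The 10g value and connection details are placeholders.
    from impala.dbapi import connect

    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()
    cur.execute("SET MEM_LIMIT=10g")
    cur.execute("""
        SELECT substring(event_date, 1, 7) AS month,
               event_type,
               APPX_MEDIAN(amount) AS approx_median_amount
        FROM my_table
        GROUP BY substring(event_date, 1, 7), event_type
    """)
    results = cur.fetchall()

Setting the option with SET this way only affects the current session.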

mazaneicha