I've got access to a table with hundreds of millions of rows for any given month and three columns: a string representing a date, a string representing a type, and a numeric value representing an amount.
With access to Python and Impala (SQL), what's the best approach to calculating the median amount for each type in each month?
If I use a simple GROUP BY on the type and a substring of the date to get the year-month bucket (e.g. substring(date,1,7) on a 'YYYY-MM-DD' string), and use the APPX_MEDIAN function for the median, the Impala query just runs out of memory.
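For reference, what I'm running is essentially the sketch below, here through impyla (the host, table, and column names are placeholders for my actual setup):

```python
# Sketch of the failing approach via impyla (host, table, and column
# names are placeholders for my real setup).
from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)
cur = conn.cursor()

# Group by type and year-month, approximate median of the amount.
# The query itself runs out of memory on the Impala side.
cur.execute("""
    SELECT type,
           substring(dt, 1, 7) AS yr_month,  -- 'YYYY-MM' from a 'YYYY-MM-DD' string
           APPX_MEDIAN(amount) AS median_amount
    FROM my_table
    GROUP BY type, substring(dt, 1, 7)
""")
rows = cur.fetchall()
```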
If I instead export the raw data as a CSV (say with DBeaver), the file is huge: several GB, far too big to fit into the memory of the VM I have access to if I try to load it into a pandas DataFrame.
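For completeness, the pandas side is no more sophisticated than this (file and column names are placeholders); it's the read_csv step that exhausts the VM's memory:

```python
# What I'd like to do in pandas if the CSV actually fit in memory
# (file and column names are placeholders).
import pandas as pd

df = pd.read_csv("export.csv")     # several GB on disk; this is what blows up
df["yr_month"] = df["dt"].str[:7]  # assumes 'YYYY-MM-DD' date strings
medians = df.groupby(["type", "yr_month"])["amount"].median()
print(medians)
```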
I'm not familiar with big-data patterns, so any tips at all would be much appreciated. I'm struggling to perform what should be a simple calculation purely because of the sheer size of the data.