I have a file like this:

ID,CLASS_ID,COURSE_FEE,COURSE_ID
1,9XG,500,3
2,07CWG,1000,6
3,9XG,150,6
4,07CWG,1500,6
5,016IRE,80,9
6,9XG,100,10
7,07CWG,500,6

I would like to get, for each CLASS_ID, the row with the greatest COURSE_ID together with its COURSE_FEE. If a CLASS_ID has several rows with that same COURSE_ID, the corresponding COURSE_FEEs need to be summed.

So the expected output would be something like this (for 07CWG all three rows share COURSE_ID 6, so the fees 1000 + 1500 + 500 are summed to 3000):

ID,CLASS_ID,COURSE_ID,COURSE_FEE
6,9XG,10,100
5,016IRE,9,80
2,07CWG,6,3000

This is the script I created for this:

import pandas as pd

# CLASS_ID is read as a string so IDs like 016IRE keep their leading zero
df = pd.read_csv('Inpt.txt', dtype={'CLASS_ID': str})
# Counts occurrences of CLASS_ID within each (CLASS_ID, COURSE_ID) group; dropped below without being used
df['COURSE_ID_AGG'] = df.groupby(['CLASS_ID', 'COURSE_ID'])['CLASS_ID'].transform(pd.Series.value_counts)
# Total COURSE_FEE per (CLASS_ID, COURSE_ID) group
df['SUM'] = df.groupby(['CLASS_ID', 'COURSE_ID'])['COURSE_FEE'].transform('sum')
# Keep one representative row per (CLASS_ID, COURSE_ID)
df = df.drop_duplicates(subset=['CLASS_ID', 'COURSE_ID'])
# Sort so the highest COURSE_ID comes first within each CLASS_ID
df = df.sort_values(by=['CLASS_ID', 'COURSE_ID', 'SUM'], ascending=[False, False, False])
df.drop(['COURSE_FEE', 'COURSE_ID_AGG'], axis=1, inplace=True)
df = df.rename(columns={'SUM': 'COURSE_FEE'})
# Take the first row (highest COURSE_ID) of each CLASS_ID
df = df.groupby('CLASS_ID', as_index=False).nth(0)
df.to_csv('Op.txt', index=False)

The script gives me the expected output, but only for small inputs.

My input file actually contains 200 million records, and because of that the script fails with an out-of-memory error.

Could any of you Python veterans suggest a way for a newbie to handle such a large input with this script? Thanks in advance.

Marek
  • pandas cannot handle 200 million records. Not unless you do one chunk at a time. – cs95 May 29 '18 at 05:52
  • pd.read_csv reads the entire dataset into memory. Since this dataset does not fit into memory, you could try reading the data in chunks as suggested in [this post](https://stackoverflow.com/questions/46247560/how-to-read-and-write-files-from-disk-using-the-pandas-dataframe?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa) – PyRsquared May 29 '18 at 05:56
  • @coldspeed: Could you explain to me how I can do that and how to handle the groupbys on these chunks? I'm very new to programming, so please bear with me. – Marek May 29 '18 at 06:03
  • See here: https://stackoverflow.com/a/14268804/4909087 – cs95 May 29 '18 at 06:03
  • https://stackoverflow.com/questions/50051210/avoiding-memory-issues-for-groupby-on-large-pandas-dataframe – illusionx Jul 21 '20 at 15:03
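A minimal sketch of the chunked-read approach the comments suggest, assuming pandas 0.25+ for named aggregation, an illustrative chunk size, and that the distinct (CLASS_ID, COURSE_ID) pairs fit in memory; the smallest ID per group is kept as the representative row, which matches the sample output above:

import pandas as pd

CHUNK_SIZE = 1_000_000  # illustrative; tune to the memory available

# First pass: reduce each chunk to one row per (CLASS_ID, COURSE_ID)
partials = []
for chunk in pd.read_csv('Inpt.txt', dtype={'CLASS_ID': str}, chunksize=CHUNK_SIZE):
    partials.append(
        chunk.groupby(['CLASS_ID', 'COURSE_ID'], as_index=False)
             .agg(ID=('ID', 'min'), COURSE_FEE=('COURSE_FEE', 'sum'))
    )

# A group can span several chunks, so aggregate the partial results again
agg = (pd.concat(partials)
         .groupby(['CLASS_ID', 'COURSE_ID'], as_index=False)
         .agg(ID=('ID', 'min'), COURSE_FEE=('COURSE_FEE', 'sum')))

# For each CLASS_ID keep the row with the highest COURSE_ID
agg = agg.sort_values(['CLASS_ID', 'COURSE_ID'], ascending=[False, False])
result = agg.groupby('CLASS_ID', as_index=False).first()

result[['ID', 'CLASS_ID', 'COURSE_ID', 'COURSE_FEE']].to_csv('Op.txt', index=False)

Each chunk is reduced to one row per (CLASS_ID, COURSE_ID) before the next chunk is read, so only the per-chunk aggregates have to fit in memory, not the raw rows.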
