I have a file, like so:
ID,CLASS_ID,COURSE_FEE,COURSE_ID
1,9XG,500,3
2,07CWG,1000,6
3,9XG,150,6
4,07CWG,1500,6
5,016IRE,80,9
6,9XG,100,10
7,07CWG,500,6
For each CLASS_ID, I would like to fetch the row with the greatest COURSE_ID, along with its COURSE_FEE. Also, if a CLASS_ID has the same COURSE_ID on multiple rows, the sum of the corresponding COURSE_FEEs needs to be fetched instead.
So the expected output would be something like this:
ID,CLASS_ID,COURSE_ID,COURSE_FEE
6,9XG,10,100
5,016IRE,9,80
2,07CWG,6,3000
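To make the rule concrete: sum COURSE_FEE over rows that share the same (CLASS_ID, COURSE_ID) pair (keeping the first ID of each group, which is what the sample output suggests), then for each CLASS_ID keep the pair with the largest COURSE_ID. A minimal in-memory sketch of just that logic:

import pandas as pd

df = pd.read_csv('Inpt.txt', dtype={'CLASS_ID': str})

# Collapse duplicate (CLASS_ID, COURSE_ID) pairs, summing their fees
# and keeping the first (lowest) ID of each group.
agg = (df.groupby(['CLASS_ID', 'COURSE_ID'], as_index=False)
         .agg({'ID': 'min', 'COURSE_FEE': 'sum'}))

# Sorting by COURSE_ID descending means the first occurrence of each
# CLASS_ID is the one with its largest COURSE_ID.
out = (agg.sort_values('COURSE_ID', ascending=False)
          .drop_duplicates('CLASS_ID'))
print(out[['ID', 'CLASS_ID', 'COURSE_ID', 'COURSE_FEE']])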
This is the actual script I created for this:
import pandas as pd

df = pd.read_csv('Inpt.txt', dtype={'CLASS_ID': str})
# Sum the fees of rows that share the same (CLASS_ID, COURSE_ID) pair.
df['SUM'] = df.groupby(['CLASS_ID', 'COURSE_ID'])['COURSE_FEE'].transform('sum')
df = df.drop_duplicates(subset=['CLASS_ID', 'COURSE_ID'])
# Put the largest COURSE_ID first within each CLASS_ID.
df = df.sort_values(by=['CLASS_ID', 'COURSE_ID', 'SUM'], ascending=[False, False, False])
df = df.drop(['COURSE_FEE'], axis=1)
df = df.rename(columns={'SUM': 'COURSE_FEE'})
# Keep only the first row (i.e. the largest COURSE_ID) of each CLASS_ID.
df = df.groupby('CLASS_ID', as_index=False).nth(0)
df.to_csv('Op.txt', index=False)
The script gives me the expected output, but only for small inputs. My actual input file contains 200 million records, and on that the script throws an out-of-memory error.
Could any of you Python veterans help a newbie by suggesting a way to handle such a large input with this script?
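One direction I was wondering about is reading the file in chunks with pandas.read_csv(chunksize=...) and merging partial aggregates as I go, so that only the running totals per (CLASS_ID, COURSE_ID) pair have to stay in memory. Something like this untested sketch, where the chunk size is just a guess and which assumes the number of distinct (CLASS_ID, COURSE_ID) pairs is small enough to hold in memory:

import pandas as pd

totals = None  # running (CLASS_ID, COURSE_ID) -> first ID, summed fee

for chunk in pd.read_csv('Inpt.txt', dtype={'CLASS_ID': str},
                         chunksize=1_000_000):
    # Aggregate within the chunk first to shrink it.
    part = chunk.groupby(['CLASS_ID', 'COURSE_ID']).agg(
        {'ID': 'min', 'COURSE_FEE': 'sum'})
    if totals is None:
        totals = part
    else:
        # Merge partial aggregates: fees add up, IDs keep the minimum.
        totals = (pd.concat([totals, part])
                    .groupby(level=['CLASS_ID', 'COURSE_ID'])
                    .agg({'ID': 'min', 'COURSE_FEE': 'sum'}))

# For each CLASS_ID, keep the entry with the largest COURSE_ID.
result = (totals.reset_index()
                .sort_values('COURSE_ID', ascending=False)
                .drop_duplicates('CLASS_ID'))
result.to_csv('Op.txt', index=False,
              columns=['ID', 'CLASS_ID', 'COURSE_ID', 'COURSE_FEE'])

Is that the right direction, or would the running totals themselves still blow up for an input of this size? Thanks in advance.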