I have a .txt file containing a large dataset (more than 90 million entries) in the following format:
Score | Student Name |
---|---|
35 | Lily |
45 | Rex |
20 | Cameron |
45 | Max |
20 | Jasmin |
In the text file, the score and the name are separated by 2 spaces, and there is one score-name entry per line.
The .txt file is too large to be loaded into memory all at once.
How can I obtain the N highest scorers in Python?
Note: the value of N can be extremely large.
Example:
When N=2,
the output should be:
Rex
Max
Is there a way in Python to obtain the top N scorers directly, without first saving the whole dataset again in another file format?
Which approach is more efficient?
1.) Read the score entries one by one and keep/update only the N largest entries seen so far?
2.) Move all the data into pandas DataFrames and use nlargest?
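
To make option 1 concrete, here is a minimal sketch of what I have in mind, using the standard library's `heapq.nlargest` on a stream of lines. The helper names `parse_line` and `top_n_names` are my own, and the sketch assumes the file has no header row, that every score is an integer, and that ties should be returned in file order (as in the example above).

```python
import heapq


def parse_line(line):
    # Assumes "score<2 spaces>name", one entry per line, integer scores.
    score, name = line.rstrip("\n").split("  ", 1)
    return int(score), name.strip()


def top_n_names(lines, n):
    # Stream (score, -line_index, name) tuples; the negated index makes
    # earlier lines win ties, so equal scores come back in file order.
    entries = (
        (score, -idx, name)
        for idx, (score, name) in enumerate(
            parse_line(line) for line in lines if line.strip()
        )
    )
    # For a generator input, heapq.nlargest keeps at most n items in
    # memory and returns them sorted from largest to smallest.
    return [name for _score, _neg_idx, name in heapq.nlargest(n, entries)]


# Small in-memory stand-in for the real file, using the rows shown above.
sample = ["35  Lily", "45  Rex", "20  Cameron", "45  Max", "20  Jasmin"]
print(top_n_names(sample, 2))
```

For the real data I would pass an open file object instead of the list, since iterating over a file yields one line at a time. Memory use grows with N rather than with the file size, so this only helps as long as N entries still fit in memory.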