
I have a .txt file containing a large dataset (more than 90 million entries) in the following format:

Score Student Name
35 Lily
45 Rex
20 Cameron
45 Max
20 Jasmin

In the text file, the score and the name are separated by 2 spaces, and there is one score-name entry per line.

This .txt file cannot be loaded into memory all at once.

How can I obtain the top N highest scorers in Python?
Note: the value of N can be extremely large

Example:
So when N=2,

the output should be:
Rex
Max

Is there a way in Python to obtain the top N scorers directly, without saving the whole data again in another file format?

Which way is more efficient?
1.) read score entries one by one and keep only the largest N entries seen so far (roughly like the heap sketch below)?
2.) move all the data into a pandas DataFrame and use nlargest?
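
For reference, option 1 would be something along these lines (just a rough sketch; it assumes the two-space separator described above and a made-up filename scores.txt, and heapq.nlargest only ever keeps N entries in memory at once):

import heapq

def top_n_names(path, n):
    # Stream the file line by line, yielding (score, name) pairs
    def entries():
        with open(path) as f:
            next(f)  # skip the "Score  Student Name" header line
            for line in f:
                score, name = line.rstrip('\n').split('  ', 1)
                yield int(score), name
    # heapq.nlargest keeps a heap of at most n entries while scanning
    return [name for score, name in heapq.nlargest(n, entries())]

print(top_n_names('scores.txt', 2))  # ['Rex', 'Max'] for the sample above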

Drago Ram

1 Answer


To read the text file into a pandas DataFrame, see the answer linked here.
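A minimal read could look like this (a sketch assuming the two-space separator from the question and a made-up filename scores.txt; engine='python' is needed because the separator is longer than one character):

import pandas as pd

# header=0 skips the "Score  Student Name" line; names sets the column labels used below
df = pd.read_csv('scores.txt', sep='  ', engine='python',
                 header=0, names=['score', 'Student Name'])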
Then you can try using pandas nlargest. For example:

largest = df.nlargest(n, 'score')['Student Name']

You can also use NumPy's argsort on the score column:

import numpy as np
largest = df.iloc[np.argsort(-df['score'])[:n]]['Student Name']

Additionally, you can sort the DataFrame and take the top n rows like so:

largest = df.sort_values('score', ascending=False).iloc[:n]['Student Name']

Here is a runtime comparison for a DataFrame with 100 million records and n=1000000:

import numpy as np
import pandas as pd
from time import time

df = pd.DataFrame(np.random.randint(0, 100, size=(100000000, 2)),
                  columns=['score', 'Student Name'])
n = 1000000

start = time()
temp = df.nlargest(n, 'score')['Student Name']
print(time() - start)

start = time()
temp2 = df.iloc[np.argsort(-df['score'])[:n]]['Student Name']
print(time() - start)

start = time()
temp3 = df.sort_values('score', ascending=False).iloc[:n]['Student Name']
print(time() - start)

Results:

3.5889642238616943
13.237002849578857
19.69099760055542

So the most efficient way would be to use nlargest.
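
If the file really cannot fit in memory at once, one possibility (just a sketch, not benchmarked above, assuming the same scores.txt layout as before) is to read it in chunks and keep a running top n with nlargest:

import pandas as pd

n = 1000000
top = None
# chunksize keeps only about one million raw rows (plus the current top n) in memory at a time
for chunk in pd.read_csv('scores.txt', sep='  ', engine='python',
                         header=0, names=['score', 'Student Name'],
                         chunksize=1000000):
    pool = chunk if top is None else pd.concat([top, chunk])
    top = pool.nlargest(n, 'score')

largest = top['Student Name']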

Ofek Glick
  • Since the whole dataset cannot fit into memory, is it better to first convert the text file data into a pandas DataFrame and use nlargest, or is there a way to do it directly? – Drago Ram Sep 08 '21 at 07:38
  • `pandas` is not a file format but a Python object. Please provide more information about the problem you are having, what you have tried, and where you encountered trouble. – Ofek Glick Sep 08 '21 at 07:41
  • Also, since there are only 1 million entries in your text file, they should easily be loaded into a pandas DataFrame – Ofek Glick Sep 08 '21 at 07:52
  • One last thing: if you want an explanation of how to parse the .txt file into a pandas DataFrame, please provide more information about how it is saved – Ofek Glick Sep 08 '21 at 07:53
  • I updated the question; I actually have more than 90 million entries and they don't fit in my memory – Drago Ram Sep 08 '21 at 07:59
  • In the text file there is one entry per line; in each line the score comes first and the name is second (separated by 2 spaces) – Drago Ram Sep 08 '21 at 08:03
  • As you can see from my example, using nlargest on a DataFrame of 100 million rows with n of 1 million returned results in 3 seconds on my machine, so I would still recommend using it. – Ofek Glick Sep 08 '21 at 08:04
  • OK, thank you very much. Could you also explain how to parse the .txt file into a DataFrame? – Drago Ram Sep 08 '21 at 08:07
  • I added a link to a question that explains it. If my explanation answered your question please mark it as accepted. – Ofek Glick Sep 08 '21 at 08:08
  • What if the input data is given via stdin (in the same format)? Can you tell me how to parse such input into a DataFrame? – Drago Ram Sep 08 '21 at 08:13
  • That requires a different question, as it is not relevant here; please post a separate question. – Ofek Glick Sep 08 '21 at 08:18