
I have a .txt file containing a large dataset (more than 90 million entries) in the following format:

Score Student Name
35 Lily
45 Rex
20 Cameron
45 Max
20 Jasmin

In the text file, the score and the name are separated by 2 spaces, and there is one score-name entry per line.

This .txt file cannot be loaded into memory all at once.

How can I obtain the top N highest scorers in Python?
Note: the value of N can be extremely large

Example:
So when N=2,

the output should be:
Rex
Max

Is there a way in Python to obtain the top N scorers directly, without saving the whole data again in another file format?

Which way is more efficient?
1.) read score entries one by one and keep only the largest N entries seen so far (roughly like the heap sketch below)?
2.) move all the data into a pandas DataFrame and use nlargest?
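
For reference, option 1 would be something along these lines (just a rough sketch; it assumes the two-space separator described above and a made-up filename scores.txt, and heapq.nlargest only ever keeps N entries in memory at once):

import heapq

def top_n_names(path, n):
    # Stream the file line by line, yielding (score, name) pairs
    def entries():
        with open(path) as f:
            next(f)  # skip the "Score  Student Name" header line
            for line in f:
                score, name = line.rstrip('\n').split('  ', 1)
                yield int(score), name
    # heapq.nlargest keeps a heap of at most n entries while scanning
    return [name for score, name in heapq.nlargest(n, entries())]

print(top_n_names('scores.txt', 2))  # ['Rex', 'Max'] for the sample above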

Drago Ram

1 Answer


To read the text file into a pandas DataFrame, see the answer linked here.
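A minimal read could look like this (a sketch assuming the two-space separator from the question and a made-up filename scores.txt; engine='python' is needed because the separator is longer than one character):

import pandas as pd

# header=0 skips the "Score  Student Name" line; names sets the column labels used below
df = pd.read_csv('scores.txt', sep='  ', engine='python',
                 header=0, names=['score', 'Student Name'])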
Then you can try using pandas nlargest. For example:

largest = df.nlargest(n, 'score')['Student Name']

You can also use NumPy's argsort on the score column:

import numpy as np
largest = df.iloc[np.argsort(-df['score'])[:n]]['Student Name']

Additionally, you can sort the DataFrame and take the top n rows like so:

largest = df.sort_values('score', ascending=False).iloc[:n]['Student Name']

Here is a runtime comparison for a DataFrame with 100 million records and n=1000000:

import numpy as np
import pandas as pd
from time import time

df = pd.DataFrame(np.random.randint(0, 100, size=(100000000, 2)),
                  columns=['score', 'Student Name'])
n = 1000000

start = time()
temp = df.nlargest(n, 'score')['Student Name']
print(time() - start)

start = time()
temp2 = df.iloc[np.argsort(-df['score'])[:n]]['Student Name']
print(time() - start)

start = time()
temp3 = df.sort_values('score', ascending=False).iloc[:n]['Student Name']
print(time() - start)

Results:

3.5889642238616943
13.237002849578857
19.69099760055542

So the most efficient way would be to use nlargest.
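
If the file really cannot fit in memory at once, one possibility (just a sketch, not benchmarked above, assuming the same scores.txt layout as before) is to read it in chunks and keep a running top n with nlargest:

import pandas as pd

n = 1000000
top = None
# chunksize keeps only about one million raw rows (plus the current top n) in memory at a time
for chunk in pd.read_csv('scores.txt', sep='  ', engine='python',
                         header=0, names=['score', 'Student Name'],
                         chunksize=1000000):
    pool = chunk if top is None else pd.concat([top, chunk])
    top = pool.nlargest(n, 'score')

largest = top['Student Name']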

Ofek Glick
  • Since the whole dataset cannot fit into memory, is it better to first convert the text file data into a pandas DataFrame and use nlargest, or is there a way to do it directly? – Drago Ram Sep 08 '21 at 07:38
  • `pandas` is not a file format but a Python object. Please provide more information about the problem you are having, what you have tried, and where you encountered trouble. – Ofek Glick Sep 08 '21 at 07:41
  • Also, since there are only 1 million entries in your text file, they should easily be loaded into a pandas DataFrame – Ofek Glick Sep 08 '21 at 07:52
  • One last thing: if you want an explanation of how to parse the .txt file into a pandas DataFrame, please provide more information about how it is saved – Ofek Glick Sep 08 '21 at 07:53
  • I updated the question; I actually have more than 90 million entries and they don't fit in my memory – Drago Ram Sep 08 '21 at 07:59
  • In the text file there is one entry per line; in each line the score comes first and the name is second (separated by 2 spaces) – Drago Ram Sep 08 '21 at 08:03
  • As you can see from my example, using nlargest on a DataFrame of 100 million rows with n of 1 million returned results in 3 seconds on my machine, so I would still recommend using it. – Ofek Glick Sep 08 '21 at 08:04
  • OK, thank you very much. Could you also explain how to parse the .txt file into a DataFrame? – Drago Ram Sep 08 '21 at 08:07
  • I added a link to a question that explains it. If my explanation answered your question please mark it as accepted. – Ofek Glick Sep 08 '21 at 08:08
  • What if the input data is given via stdin (in the same format)? Can you tell me how to parse such input into a DataFrame? – Drago Ram Sep 08 '21 at 08:13
  • That requires a different question, as it is not relevant here; please post a separate question. – Ofek Glick Sep 08 '21 at 08:18