I have a very large file (10 GB) with approximately 400 billion lines, it is a csv with 4 fields. Here the description, the first field is an ID and the second is a current position of the ID, the third field is a correlative number assigned to that row. Similar to this:
41545496|4154|1
10546767|2791|2
15049399|491|3
38029772|491|4
15049399|1034|5
My intention is to create a fourth column (old position) in another file or the same, where the previous position in which your ID was stored is stored, what I do is verify if the ID number has already appeared before, I look for its last appearance and assigned to his field of old position, the position he had in the last appearance. If the ID has not appeared before, then I assign to its old position the current position it has in that same row. Something like this:
41545496|4154|1|4154
10546767|2791|2|2791
15049399|491|3|491
38029772|491|4|491
15049399|1034|5|491
I have created a program that does the reading and analysis of the file but performs a reading of 10 thousand lines every minute, so it gives me a very high time to read the entire file, more than 5 days approximately.
import pandas as pd
with open('file_in.csv', 'rb') as inf:
df = pd.read_csv(inf, sep='|', header=None)
cont = 0
df[3] = 0
def test(x):
global cont
a = df.iloc[:cont,0]
try:
index = a[a == df[0][cont]].index[-1]
df[3][cont] = df[1][index]
except IndexError:
df[3][cont] = df[1][cont]
pass
cont+= 1
df.apply(test, axis=1)
df.to_csv('file_out.csv', sep='|', index=False, header=False)
I have a computer 64 processors with 64 GB of RAM in the university but still it's a long time, is there any way to reduce that time? thank you very much!