
I have a very large file (10 GB) with approximately 400 billion lines; it is a pipe-delimited CSV with three fields. The first field is an ID, the second is the current position of that ID, and the third is a sequential number assigned to that row. Similar to this:

41545496|4154|1
10546767|2791|2
15049399|491|3
38029772|491|4
15049399|1034|5

My intention is to create a fourth column (old position), in another file or the same one, that stores the position the ID had the last time it appeared. For each row I check whether the ID has already appeared earlier; if it has, I look up its last appearance and assign the position it had there to the old-position field. If the ID has not appeared before, I assign its current position in that same row as its old position. Something like this:

41545496|4154|1|4154
10546767|2791|2|2791
15049399|491|3|491
38029772|491|4|491
15049399|1034|5|491
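
In other words, the rule I want is the one in this small sketch (only an illustration of the logic, not my actual program):

last_pos = {}                      # ID -> position at its last appearance
for line in open('file_in.csv'):
    id_, pos, num = line.rstrip('\n').split('|')
    old = last_pos.get(id_, pos)   # previous position, or the current one if new
    print('|'.join((id_, pos, num, old)))
    last_pos[id_] = pos            # remember this position for the next time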

I have created a program that reads and analyses the file, but it only processes about 10 thousand lines per minute, so reading the entire file takes a very long time, approximately more than 5 days.

import pandas as pd

with open('file_in.csv', 'rb') as inf:
    df = pd.read_csv(inf, sep='|', header=None)

cont = 0     # index of the row currently being processed
df[3] = 0    # new column that will hold the old position

def test(x):
    global cont

    # IDs of all rows processed so far
    a = df.iloc[:cont, 0]

    try:
        # last row in which the current ID appeared before
        index = a[a == df[0][cont]].index[-1]
        df[3][cont] = df[1][index]
    except IndexError:
        # first appearance: old position = current position
        df[3][cont] = df[1][cont]
    cont += 1

df.apply(test, axis=1)

df.to_csv('file_out.csv', sep='|', index=False, header=False)

I have access to a university computer with 64 processors and 64 GB of RAM, but it still takes a very long time. Is there any way to reduce that time? Thank you very much!

  • just a note, you don't have to open the file to let pandas process it. – Ignacio Vergara Kausel Feb 27 '18 at 14:55
  • are you on a 32-bit version of Python or 64? – MattR Feb 27 '18 at 14:56
  • Python 64-bit version @MattR – Wolf Feb 27 '18 at 14:58
  • Have you tried `dask.dataframe`? This enables you to use many `pandas`-type functions out-of-memory, then bring back only your results into memory. – jpp Feb 27 '18 at 15:02
  • @jpp I also thought about that, but I don't think it'll work, since his problem doesn't chunk nicely because the rows are not independent. You'd have to find a way to calculate the offsets when rejoining the chunks. – Ignacio Vergara Kausel Feb 27 '18 at 15:04
  • 2
    My suggestion, then, would be to avoid pandas altogether, and create a HDF5 file out of this. Then perform all your calculations out-of-core using `h5py` and `numpy` libraries. The file size after compression should shrink to smaller than half the csv size. – jpp Feb 27 '18 at 15:08
  • Parsing text files is always slow (a few MB/s is not a bad speed for text parsing). A small example of how to get a few hundred MB/s: https://stackoverflow.com/a/48997927/4045774 – max9111 Feb 27 '18 at 19:04
  • 1
    I am working on an answer. The lookup for the last appearance can be improved with a stable argsort. – max9111 Feb 27 '18 at 19:33

1 Answer


Processing the data efficiently

You have two main problems in your approach:

  • That amount of data should have never been written to a text file
  • Your approach needs (n^2/2) comparisons

A better idea is to index-sort your array before doing the actual work. Then you need only about 2n comparisons and n*log(n) operations for the sort in the worst case.

I also used Numba to compile the function, which speeds up the computation by a factor of 100 or more.

import numpy as np
import numba as nb

#This function isn't very hard to vectorize, but I expect better
#performance and easier understanding when doing it this way
@nb.njit()
def last_IDs(data, idx_1):
    #I assume that all values in the second column are positive
    res = np.zeros(data.shape[0], dtype=np.int64) - 1
    for i in range(1, data.shape[0]):
        if data[idx_1[i], 0] == data[idx_1[i-1], 0]:
            #the ID appeared before: take the position from its previous row
            res[idx_1[i]] = data[idx_1[i-1], 1]

    #rows still at -1 are the first appearance of their ID
    same_ID = res == -1
    res[same_ID] = data[same_ID, 1]

    return res

#the hardest thing to do efficiently
data = np.genfromtxt('Test.csv', delimiter='|', dtype=np.int64)

#it is important that we use a stable sort algorithm here
idx_1 = np.argsort(data[:, 0], kind='mergesort')
column_4 = last_IDs(data, idx_1)
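
To complete the picture, a minimal sketch of how the new column could be appended and written back out (np.savetxt is convenient but slow for files of this size; see the link below for faster binary formats):

out = np.column_stack((data, column_4))   #append the old-position column
np.savetxt('file_out.csv', out, delimiter='|', fmt='%d')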

For performant writing and reading of data, have a look at: https://stackoverflow.com/a/48997927/4045774

If you don't get at least 100 MB/s I/O speed, please ask.
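
As a rough sketch of the kind of binary storage meant in that link (assuming h5py is available; 'data.h5' and 'rows' are just placeholder names):

import h5py

#write the integer matrix into a compressed, chunked HDF5 dataset
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('rows', data=data, compression='gzip', chunks=True)

#reading this back is much faster than parsing the csv again
with h5py.File('data.h5', 'r') as f:
    data = f['rows'][:]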
