I have a text file of several GB with this format:
0 274 593869.99 6734999.96 121.83 1,
0 273 593869.51 6734999.92 121.57 1,
0 273 593869.15 6734999.89 121.57 1,
0 273 593868.79 6734999.86 121.65 1,
0 273 593868.44 6734999.84 121.65 1,
0 273 593869.00 6734999.94 124.21 1,
0 273 593868.68 6734999.92 124.32 1,
0 273 593868.39 6734999.90 124.44 1,
0 273 593866.94 6734999.71 121.37 1,
0 273 593868.73 6734999.99 127.28 1,
I have a simple filtering function in Python 2.7 on Windows. The function reads the entire file, selects the lines with the same idtile (first and second columns), and returns the list of points (x, y, z, and label) together with the idtile.
import numpy as np

tiles_id = [j for j in np.ndindex(ny, nx)]  # ny = number of rows, nx = number of columns
idtile = tiles_id[0]

def file_filter(name, idtile):
    lst = []
    for line in open(name, mode="r"):
        element = line.split()  # split the line into its columns
        if (int(element[0]), int(element[1])) == idtile:
            lst.append(element[2:])  # keep x, y, z and label
            dy, dx = int(element[0]), int(element[1])
    return lst, dy, dx
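For example, extracting the points of the first tile looks like this (the file name "points.txt" is a placeholder used only for illustration):

# "points.txt" stands in for the real multi-GB input file
points, dy, dx = file_filter("points.txt", idtile)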
The file is more than 32 GB and the bottleneck is reading the file. I am looking for suggestions or examples to speed up this function (e.g. parallel computing or other approaches).
My current workaround is to split the text file into tiles (using the x and y locations), as sketched below. That solution is not elegant and I am looking for a more efficient approach.
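For reference, a minimal sketch of that pre-splitting pass, assuming the input is called "points.txt" and the per-tile files go into a "tiles" directory (both names are hypothetical):

import os

def split_into_tiles(name, out_dir="tiles"):
    # Write each line to the file of its (dy, dx) tile, so later queries
    # only read one small per-tile file instead of the whole 32 GB input.
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    handles = {}  # one open file handle per tile id
    for line in open(name, mode="r"):
        element = line.split()
        key = (int(element[0]), int(element[1]))
        if key not in handles:
            path = os.path.join(out_dir, "tile_%d_%d.txt" % key)
            handles[key] = open(path, "w")
        handles[key].write(line)
    for f in handles.values():
        f.close()

Keeping one handle open per tile avoids reopening a file for every line, but with a very large number of tiles the handles may have to be opened and closed in batches to stay under the operating system's open-file limit.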