
I am a beginner and would appreciate any alternatives for handling my problem. Simply put, I have two files, each containing one vector. The aim is to subtract every element of file 2 from every element of file 1, for all possible combinations. Everything is fine for small vectors, but the processing time is huge for larger files, such as ones with a million elements each.

Given below is a minimal working example. I have heard about memory mapping and would appreciate it if you could share a modified version or any relevant pointers to handle this issue.

import numpy as np

# Load the second column of each file as a 1-D vector of events.
file1 = np.loadtxt('File_1.txt', delimiter=',')
file2 = np.loadtxt('File_2.txt', delimiter=',')
event1 = file1[:, 1]
event2 = file2[:, 1]

r1 = len(event1)
r2 = len(event2)

diff = []

# Brute force: every pair (i, j), keeping deltas within the +/-4000 window.
for i in range(r1):
    for j in range(r2):
        delta = float(event1[i] - event2[j])
        if -4000 <= delta <= 4000:
            diff = np.append(diff, delta)

np.savetxt('export.txt', diff)
nuki
  • A million elements in each file is 10**12 combinations if you are doing it brute force. You can't do it like that. – Christian Sloper Jun 25 '20 at 21:48
  • _would appreciate if you can share a modified version or any relevant pointers to handle this issue._ I believe that’s off-topic. Looking at the code, is the file IO even the issue? If your code involves lots of iterating and appending, NumPy arrays might not be the appropriate choice of data structure. Have you done any profiling? Please see [ask], [help/on-topic]. – AMC Jun 25 '20 at 21:49
  • Hi Christian, yeah, I am not able to figure out a way since I am not a programming expert. Will appreciate any assistance. Thanks! – nuki Jun 25 '20 at 21:50
  • 1
    Ok, i am going to bed now so ill just give you a quick tip. If this isn't solved tomorrow, ill type a proper answer. You need to load one file and sort it. Then for each element in the other file you can use binary search to find the range of elements that fit your delta. that will be O(n log n) which is a lot lot faster. – Christian Sloper Jun 25 '20 at 21:53
  • Hi AMC, I didn't know about profiling and am searching on Google now. Can you please suggest more keywords that I can search for and learn about? – nuki Jun 25 '20 at 21:53
  • Can you post small sample input files and the expected result? – tdelaney Jun 25 '20 at 21:57
  • Hi Tdelaney, Here are the files: https://drive.google.com/drive/folders/1-iKixHtJITv0atK-Juwt8gSnnjEIVsfv?usp=sharing – nuki Jun 25 '20 at 22:06
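
To make the binary-search tip from the comments concrete, here is a minimal sketch using Python's bisect module. It assumes event1 and event2 are the 1-D vectors loaded in the question and keeps the same ±4000 window; this is an illustration of the comment, not code from the original thread:

import bisect
import numpy as np

file1 = np.loadtxt('File_1.txt', delimiter=',')
file2 = np.loadtxt('File_2.txt', delimiter=',')
event1 = file1[:, 1]
event2 = np.sort(file2[:, 1])  # sort once so binary search is possible

diff = []
for e in event1:
    # First index with event2[lo] >= e - 4000, first index with event2[hi] > e + 4000.
    lo = bisect.bisect_left(event2, e - 4000)
    hi = bisect.bisect_right(event2, e + 4000)
    # Only event2[lo:hi] can yield a delta inside the window.
    diff.extend(e - event2[lo:hi])

np.savetxt('export.txt', diff)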

2 Answers


Here is a solution that is O(n log m + number_of_matches). It can still degrade to O(n*m), since the number of matches can be that large (if all the elements are close to each other in value), but with few matches it is much faster:

# Two-pointer approach: sort both arrays, then sweep a window over event2.
event1 = sorted(event1)
event2 = sorted(event2)

diff = []

# Iterate over the smaller array (note: swapping flips the sign of the deltas).
if len(event1) > len(event2):
    event1, event2 = event2, event1

left = 0
right = 0
for e in event1:
    # Advance left until event2[left] is within -4000 of e.
    while left < len(event2) and event2[left] - e < -4000:
        left += 1
    # Advance right until event2[right] is outside +4000 of e.
    while right < len(event2) and event2[right] - e <= 4000:
        right += 1
    # Every element in event2[left:right] is within the window.
    for i in range(left, right):
        diff.append(e - event2[i])

Testing on my machine, this is about 6 times faster on the example files. It will be much faster when delta is small relative to the spread of the numbers (few hits), and about the same, or even slower, when delta is very big (many hits).
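
For comparison, the same window search can be written with NumPy's np.searchsorted doing the work of the two pointers. This is an editor's sketch under the same assumptions (event1 and event2 are 1-D arrays), not part of the original answer; collecting slices and concatenating once avoids the per-element np.append copies:

import numpy as np

event1 = np.asarray(event1)
event2 = np.sort(np.asarray(event2))

# For each e in event1, locate the slice of event2 inside [e - 4000, e + 4000].
lefts = np.searchsorted(event2, event1 - 4000, side='left')
rights = np.searchsorted(event2, event1 + 4000, side='right')

# Gather the per-element slices, then flatten once at the end.
chunks = [e - event2[l:r] for e, l, r in zip(event1, lefts, rights)]
diff = np.concatenate(chunks) if chunks else np.array([])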

Christian Sloper
  • The last for loop can be replaced with diff.append(e - event2[left:right]). – Pramote Kuacharoen Jun 26 '20 at 11:08
  • Christian: Very neat, I read about binary search yesterday and the solution now makes sense to me. Life Saver! – nuki Jun 26 '20 at 18:06
  • Pramote: I tried your suggestion. However, I encounter an error when saving with np.savetxt('export3.txt', diff). Do you know what is going on here? See error.txt here: https://drive.google.com/drive/folders/1-iKixHtJITv0atK-Juwt8gSnnjEIVsfv?usp=sharing – nuki Jun 26 '20 at 18:13
  • Hi Christian, the append method is now giving me a memory error for sizes above a million events in each file. Do you have any recommendations? – nuki Jul 15 '20 at 20:52
  • You can save to file every so often when diff gets large and begin a new table (keep memory smaller by using the disk); a sketch of this idea follows these comments. – Christian Sloper Jul 16 '20 at 08:56
  • @ChristianSloper: So, I followed your suggestion. I created an HDF5 file, which now stores the deltas in its dataset, but it is really slow for a mere 100 MB of data. This will be even worse for GBs of data. Please check: https://stackoverflow.com/questions/63177247/python-fastest-way-to-subtract-elements-of-datasets-of-hdf5-file – nuki Jul 30 '20 at 16:55
  • You are really trying to store a huge number of combinations. Remember that this can be as many as m*n (which is trillions if m and n are in the millions). Do you have to store every combination, or is it good enough to know every element of array1 for which at least one element of array2 lies within delta? – Christian Sloper Jul 30 '20 at 21:42
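
A minimal sketch of the chunked-saving idea from these comments, assuming event1 and event2 are 1-D NumPy arrays and using a hypothetical flush threshold of one million deltas; the buffer is concatenated and written whenever it grows past the threshold, so memory stays bounded:

import numpy as np

CHUNK = 1_000_000  # hypothetical flush threshold; tune to available memory

event1 = np.sort(np.asarray(event1))
event2 = np.sort(np.asarray(event2))

buffer, buffered = [], 0
with open('export.txt', 'w') as out:
    left = right = 0
    for e in event1:
        while left < len(event2) and event2[left] - e < -4000:
            left += 1
        while right < len(event2) and event2[right] - e <= 4000:
            right += 1
        chunk = e - event2[left:right]  # slicing works because event2 is an ndarray
        buffer.append(chunk)
        buffered += len(chunk)
        if buffered >= CHUNK:
            np.savetxt(out, np.concatenate(buffer))  # flush the buffer to disk
            buffer, buffered = [], 0
    if buffer:
        np.savetxt(out, np.concatenate(buffer))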

This may help a little bit.

import numpy as np
file1 = np.loadtxt('File_1.txt', delimiter=',')
file2 = np.loadtxt('File_2.txt', delimiter=',')
event1 = file1[:, 1]
event2 = file2[:, 1]

diff = []
for item in event2:
    # Vectorize the inner loop: subtract item from the whole event1 array at once.
    event_delta = event1 - item
    # Keep only the deltas inside the +/-4000 window.
    event_delta_filter = list(filter(lambda x: -4000 <= x <= 4000, event_delta))
    diff = np.append(diff, event_delta_filter)

print(diff)
Pramote Kuacharoen
  • Hi Pramote, though I appreciate your reply, I still do not see much improvement in computation time for my million events in each file. Do you have any other recommendations? – nuki Jun 26 '20 at 03:16
  • The time complexity is O(n*m). If the vectors are similar sizes, it is O(n^2), and it takes time to run: the more data you have, the more time it takes. In your case, if you double the size of your data, it will take four times longer. The above code tries to use as many NumPy features as possible (a fully vectorized sketch follows these comments). If you want better computational performance, you should switch to a language like C, C++, C#, or Java. – Pramote Kuacharoen Jun 26 '20 at 03:38
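
As a follow-up to that comment, the filtering itself can also stay inside NumPy via boolean masking. This is an editor's sketch, not part of the original answer; it is still O(n*m) overall, but with a much smaller constant than a Python-level filter(), and a single concatenation replaces the repeated np.append copies:

import numpy as np

chunks = []
for item in event2:
    event_delta = event1 - item
    # Boolean mask keeps the filtering in compiled NumPy code.
    mask = (event_delta >= -4000) & (event_delta <= 4000)
    chunks.append(event_delta[mask])

diff = np.concatenate(chunks) if chunks else np.array([])
print(diff)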