How do I extract data from the 3rd column of multiple large text files based on first two matching columns?

Question

I have 3 large text files (>16 M lines each) in the format below.

Contents of file 1:

22_0F3, 33_0F4, 0.87
28_0F3, 37_0F4, 0.79
21_0F5, 39_2F1, 0.86

Contents of file 2:

22_0F3, 33_0F4, 1000
28_0F3, 37_0F4, 1500
21_0F2, 52_2F8, 3600

Contents of file 3:

22_0F3, 33_0F4, 0.75
28_0F3, 37_0F4, 0.91
81_0F2, 32_2F1, 0.84

I'm trying to extract the common lines based on 1st two fields from these 3 files.

Then I have to find the square root of the squares of each corresponding value from the 3rd column in each line (explained below).

The difficulty is that since these text files are really huge with more than 16 million lines, it is taking more time to load and extract common lines.

The common lines would be around 15M based on the data I have.

Intermediate output is something like this:

22_0F3, 33_0F4, 0.87, 1000, 0.75
28_0F3, 37_0F4, 0.79, 1500, 0.91

The desired output is:

22_0F3, 33_0F4, 1000.7575
28_0F3, 37_0F4, 1500.6245

where 1000.7575 is the square root of sum of squares of 0.87, 1000, and 0.75.

How can I get the desired output from these huge files without much delay?

Possible duplicate of [extract column value based on another column pandas dataframe](https://stackoverflow.com/questions/36684013/extract-column-value-based-on-another-column-pandas-dataframe) — Mike, May 31 '19 at 19:28
@Sara S - So are you saying that your original file is 16M lines, and that the number of matching lines is 15M? So there are only 1M non-matching lines? — Rob Truxal, May 31 '19 at 19:30
@Mike I did checked that. But the issue here is the size of files. — Sara S, May 31 '19 at 19:33
If you have to operate on all matching lines, there is no quick way to do this. If I were you, I would create 3 new files with ~100 lines a piece, & which have the same format as your 16M-line files. I would then use those to test my algorithm & set it loose on the massive files when I was confident it would work. Also, I would probably end up writing my intermediate & final results to a disk instead of keeping them in a python object, so I could prevent running out of RAM half way through. — Rob Truxal, May 31 '19 at 19:37
Your expected results do not correspond to the square root of the sum of squares. They seem to be closer to *f(x,y,z) = √(y^2 + 1/z^2) + x^2* because `sqrt(1000**2 + 1/0.75)+0.87**2=1000.75757` and `sqrt(1500**2 + 1/0.91)+0.79**2=1500.62446` instead of 1000.00065 and 1500.00048 which would be the results of *√∑n^2* — Alain T., Jun 03 '19 at 01:27

Alain T. · Accepted Answer · 2019-06-01T01:16:22.620

You could scan the 3 files and build a dictionary in memory with the code pair as key and the list of numbers as value. A dictionary with 16 million small items should be no problem for most modern PCs. Then go through the dictionary, make your calculation on items that have 3 values in the list and write them to the output file.

# Aggregation (build dictionary of lists) ...

from collections import defaultdict
data = defaultdict(list)
for fileName in ["file1.txt","file2.txt","file3.txt"]:
    with open(fileName,'r') as lines:
        for line in lines:
            col1,col2,value = line.split(",") 
            if col1>col2 : col1,col2 = col2,col1 # match codes in any order
            data[(col1,col2)].append(float(value))

# Calculation and output...

from math import sqrt
with open("output.txt","w") as output:
    for (col1,col2),values in data.items():
        if len(values) < 3: continue             # must have the 3 matches
        result = sqrt(sum( x*x for x in values)) # use your formula here                       
        output.write(f"{col1},{col2}, {result}\n")

This takes 68 seconds for 15,000,000 matching lines on my laptop. (but I have an SSD, a hard disk drive may take much longer)

Note that I used the square root of the sum of squares in my calculation. Based on your example, that's obviously not the right formula because √(0.87^2 + 1000^2 + 0.75^2) is 1000.0006597 not 1000.7575. I assume you will replace my √∑n^2 formula with your own calculation for the result.

I would like to ask a question. If I have line 22_0F3, 33_0F4, 0.87 in file1, 22_0F3, 33_0F4, 0.75 in file3 but 33_0F4, 22_0F3, 1000 in file2 which part of the above code should be changed? — Sara S, Jun 01 '19 at 00:01
So, if I understand correctly, you want the pairs to match even if they are inverted. In the aggregation part, before appending the value, add a line to ensure that col1 and col2 are in ascending order: `col1,col2 = min(col1,col2),max(col1,col2)` or something equivalent — Alain T., Jun 01 '19 at 01:15

How do I extract data from the 3rd column of multiple large text files based on first two matching columns?

1 Answers1