Find/replace in large datasets with Python

Question

I have a 3GB file a.txt of the form:

a 20
g 33
e 312
....

And a file b.txt that maps the keys in a.txt to words:

e elephant
a apple
g glue
....

I'd like to merge these two files to create c.txt like:

apple 20
glue 33
elephant 312
...

I tried to write a simple for loop to do this, but it failed: when I run the Python file, it runs for about 2 seconds and then stops.

What exactly makes it impossible to write a simple `for` loop? Just don't load the entire file into memory. Anyways, this has been done a million times, let me look for a duplicate. — Nelewout, Jan 12 '16 at 16:51
I've flagged too much today, but [this](http://stackoverflow.com/questions/11555468/how-should-i-read-a-file-line-by-line-in-python) is pretty much a duplicate. — Nelewout, Jan 12 '16 at 16:55
@N.Wouda I used this all the time but it is quite slow when the dataset is bigger than normal — user5779223, Jan 12 '16 at 17:04
I am curious how "a" could be so big, there are only 26 letters in the alphabet, What is missing here — PyNEwbie, Jan 12 '16 at 17:05
Well, you are reading lots of data from the filesystem, so you really cannot expect great performance (disk IO is just slow). — Nelewout, Jan 12 '16 at 17:06
@PyNEwbie the first column of `a.txt` probably does not contain unique values (so there will be many entries of `a`, `g` etc.). It's just a record. — Nelewout, Jan 12 '16 at 17:06
So then what is the output supposed to be, this is easy if there are only 26 keys, it is actually very trivial. We could process one set of keys at a time open a, consume only the lines that have a-g,, open b consume only lines that have a-g, etc. but then the output is uncertain, I am confused, if they are not in order and a is a key multiple times how do I know I have 20 apples? — PyNEwbie, Jan 12 '16 at 17:09
@PyNEwbie open `b.txt`, build the mapping required for `a.txt` as a dictionary, then process `a.txt` line-by-line and write the mapped values to `c.txt`. Want to earn some simple rep and put this into an answer? I cannot be bothered, as this really should be closed as a dupe. — Nelewout, Jan 12 '16 at 17:12
@PyNEwbie Actually, that is not exactly the situation. I simplified the case for convenience. The items in a.txt are unique strings, and there is another set of unique strings in b.txt. — user5779223, Jan 12 '16 at 17:14
I think there is still some confusion about what the OP wants. In a we have a_20, but we also have a_n. If the keys are not ordered the same way across a and b, what determines the assignment? Is it first to first? If so, it is still trivial. I think this question needs more detail. — PyNEwbie, Jan 12 '16 at 17:15
so the question is are the "keys' unique - is there only one value for "zsd" in a and one for zsd in b? — PyNEwbie, Jan 12 '16 at 17:16
@PyNEwbie Ignore the fact that there are just 26 alphabets. The key and value are one-to-one and they are determined in the b.txt — user5779223, Jan 12 '16 at 17:30

randomusername · Answer 1 · 2016-01-12T17:07:22.443

0

This can be done with a dictionary, like so:

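mapping = {}
# Build the key -> word lookup from b.txt.
with open('b.txt') as f:
  for line in f:
    key, value = line.split()
    mapping[key] = value
# Stream a.txt line by line so the 3GB file is never held in memory.
with open('a.txt') as i, open('c.txt', 'w') as o:
  for line in i:
    key, value = line.split()
    if key in mapping:
      # Mapped word first, then the number, e.g. "apple 20".
      print(mapping[key], value, file=o)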

So what if a.txt is 3GB? Only one line of a.txt is held in memory at a time, so on a modern desktop computer this should still run quickly, limited mostly by disk I/O.
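One possible reason for a script like this to stop after a couple of seconds is a line that does not split into exactly two fields (a blank line, for example), which makes the tuple unpacking raise a ValueError. That is only a guess, since the question does not show the error. Below is a minimal sketch of the same dictionary approach that skips such lines. The function names `load_mapping` and `merge_files` and the returned count are hypothetical, introduced here for illustration.

```python
def load_mapping(map_path):
    """Read 'key value' lines into a dict; lines without two fields are skipped."""
    mapping = {}
    with open(map_path) as f:
        for line in f:
            parts = line.split(None, 1)  # split on the first run of whitespace only
            if len(parts) == 2:
                mapping[parts[0]] = parts[1].strip()
    return mapping


def merge_files(map_path, data_path, out_path):
    """Stream data_path, replace each key via the mapping, write to out_path.

    Returns the number of lines written (an illustrative extra, not in the
    original answer). Malformed lines and unknown keys are skipped silently.
    """
    mapping = load_mapping(map_path)
    written = 0
    with open(data_path) as src, open(out_path, "w") as dst:
        for line in src:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # blank or one-field line
            key, value = parts
            if key in mapping:
                dst.write(f"{mapping[key]} {value.strip()}\n")
                written += 1
    return written
```

Calling `merge_files('b.txt', 'a.txt', 'c.txt')` would mirror the snippet above. Whether skipping bad lines silently is acceptable depends on the data; logging them may be the safer choice.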

edited Jan 12 '16 at 17:07

answered Jan 12 '16 at 16:57


Are you sure you're not mistaking the two files? Shouldn't `a` and `b` be opened the other way around? – Nelewout Jan 12 '16 at 17:04
@N.Wouda ok, i did have them mixed up. Good catch! – randomusername Jan 12 '16 at 17:09
@randomusername I tried it in IPython and it even stopped working – user5779223 Jan 12 '16 at 17:45

PyNEwbie · Answer 2 · 2016-01-12T17:42:26.867

Well, to strictly answer your question: this reads a.txt line by line and, for each line, opens b.txt, scans it for a matching key, writes the result out, and closes b.txt before moving on to the next line of a.txt. Only one line of a.txt is read at a time. I am inferring that there is a one-to-one, unordered match.

def process(a, b, outpath):
    with open(outpath, 'w') as outref, open(a, 'r') as fh:
        for line in fh:
            key, value = line.split()
            # Rescan b from the top for every line of a.
            with open(b, 'r') as fh_b:
                for b_line in fh_b:
                    bkey, bvalue = b_line.split()
                    if bkey == key:
                        outref.write(bvalue + ' ' + value + '\n')
                        # Keys are one-to-one, so stop at the first match.
                        break
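Rescanning b for every line of a keeps memory use minimal, but the number of lines read from b grows with the product of the two file sizes. The sketch below counts those reads on in-memory lists instead of files; `count_b_reads` is a hypothetical helper written for this comparison, not part of either answer.

```python
def count_b_reads(a_lines, b_lines):
    """Return (nested_scan_reads, single_pass_reads) for the given line lists.

    nested_scan_reads: lines of b examined when b is rescanned for each line
    of a, stopping at the first match.
    single_pass_reads: lines of b examined when b is loaded into a dict once.
    """
    nested = 0
    for line in a_lines:
        key = line.split()[0]
        for b_line in b_lines:
            nested += 1
            if b_line.split()[0] == key:
                break  # one-to-one keys: no need to look further
    single = len(b_lines)  # a dict is built from one pass over b
    return nested, single


a_lines = ["a 20", "g 33", "e 312"]
b_lines = ["e elephant", "a apple", "g glue"]
print(count_b_reads(a_lines, b_lines))  # -> (6, 3)
```

With three lines in each file the gap is small, but with millions of lines in a.txt the nested scan repeats the b.txt read millions of times, so it is mainly worth considering when the mapping cannot fit in memory.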
