I have a 3GB file a.txt of the form:

a 20
g 33
e 312
....

And a b.txt file that maps the keys in a.txt to words:

e elephant
a apple
g glue
....

I'd like to merge these two files to create c.txt like:

apple 20
glue 33
elephant 312
...

I tried to write a simple for loop to do this but failed. When I run the Python script, it runs for 2 seconds and then stops.

user5779223
  • What exactly makes it impossible to write a simple `for` loop? Just don't load the entire file into memory. Anyway, this has been done a million times, let me look for a duplicate. – Nelewout Jan 12 '16 at 16:51
  • @N.Wouda how can I do that? – user5779223 Jan 12 '16 at 16:51
  • @N.Wouda Thank you so much. – user5779223 Jan 12 '16 at 16:53
  • I've flagged too much today, but [this](http://stackoverflow.com/questions/11555468/how-should-i-read-a-file-line-by-line-in-python) is pretty much a duplicate. – Nelewout Jan 12 '16 at 16:55
  • @N.Wouda I use this all the time, but it is quite slow when the dataset is larger than usual. – user5779223 Jan 12 '16 at 17:04
  • I am curious how "a" could be so big when there are only 26 letters in the alphabet. What is missing here? – PyNEwbie Jan 12 '16 at 17:05
  • Well, you are reading lots of data from the filesystem, so you really cannot expect great performance (disk IO is just slow). – Nelewout Jan 12 '16 at 17:06
  • @PyNEwbie the first column of `a.txt` probably does not contain unique values (so there will be many entries of `a`, `g` etc.). It's just a record. – Nelewout Jan 12 '16 at 17:06
  • So then what is the output supposed to be? This is easy if there are only 26 keys; it is actually trivial. We could process one set of keys at a time: open a and consume only the lines that have a-g, open b and consume only the lines that have a-g, etc. But then the output is uncertain. I am confused: if they are not in order and a is a key multiple times, how do I know I have 20 apples? – PyNEwbie Jan 12 '16 at 17:09
  • @PyNEwbie open `b.txt`, build the mapping required for `a.txt` as a dictionary, then process `a.txt` line-by-line and write the mapped values to `c.txt`. Want to earn some simple rep and put this into an answer? I cannot be bothered, as this really should be closed as a dupe. – Nelewout Jan 12 '16 at 17:12
  • @PyNEwbie Actually, that is not exactly the situation. I simplified the case for convenience. There is one set of unique strings in a.txt and another set of unique strings in b.txt. – user5779223 Jan 12 '16 at 17:14
  • I think there is still some confusion about what the OP wants. In a we have a_20, but we also have a_n. If the keys are not ordered the same way across a and b, what determines the assignment? Is it first to first? If so, it is still trivial. I think this question needs some more detail. – PyNEwbie Jan 12 '16 at 17:15
  • So the question is: are the "keys" unique? Is there only one value for "zsd" in a and one for "zsd" in b? – PyNEwbie Jan 12 '16 at 17:16
  • @PyNEwbie Ignore the fact that there are just 26 letters. The keys and values are one-to-one, and they are determined by b.txt. – user5779223 Jan 12 '16 at 17:30

2 Answers

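This can be done with a dictionary, like so:

mapping = {}
# Build the key -> word lookup table from b.txt.
with open('b.txt') as f:
  for line in f:
    key, value = line.split()
    mapping[key] = value
# Stream a.txt line by line and write the translated lines to c.txt.
with open('a.txt') as i:
  with open('c.txt', 'w') as o:
    for line in i:
      key, value = line.split()
      if key in mapping:
        # Mapped word first, then the number, e.g. "apple 20".
        print(mapping[key], value, file=o)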

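So what if a.txt is 3GB? Only the mapping from b.txt is held in memory, and a.txt is read one line at a time, so memory use stays small and the run time is dominated by disk I/O. On a modern desktop computer this should still run reasonably quickly.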
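
If the real files are messier than the sample (blank lines, keys that are missing from b.txt, values containing spaces), the same idea can be wrapped in a function. This is only a sketch under those assumptions: the name `merge_files`, its parameters, and the choice to count skipped lines are illustrative and not part of the question.

```python
def merge_files(map_path, data_path, out_path):
    """Write 'mapped_word value' lines to out_path; return the number of skipped lines."""
    # Load the key -> word table; b.txt is assumed small enough to fit in memory.
    mapping = {}
    with open(map_path) as map_file:
        for line in map_file:
            parts = line.split(None, 1)  # split on the first run of whitespace only
            if len(parts) == 2:
                mapping[parts[0]] = parts[1].strip()

    skipped = 0
    # Stream the large file: only one line is held in memory at a time.
    with open(data_path) as data_file, open(out_path, "w") as out_file:
        for line in data_file:
            parts = line.split(None, 1)
            if len(parts) != 2 or parts[0] not in mapping:
                skipped += 1  # blank, malformed, or unmapped line
                continue
            out_file.write(f"{mapping[parts[0]]} {parts[1].strip()}\n")
    return skipped
```

Calling `merge_files('b.txt', 'a.txt', 'c.txt')` would produce c.txt in the order of a.txt, and a non-zero return value hints that some lines did not match the expected format or had no entry in b.txt.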

randomusername

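To answer your question strictly: this reads a.txt line by line, scans b.txt for a matching key, writes the result, closes b.txt, then reads the next line of a.txt and opens b.txt again, and so on. Only one line of a.txt is in memory at a time. Because b.txt is rescanned for every line of a.txt, this trades speed for minimal memory use. I am assuming a one-to-one, unordered match between the keys in the two files.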

def process(a, b, outpath):
    # Stream a.txt; for each line, rescan b.txt until the key is found.
    with open(a, 'r') as fh, open(outpath, 'w') as outref:
        for line in fh:
            key, value = line.split()
            with open(b, 'r') as fh_b:
                for b_line in fh_b:
                    bkey, bvalue = b_line.split()
                    if bkey == key:
                        outref.write(bvalue + ' ' + value + '\n')
                        break  # keys are one-to-one, so stop at the first match
PyNEwbie