I have two text files (A and B), like this:

A:
1 stringhere 5
1 stringhere 3
...
2 stringhere 4
2 stringhere 4
...

B:
1 stringhere 4
1 stringhere 5
...
2 stringhere 1
2 stringhere 2
...

What I have to do is read the two files, then create a new text file like this one:

1 stringhere 5
1 stringhere 3
...
1 stringhere 4
1 stringhere 5
...
2 stringhere 4
2 stringhere 4
...
2 stringhere 1
2 stringhere 2
...

Using for loops, I created this function (in Python):

def find(arch, i):
    l = arch
    for line in l:
        lines = line.split('\t')
        if i == int(lines[0]):
            pass  # write the line to the text file
        else:
            break

Then I call the function like this:

for i in range(1,3):        
    find(o, i)
    find(r, i)  

What happens is that I lose some data: the first line that contains a different number is read, but it never reaches the final .txt file. In this example, 2 stringhere 4 and 2 stringhere 1 are lost.

Is there any way to avoid this?

Thanks in advance.

Bruno A

3 Answers

If the files fit in memory:

with open('A') as file1, open('B') as file2:
     L = file1.read().splitlines() 
     L.extend(file2.read().splitlines()) 
L.sort(key=lambda line: int(line.partition(' ')[0])) # sort by 1st column
print("\n".join(L)) # print result

This is an efficient method if the total number of lines is under a million. Otherwise, and especially if you have many sorted files, you could use heapq.merge() to combine them.
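As a rough sketch of the heapq.merge() route: it assumes each input is already sorted by its first column, that every line ends with a newline, and Python 3.5+ (for the key argument). The names first_column and merge_sorted are made up for illustration, and io.StringIO objects stand in for the real files.

```python
import heapq
import io


def first_column(line):
    """Return the integer in the first whitespace-separated field."""
    return int(line.split(None, 1)[0])


def merge_sorted(files, out):
    """Lazily merge already-sorted line iterables into out."""
    # heapq.merge reads one line at a time from each input, so the
    # files never have to fit in memory all at once.
    for line in heapq.merge(*files, key=first_column):
        out.write(line)


# In-memory stand-ins for files A and B (assumption: tiny sample data).
a = io.StringIO("1 stringhere 5\n1 stringhere 3\n2 stringhere 4\n2 stringhere 4\n")
b = io.StringIO("1 stringhere 4\n1 stringhere 5\n2 stringhere 1\n2 stringhere 2\n")
merged = io.StringIO()
merge_sorted([a, b], merged)
print(merged.getvalue(), end="")
```

With real files, the same function could be called as merge_sorted([file1, file2], out_file) inside a with block. For lines with equal keys, lines from the earlier input come out first, which matches the order asked for in the question.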

jfs

In your loop, you break when a line does not start with the same value as i, but by then the for loop has already consumed that line. When the function is called a second time with i+1, it starts at the line after it, so the first line of each new group is lost.
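The effect can be reproduced in a few lines. This is only an illustration: take is a hypothetical stand-in for the question's find, and an io.StringIO object plays the role of the open file.

```python
import io


def take(f, i):
    """Collect consecutive lines whose first field equals i (buggy on purpose)."""
    taken = []
    for line in f:
        if int(line.split()[0]) == i:
            taken.append(line)
        else:
            break  # this line has already been read and is now lost
    return taken


f = io.StringIO("1 a\n1 b\n2 c\n2 d\n")
first = take(f, 1)   # reads "1 a", "1 b", then consumes "2 c" and breaks
second = take(f, 2)  # resumes after "2 c", so only "2 d" is left
print(first, second)
```

The line "2 c" appears in neither list, which is exactly the kind of loss described in the question.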

Either read the whole files into memory beforehand (see @J.F.Sebastian's answer), or, if that is not an option, replace your function with something like:

def find(arch, i):
    l = arch
    while True:
        line=l.readline()
        lines = line.split('\t')
        if line != "" and i == int(lines[0]): # Need to catch end of file
            print " ".join(lines),
        else:
            l.seek(-len(line), 1) # Need to 'unread' the last read line
            break

This version 'rewinds' the cursor so that the next call to readline reads the correct line again. Note that mixing the implicit for line in l with the seek call is discouraged, hence the while True.

Example:

$ cat t.py
o = open("t1")
r = open("t2")
print o
print r


def find(arch, i):
    l = arch
    while True:
        line=l.readline()
        lines = line.split(' ')
        if line != "" and i == int(lines[0]):
            print " ".join(lines),
        else:
            l.seek(-len(line), 1)
            break

for i in range(1, 3):
    find(o, i)
    find(r, i)

$ cat t1 
1 stringhere 1
1 stringhere 2
1 stringhere 3
2 stringhere 1
2 stringhere 2
$ cat t2
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
$ python t.py
<open file 't1', mode 'r' at 0x100261e40>
<open file 't2', mode 'r' at 0x100261ed0>
1 stringhere 1
1 stringhere 2
1 stringhere 3
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
2 stringhere 1
2 stringhere 2
$ 
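The code above is Python 2. In Python 3, files opened in text mode reject a nonzero seek relative to the current position, so seek(-len(line), 1) raises io.UnsupportedOperation. A possible adaptation, offered as a sketch only, remembers the position with tell() before each readline() and seeks back to it. It assumes the input has no blank lines; the out parameter and the sep default are additions not present in the original function.

```python
import io


def find(arch, i, out, sep=None):
    """Copy consecutive lines whose first field equals i from arch to out."""
    while True:
        pos = arch.tell()        # remember where the next line starts
        line = arch.readline()
        # An empty string means end of file; otherwise compare the first field.
        if line == "" or int(line.split(sep)[0]) != i:
            arch.seek(pos)       # 'unread' the line for the next call
            break
        out.write(line)


# In-memory stand-ins for t1 and t2 (assumption: same data as the example above).
o = io.StringIO("1 stringhere 1\n1 stringhere 2\n1 stringhere 3\n"
                "2 stringhere 1\n2 stringhere 2\n")
r = io.StringIO("1 stringhere 4\n1 stringhere 5\n"
                "2 stringhere 1\n2 stringhere 2\n")
result = io.StringIO()
for i in range(1, 3):
    find(o, i, result)
    find(r, i, result)
print(result.getvalue(), end="")
```

Passing sep='\t' would match the tab-separated files from the question. As before, readline() is used instead of for line in arch, because tell() is not available while a text file is being iterated.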
damienfrancois
  • I'm getting `ValueError: invalid literal for int() with base 10: ''` using this. Am I doing something wrong? – Bruno A Nov 10 '13 at 16:33
  • I used spaces as the delimiter in my tests, rather than tabs. I just updated the code to match your initial code snippets. Maybe that was the problem? If a line is empty, you can get that error too, but the `if line != ""` part is supposed to avoid that. – damienfrancois Nov 10 '13 at 16:37
  • I was using '\t' here too, and there are no empty lines in the text file. The first lines are printed, but the error occurs exactly when the first number changes from 1 to 2. – Bruno A Nov 10 '13 at 16:41
  • Can you reproduce my example? – damienfrancois Nov 10 '13 at 16:47
  • I managed to get it working; it seems I had missed something in your answer! Thanks a lot! – Bruno A Nov 10 '13 at 16:51

There may be a less complicated way to accomplish this. The following also keeps the lines in the order they appear in the files, as it appears you want to do.

lines = []
lines.extend(open('file_a.txt').readlines())
lines.extend(open('file_b.txt').readlines())
lines = [line.strip('\n') + '\n' for line in lines]
key = lambda line: int(line.split()[0])
open('out_file.txt', 'w').writelines(sorted(lines, key=key))

The first three lines read the input files into a single list of lines.

The fourth line ensures that each line has exactly one newline at the end. If you're sure both files will end in a newline, you can omit this line.

The fifth line defines the key for sorting as the integer version of the first word of the string.

The sixth line sorts the lines and writes the result to the output file.

Pi Marillion
  • don't use `cmp`, use [`key` as in my answer](http://stackoverflow.com/a/19891465/4279) instead. `cmp` is removed in Python 3 (for a reason). Don't use `"\r\n"`: Python translates `"\n"` to `"\r\n"` while writing on Windows for you automatically. – jfs Nov 10 '13 at 16:20
  • `sorted(lines)` might break numeric sort. Compare `sorted("1 10 9".split())` and `sorted(map(int, "1 10 9".split()))` – jfs Nov 10 '13 at 16:28
  • @JFSebastian Thanks for your help! I've incorporated both of your suggestions in the answer. – Pi Marillion Nov 10 '13 at 17:36
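The numeric-sort point from the comments can be checked with a short illustration. The variable names here are invented for the example.

```python
fields = "1 10 9".split()

as_text = sorted(fields)              # compares strings character by character
as_numbers = sorted(fields, key=int)  # compares the integer values

print(as_text)     # ['1', '10', '9']
print(as_numbers)  # ['1', '9', '10']
```

Sorting the raw strings puts '10' before '9', which is why the answers above convert the first column with int() in the sort key.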