
If I have two files with DNA sequences and everything is in order (matching IDs, so finding the corresponding sequence is easy), how do I merge each pair of sequences into one consensus? My example is below (not using real DNA sequences so it's easier to read).

Note: all IDs are identical and in the same order, and the lengths of the sequences are the same. For example, if I have file A with:

>id1
THISISA-----
>id2
HELLO-------
>id3
TESTTESTTEST

And a second file B with:

>id1
-------TEST!
>id2
-----WORLD!!
>id3
TESTTESTTEST

My ideal output is simply (in a new file C):

>id1
THISISATEST!
>id2
HELLOWORLD!!
>id3
TESTTESTTEST

I am terrible with strings in Python, and so far I've only managed to open each file with readlines and save the content. Essentially, gaps are marked with "-", and if the other file has a non-gap character at that position, I want it to replace the hyphen.

Tips on how to start are appreciated; I don't have code to provide other than:

import sys

file1 = sys.argv[1]
file2 = sys.argv[2]

msa1_seqs = []
msa1_ids = []

with open(file1, "r") as f1:
    content1 = f1.readlines()
for i in range(len(content1)):
    if i % 2 == 1:  # odd lines hold the DNA sequence
        msa1_seqs.append(content1[i])
    else:           # even lines hold the >id header
        msa1_ids.append(content1[i])

I repeated the above code to open the second file (file2) and kept the text in lists msa2_seqs and msa2_ids. Now I am just stuck on accessing the matching elements from both lists at the same time, so I can write another loop that replaces each "-" with a character if the other sequence has one at that position.
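Conceptually, for a single pair of already-matched sequences, I think I'm after something like this character-by-character replacement (just a rough sketch of the idea, untested):

# rough idea: fill gaps in seq_a with the characters from seq_b
seq_a = "THISISA-----"
seq_b = "-------TEST!"

consensus = ""
for char_a, char_b in zip(seq_a, seq_b):
    # keep char_a unless it is a gap and seq_b has a real character there
    consensus += char_a if char_a != "-" else char_b

print(consensus)  # expect: THISISATEST!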

CuriousDude
  • Oh yes I would, but each line can be matched, because even >id1 from both files should return the same thing since the IDs are identical and have no gaps. I will edit my question to include this. – CuriousDude Apr 25 '19 at 01:08
  • Maybe I should treat this as a dataframe problem and use pandas? – CuriousDude Apr 25 '19 at 01:25

2 Answers


You can first collect your lines by their >id{int} key in a collections.defaultdict, then write the grouped lines out to a file. This method will also work if you have more than two files.

It also seems that you don't want to concatenate strings that are the same. If this is the case and you also want to preserve insertion order, you can use collections.OrderedDict from the Python standard library with just its keys.

However, as of Python 3.7 (and in CPython 3.6 as an implementation detail), the standard dict is guaranteed to preserve insertion order. If this is the version of Python you are using, then there is no need for OrderedDict; otherwise you can keep using it for portability.
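For example, here is a minimal illustration of the OrderedDict.fromkeys trick used in the demo below; it keeps only the first occurrence of each string while preserving order:

from collections import OrderedDict

# duplicates are dropped while the first-seen order is preserved
parts = ["TESTTESTTEST", "TESTTESTTEST"]
print("".join(OrderedDict.fromkeys(parts)))  # TESTTESTTEST

parts = ["THISISA", "TEST!"]
print("".join(OrderedDict.fromkeys(parts)))  # THISISATEST!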

Demo:

from collections import defaultdict
from collections import OrderedDict

def collect_lines(dic, file, key, delim):
    curr_key = None

    for line in file:
        line = line.strip()

        # Check if new key has been found
        if line.startswith(key):
            curr_key = line
            continue

        # Otherwise add line with delim replaced
        dic[curr_key].append(line.replace(delim, ""))

d = defaultdict(list)

files = ["A.txt", "B.txt"]

# Collect lines from each file
for file in files:
    with open(file) as fin:
        collect_lines(dic=d, file=fin, key=">id", delim="-")

# Write new content to output
with open("output.txt", mode="w") as fout:
    for k, v in d.items():
        fout.write("%s\n%s\n" % (k, "".join(OrderedDict.fromkeys(v))))

output.txt:

>id1
THISISATEST!
>id2
HELLOWORLD!!
>id3
TESTTESTTEST
RoadRunner

You can iterate over both input files line by line and write to the output file at the same time. This is file_a.txt:

>id1
THISISA-----
>id2
HELLO-------
>id3
TESTTESTTEST

This is file_b.txt:

>id1
-------TEST!
>id2
-----WORLD!!
>id3
TESTTESTTEST

Here is the code:

#!/usr/bin/env python3
def merge(file_a, file_b, file_c, gap='-'):

    with open(file_a) as fa, open(file_b) as fb, open(file_c, 'w') as fc:

        for line_a, line_b in zip(fa, fb):

            if line_a.startswith('>id'):
                fc.write(line_a)
                continue

            s = ''.join(a if a != gap else b for a, b in zip(line_a, line_b))
            fc.write(s)


if __name__ == '__main__':
    merge('file_a.txt', 'file_b.txt', 'file_c.txt')

This is the content of the resulting file_c.txt:

>id1
THISISATEST!
>id2
HELLOWORLD!!
>id3
TESTTESTTEST

Note that with this approach you don't have to load the whole content of your files into memory before processing. If your DNA files are really big, this will matter.
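One caveat: zip stops at the end of the shorter file, which is fine here since the question guarantees identical IDs in identical order. If that assumption ever breaks, one option would be itertools.zip_longest, roughly like this sketch:

from itertools import zip_longest

# possible variant of the loop above: if the files could ever differ in
# length, zip_longest pads the shorter one with empty strings instead of
# silently truncating like zip does
with open('file_a.txt') as fa, open('file_b.txt') as fb:
    for line_a, line_b in zip_longest(fa, fb, fillvalue=''):
        print(repr(line_a), repr(line_b))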

constt