
I have two separate text files, A and B, each with more than 100,000 lines. I need to compare them and find the records that are in A but not in B.

File A:

>Q63544|9
----------------------MDVFKKGFSIAREGVVGAVEKTKQGVTEAAEKTKEGVMY
>Q63544|51
KTKQGVTEAAEKTKEGVMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKT
>Q63544|54
QGVTEAAEKTKEGVMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEE
>Q63544|67
VMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRK
>Q63544|72
TKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRKEDLEP
>Q63544|73
KTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRKEDLEPP

File B:

>Q63544|51
KTKQGVTEAAEKTKEGVMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKT
>Q63544|54
QGVTEAAEKTKEGVMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEE
>Q63544|67
VMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRK
>Q63544|73
KTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRKEDLEPP

What I need: A-B

>Q63544|9
----------------------MDVFKKGFSIAREGVVGAVEKTKQGVTEAAEKTKEGVMY 
>Q63544|72
TKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRKEDLEP

Any help or suggestion would be appreciated.

psuresh
  • Also check [can someone help me?](https://meta.stackoverflow.com/questions/284236/why-is-can-someone-help-me-not-an-actual-question). This suggests a range of needs too broad for Stack Overflow. – Prune Mar 14 '21 at 06:21
  • I might approach this problem using a trie data structure. Load the data from one of the files into the trie and then for each entry in the second file, search the trie to see if the item exists. You could put descriptive metadata in each end node. https://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python – djhallx Mar 17 '21 at 01:49

1 Answer

You can try a regular expression:

import re

# The sample data is inlined here as text, but you can read it from a file, e.g. text_1 = open('file_1.txt').read()
text_1 = """>Q63544|9
----------------------MDVFKKGFSIAREGVVGAVEKTKQGVTEAAEKTKEGVMY
>Q63544|51
KTKQGVTEAAEKTKEGVMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKT
>Q63544|54
QGVTEAAEKTKEGVMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEE
>Q63544|67
VMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRK
>Q63544|72
TKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRKEDLEP
>Q63544|73
KTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRKEDLEPP"""

text_2 = """>Q63544|51
KTKQGVTEAAEKTKEGVMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKT
>Q63544|54
QGVTEAAEKTKEGVMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEE
>Q63544|67
VMYVGTKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRK
>Q63544|73
KTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRKEDLEPP
"""

# The pipe is escaped (\|) so it matches a literal '|' instead of acting as regex alternation.
protein_1 = dict(re.findall(r'^>Q63544\|(\d+)\n(.*)', text_1, re.MULTILINE))
protein_2 = dict(re.findall(r'^>Q63544\|(\d+)\n(.*)', text_2, re.MULTILINE))

dissimilar_protein = {i:protein_1[i] for i in protein_1 if i not in protein_2}

print(list(dissimilar_protein.values()))
['----------------------MDVFKKGFSIAREGVVGAVEKTKQGVTEAAEKTKEGVMY', 'TKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRKEDLEP']

Or, if you want that specific format:

output = [f'>Q63544|{i}\n{j}' for i,j in dissimilar_protein.items()]
for i in output:
    print(i)
>Q63544|9
----------------------MDVFKKGFSIAREGVVGAVEKTKQGVTEAAEKTKEGVMY
>Q63544|72
TKTKGERGTSVTSVAEKTKEQANAVSEAVVSSVNTVATKTVEEAENIVVTTGVVRKEDLEP

NOTE: I am assuming all protein IDs start with Q63544.
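If that assumption does not hold, or the files are too large to read comfortably into one string, a variant without the regex might look like the sketch below. It is only one possible approach: it assumes FASTA-style records (a header line starting with `>` followed by one or more sequence lines) and, like the code above, treats two records as the same when their headers match. The names `read_fasta` and `records_only_in_a`, the shortened sequences, and the `P12345` accession are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-style text stream."""
    header, seq_lines = None, []
    for line in handle:
        line = line.rstrip()  # drops the newline and any trailing spaces
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(seq_lines)
            header, seq_lines = line, []
        elif line and header is not None:
            seq_lines.append(line)
    if header is not None:
        yield header, ''.join(seq_lines)


def records_only_in_a(handle_a, handle_b):
    """Return the records of A whose header does not occur in B."""
    # Only the headers of B are kept in memory; set lookups are O(1) on average.
    headers_b = {header for header, _ in read_fasta(handle_b)}
    return [(h, s) for h, s in read_fasta(handle_a) if h not in headers_b]


# Illustrative in-memory data; with real files, pass open('A.txt') and open('B.txt').
file_a = io.StringIO(">Q63544|9\nMDVF\n>Q63544|51\nKTKQ\n>P12345|72\nTKTK\n")
file_b = io.StringIO(">Q63544|51\nKTKQ\n")

for header, seq in records_only_in_a(file_a, file_b):
    print(header)
    print(seq)
```

Because the whole header line is used as the key, this does not depend on any particular accession prefix. If records should count as equal only when the sequence matches as well, the set could hold `(header, sequence)` tuples instead.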

Epsi95