-1

I have a huge file that looks like this:

CAV-1 ATCTACTTCTATCG
CAV-2 GCGCGTAGCTAGCT
CAV-2 AAGCGCTCGTAAAA
CAV-3 AAATATATATATCC

Using Python, I want to delete the lines having a duplicate string, in this case "CAV-2". The first line having the string would remain. I would get this:

CAV-1 ATCTACTTCTATCG
CAV-2 GCGCGTAGCTAGCT
CAV-3 AAATATATATATCC

I know how to use regex and to parse through lines, but I am not able to do this specific task.


Lucas
  • is your file always sorted? – RomanPerekhrest Sep 26 '17 at 15:28
  • Split the lines, use the first part of each line as a dict key, and check for each line whether its first part is already a key in the dict. – Casimir et Hippolyte Sep 26 '17 at 15:30
  • Hi @Psidom, I don't want to delete duplicate lines, I want to delete lines containing a duplicate regex. This is the function that is familiar to me, but other alternatives are welcome. – Lucas Sep 26 '17 at 15:33
  • If the list can be unordered, or may be sorted later, you could use a set, see [here](https://stackoverflow.com/questions/7961363/removing-duplicates-in-lists) – RolfBly Sep 26 '17 at 15:38
  • If it were a sorted list within the string, a solution could be _Find_ `(?m)^((\S+)(?=\s).*\r?\n)\s*^\2(?=\s).*(?:\r?\n)?` _Replace_ `$1` https://regex101.com/r/X9HLww/1 –  Sep 26 '17 at 15:53
  • Or, a multiple version https://regex101.com/r/X9HLww/2 –  Sep 26 '17 at 16:02

3 Answers

3

Just use a dictionary

In [1]: lines = '''CAV-1 ATCTACTTCTATCG
   ...: CAV-2 GCGCGTAGCTAGCT
   ...: CAV-2 AAGCGCTCGTAAAA
   ...: CAV-3 AAATATATATATCC'''

In [2]: lines
Out[2]: 'CAV-1 ATCTACTTCTATCG\nCAV-2 GCGCGTAGCTAGCT\nCAV-2 AAGCGCTCGTAAAA\nCAV-3 AAATATATATATCC'

In [3]: res = {}

In [4]: for line in lines.split("\n"):
   ...:         res[line.split(" ")[0]] = line.split(" ")[1]
   ...:  

In [5]: res
Out[5]: 
{'CAV-1': 'ATCTACTTCTATCG',
 'CAV-2': 'AAGCGCTCGTAAAA',
 'CAV-3': 'AAATATATATATCC'}

In [6]: '\n'.join(['%s %s' % (key, value) for (key, value) in res.items()])
Out[6]: 'CAV-1 ATCTACTTCTATCG\nCAV-2 AAGCGCTCGTAAAA\nCAV-3 AAATATATATATCC'

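Note that this keeps the last line for each key, because later assignments overwrite earlier ones. If you want to preserve the first line instead, you can use a dictionary of lists and then output the first element of each list.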
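As a minimal sketch of the "keep the first line" variant, one option is `dict.setdefault`, which stores a value only when the key is not present yet. This is a swapped-in alternative to the dictionary of lists, not the answer's own code; the name `first_seen` is made up for illustration, and the sketch assumes Python 3.7+ so that the dictionary preserves insertion order.

```python
lines = """CAV-1 ATCTACTTCTATCG
CAV-2 GCGCGTAGCTAGCT
CAV-2 AAGCGCTCGTAAAA
CAV-3 AAATATATATATCC"""

first_seen = {}  # hypothetical name: maps each key to the first line carrying it
for line in lines.split("\n"):
    key = line.split(" ")[0]
    # setdefault stores the line only if the key is not present yet,
    # so later lines with the same key are ignored.
    first_seen.setdefault(key, line)

# Dicts keep insertion order in Python 3.7+, so the original order survives.
result = "\n".join(first_seen.values())
print(result)
```

Unlike the dictionary of lists, this keeps only one line per key in memory, which may matter for a huge file.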

1

You will have to use capturing groups like this.

Regex: ((CAV-\d+\s)[AGCT]+)(?:\n\2[AGCT]+)*

Explanation:

  1. ((CAV-\d+\s)[AGCT]+) matches one line of your pattern and captures it as the 1st group. The sub-match CAV-\d+\s is captured in the 2nd capturing group.

  2. (?:\n\2[AGCT]+)* matches zero or more immediately following lines that start with the same CAV-\d+\s prefix, so the duplicates must be adjacent.

  3. Finally, replace the whole match with the 1st captured group, i.e. the first line of the run.
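The three steps above can be tried on an in-memory string before touching any file. This is only a sketch: the variable names and the second sample (`scattered`) are made up for illustration. It also shows the limitation implied by step 2: duplicates that are not on adjacent lines are left alone.

```python
import re

# Same pattern as above, compiled once.
pattern = re.compile(r'((CAV-\d+\s)[AGCT]+)(?:\n\2[AGCT]+)*')

# Duplicates on adjacent lines: the run is collapsed to its first line.
adjacent = (
    "CAV-1 ATCTACTTCTATCG\n"
    "CAV-2 GCGCGTAGCTAGCT\n"
    "CAV-2 AAGCGCTCGTAAAA\n"
    "CAV-3 AAATATATATATCC"
)
collapsed = pattern.sub(r'\1', adjacent)
print(collapsed)

# Duplicates separated by another line (illustrative data): nothing is removed,
# because the backreference \2 must match on the very next line.
scattered = (
    "CAV-2 GCGCGTAGCTAGCT\n"
    "CAV-1 ATCTACTTCTATCG\n"
    "CAV-2 AAGCGCTCGTAAAA"
)
unchanged = pattern.sub(r'\1', scattered)
print(unchanged == scattered)
```

If the file is not sorted by identifier, a dictionary or set based approach is likely the safer choice.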

Regex101 Demo

Python Code ( tested in Python 3.5.2 )

import re

# Open file having genetic code. Use your file path.
new1 = 'C:\\Users\\acer\\Desktop\\new1.txt'

# Create a new file for replaced data. Use your file path.
new2 = 'C:\\Users\\acer\\Desktop\\new2.txt'

# Read the whole original file into a single string.
with open(new1, 'r') as fp1:
    lines = fp1.read()

# Regex substitution on the whole text. Replaces each run of adjacent
# duplicate lines with its first line.
lines = re.sub(r'((CAV-\d+\s)[AGCT]+)(?:\n\2[AGCT]+)*', r'\1', lines)

# Write the replaced data to the new file. The with blocks close both files.
with open(new2, 'w') as fp2:
    fp2.write(lines)
Rahul
  • Thank you @Rahul. Could you perhaps explain how you would incorporate that regex in the context of my question?. Thanks very much – Lucas Sep 26 '17 at 15:38
1

As other users have pointed out, regex is not the best technique for this problem. You can use a dictionary, then remove duplicates:

from collections import defaultdict

d = defaultdict(list)  # maps each name to all of its sequences, in input order
s = ["CAV-1 ATCTACTTCTATCG", "CAV-2 GCGCGTAGCTAGCT", "CAV-2 AAGCGCTCGTAAAA", "CAV-3 AAATATATATATCC"]
for name, sequence in [i.split() for i in s]:
    d[name].append(sequence)
# Keep only the first sequence recorded for each name.
final_output = [' '.join([a, b[0]]) for a, b in d.items()]

Output:

['CAV-1 ATCTACTTCTATCG', 'CAV-2 GCGCGTAGCTAGCT', 'CAV-3 AAATATATATATCC']
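Since the question mentions a huge file, here is a hedged sketch of a streaming variant that remembers only the names already seen instead of every sequence. The function name `first_occurrences` is hypothetical, `io.StringIO` stands in for a real file object, and skipping blank lines is an assumption rather than something the question specifies.

```python
import io


def first_occurrences(lines):
    """Yield each line whose first whitespace-separated field is new."""
    seen = set()
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # assumption: blank lines are dropped
        key = fields[0]
        if key not in seen:
            seen.add(key)
            yield line


# StringIO stands in for an open file; iterating it yields lines one at a time.
sample = io.StringIO(
    "CAV-1 ATCTACTTCTATCG\n"
    "CAV-2 GCGCGTAGCTAGCT\n"
    "CAV-2 AAGCGCTCGTAAAA\n"
    "CAV-3 AAATATATATATCC\n"
)
deduplicated = list(first_occurrences(sample))
print("".join(deduplicated), end="")
```

With real files, the same generator could be fed an open input file and its output passed to `writelines` on an output file, so only one line plus the set of names is held in memory at a time.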
Ajax1234