
I have a string and want to extract a substring consisting only of the characters A, T, C, G and \n. These characters can appear in the substring in any order and any number, with no specific pattern. There is also no constant delimiter before or after the substring that I could use. Examples of a full string, with the substring I would like to extract in bold:

  • AC068547.7 Homo sapiens BAC clone RP11-458J7 from 2, complete sequence **GAATTCAACTTTCTAGACCAATGATTTTTGGACTAATGATGTTTGGAGGGCCCAACAACCCAGAAAGTTGAATTCCAGTC\nTCCTTTAGTGAAAATAAA\n**

  • AC1284347.7 Homo sapiens XXX clone RP11-1238J7 from 3,CDS**TAGGGCTGAGATCGGCGTAAG\nGAGATCGGAGAGCTGAAT**

Wiktor Stribiżew
Nastya
  • What did you already try? What concrete issues are you facing? Thanks for considering [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) and [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). – Christian Baumann Sep 26 '20 at 10:50
  • Why shouldn't "AC" be matched in "AC0...", "BAC", and "AC1..."? The question is not clear – 41686d6564 stands w. Palestine Sep 26 '20 at 10:51
  • Sorry for not being clear enough, and you are right AC matched the query which is not good enough. I can add a condition that the length of the sub-string must be at least 7 characters – Nastya Sep 26 '20 at 11:02
  • Does this answer your question? [Learning Regular Expressions](https://stackoverflow.com/questions/4736/learning-regular-expressions) – mkrieger1 Sep 26 '20 at 13:35

1 Answer


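You can use a regular expression to find all runs consisting only of the given characters, and then select the longest match with max() and key=len (without key=len, max() compares strings alphabetically rather than by length):

import re

example = 'AC1284347.7 Homo sapiens XXX clone RP11-1238J7 from 3,CDSTAGGGCTGAGATCGGCGTAAG\nGAGATCGGAGAGCTGAAT'
# One or more of A, C, G, T or newline; + (rather than *) avoids empty matches
pattern = '[ACGT\n]+'

# key=len picks the longest match
max(re.findall(pattern, example), key=len)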

'TAGGGCTGAGATCGGCGTAAG\nGAGATCGGAGAGCTGAAT'
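One thing to watch with max() on a list of strings: by default it compares them alphabetically, so the longest match is only guaranteed with key=len. A minimal sketch of the difference, using a made-up input string that is not from the question:

```python
import re

# Hypothetical input: a short run starting with 'T' and a longer run of 'A's
text = 'TTT xx AAAAAAAAAA'
matches = re.findall('[ACGT\n]+', text)

by_alphabet = max(matches)         # compares strings character by character
by_length = max(matches, key=len)  # compares strings by their length

print(repr(by_alphabet))
print(repr(by_length))
```

Here the alphabetical comparison picks the short run because 'T' sorts after 'A', while key=len picks the longer one.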

If the string may contain several sequences of interest, you can use a list comprehension to keep only those with at least a certain length (here, 7 characters):

[seq for seq in re.findall(pattern, example) if len(seq) >= 7]

['TAGGGCTGAGATCGGCGTAAG\nGAGATCGGAGAGCTGAAT']
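As a possible alternative, the minimum length could be put into the pattern itself with a `{n,}` quantifier, so that short runs such as "AC" never match in the first place. This is only a sketch: the helper name find_sequences and the parameter min_len are illustrative, and the threshold of 7 is the value suggested in the comments on the question.

```python
import re

def find_sequences(text, min_len=7):
    """Return all runs of A, C, G, T or newline with at least min_len characters."""
    # {min_len,} means "min_len or more repetitions" of the character class
    pattern = re.compile(r'[ACGT\n]{%d,}' % min_len)
    return [m.group() for m in pattern.finditer(text)]

example = ('AC1284347.7 Homo sapiens XXX clone RP11-1238J7 from 3,CDS'
           'TAGGGCTGAGATCGGCGTAAG\nGAGATCGGAGAGCTGAAT')

sequences = find_sequences(example)
print(sequences)
```

For the example string this should give the same single sequence as the list comprehension above.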

Arne