I just need to know HOW to search for TWO strings in a line of my file.

Example: I need the line to include both "protein_coding" and "exon". If it does, I will print certain columns of that line. I know how to print them, but I cannot figure out how to search for TWO strings using a regex. Thank you in advance.

Is this correct? if re.match("protein_coding" & "exon" in line:

dahlia
  • Please see http://stackoverflow.com/questions/24656131/regex-for-existience-of-some-words-whose-order-doesnt-matter/24656216#24656216 – Unihedron Jul 25 '14 at 14:46
  • I am hopeful this question is already asked and has an answer... Duplicate questions are not sign ... – Aditya Jul 25 '14 at 14:48

3 Answers

This regex matches lines that contain both "protein_coding" and "exon", with "protein_coding" appearing first.

^.*?\bprotein_coding\b.*?\bexon\b.*$

>>> import re
>>> data = """protein_coding exon foo bar
... foo
... protein_coding
... """
>>> m = re.findall(r'^.*?\bprotein_coding\b.*?\bexon\b.*$', data, re.M)
>>> for i in m:
...     print(i)
... 
protein_coding exon foo bar
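
Note that this pattern expects "protein_coding" to come before "exon" on the line. The following is only a minimal sketch, with made-up sample lines and a hypothetical `either` variant that is not part of the original answer, showing that behavior and one way to accept both orders by alternating the two orderings:

```python
import re

# Pattern from the answer above: "protein_coding" must come before "exon".
ordered = re.compile(r'^.*?\bprotein_coding\b.*?\bexon\b.*$')

# Hypothetical variant (an assumption, not from the answer) that accepts
# either order by listing both orderings as alternatives.
either = re.compile(
    r'^.*?(?:\bprotein_coding\b.*?\bexon\b|\bexon\b.*?\bprotein_coding\b).*$'
)

# Made-up sample lines, not taken from the asker's file.
forward = "protein_coding exon foo bar"
backward = "exon foo protein_coding bar"

print(bool(ordered.match(forward)))   # True
print(bool(ordered.match(backward)))  # False
print(bool(either.match(forward)))    # True
print(bool(either.match(backward)))   # True
```

The alternation grows quickly as more words are added, so it is probably only practical for two or three targets.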
Avinash Raj

If the test strings do not require a regular expression, recall that you can also use Python's string methods and the in operator:

>>> line='protein_coding other stuff exon more stuff'
>>> "protein_coding" in line and "exon" in line
True

Or, if you want to test an arbitrary number of words, use all and a tuple of target words:

>>> line='protein_coding other stuff exon more stuff'
>>> all(s in line for s in ("protein_coding", "exon", "words"))
False
>>> all(s in line for s in ("protein_coding", "exon", "stuff"))
True

And if the matches do require regexes, and you want the line to satisfy several unrelated patterns, use all with a comprehension:

>>> import re
>>> p1=re.compile(r'\b[a-z]+_coding\b')
>>> p2=re.compile(r'\bexon\b')
>>> li=[p.search(line) for p in [p1, p2]]
>>> li
[<_sre.SRE_Match object at 0x10856d988>, <_sre.SRE_Match object at 0x10856d9f0>]
>>> all(e for e in li)
True 
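
The same idea can be wrapped in a small helper. This is just a sketch: `matches_all` is a hypothetical name, and the test lines are made up:

```python
import re

# Patterns from the answer above, compiled once and reused for every line.
patterns = [re.compile(r'\b[a-z]+_coding\b'), re.compile(r'\bexon\b')]

def matches_all(line, patterns=patterns):
    """Return True only if every pattern is found somewhere in the line."""
    # search() returns None on failure, which all() treats as false.
    return all(p.search(line) for p in patterns)

print(matches_all('protein_coding other stuff exon more stuff'))  # True
print(matches_all('protein_coding other stuff exons'))            # False
```

Passing a generator to all lets it stop at the first pattern that fails, instead of running every search first.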
dawg

Using anchors and lookahead assertions:

>>> re.findall(r'(?m)^(?=.*protein_coding)(?=.*exon).+$', data)

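The inline (?m) modifier enables multi-line mode, so ^ and $ match at the start and end of each line. Each lookahead checks for one substring from the start of the line without consuming any text, so the line matches regardless of the order the substrings are in.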
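
As a quick check of the order independence, here is a small sketch using made-up data in which one line has the two words reversed:

```python
import re

# Same pattern as above; (?m) makes ^ and $ work per line.
pattern = re.compile(r'(?m)^(?=.*protein_coding)(?=.*exon).+$')

# Made-up sample data, one record per line.
data = (
    "protein_coding exon foo bar\n"
    "exon foo protein_coding\n"
    "foo\n"
    "protein_coding\n"
)

print(pattern.findall(data))
# ['protein_coding exon foo bar', 'exon foo protein_coding']
```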

hwnd