153

I'm having trouble getting a Python regex to match text that spans multiple lines. The example text is as follows (\n marks a newline):

some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).

I'd like to capture two things:

  • the some Varying TEXT part
  • all lines of uppercase text that come two lines below it in one capture (I can strip out the newline characters later).

I've tried a few approaches:

re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines

...and many variations thereof, with no luck. The last one seems to match the lines of text one by one, which is not what I want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text. I'd like match.group(1) to be some Varying TEXT and group(2) to be line1+line2+line3+etc. until the empty line is encountered.

If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.

Carolus
Jan
  • Is there something else in the file besides the first line and the uppercase text? I'm not sure why you would use a regex instead of splitting all the text at newline characters and taking the first element as "some_Varying_TEXT". – UncleZeiv Feb 25 '09 at 19:20
  • 2
    yes, regex are the wrong tool for this. –  Feb 25 '09 at 20:25
  • Your sample text doesn't have a leading `>` character. Should it? – MiniQuark Feb 25 '09 at 20:39

7 Answers

158

Try this:

re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.

Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:

re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
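
As a rough illustration of how the first pattern behaves when the layout repeats, here is a minimal sketch that uses finditer() to walk over every record. The sample text and the names text, pattern and records are made up for this example; the pattern itself is the one given above.

```python
import re

# Hypothetical sample data in the layout described in the question.
text = (
    "some Varying TEXT\n"
    "\n"
    "DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n"
    "GATACAACATAGGATACA\n"
    "\n"
    "another Varying TEXT\n"
    "\n"
    "DJASDFHKJFHKSDHF\n"
    "HHASGDFTERYTERE\n"
)

# Same pattern as above: a title line, a blank line, then one or more
# non-empty lines collected into a single group.
pattern = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

records = []
for match in pattern.finditer(text):
    title = match.group(1)
    # Group 2 still contains the newlines between lines; strip them here.
    sequence = match.group(2).replace("\n", "")
    records.append((title, sequence))

for title, sequence in records:
    print(title, "->", sequence)
```

Because the dot does not match a newline, each match stops at the blank line that ends a record, and the next match starts at the following title.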

Alan Moore
  • You may want to replace the second dot in the regex by [A-Z] if you don't want this regular expression to match just about any text file with an empty second line. ;-) – MiniQuark Feb 25 '09 at 20:36
  • My impression is that the target files will conform to a definite (and repeating) pattern of empty vs. non-empty lines, so it shouldn't be necessary to specify [A-Z], but it probably won't hurt, either. – Alan Moore Feb 25 '09 at 21:13
  • This solution worked beautifully. As an aside, I apologize, since I obviously didn't clarify the situation enough (and also for the lateness of this reply). Thanks for your help! – Jan Mar 03 '09 at 22:18
  • At least newer [Python documentation](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) says that platform-specific line endings are abstracted away to `\n` both when reading and writing. So you can pretend that non-Unix platforms don't exist. – Carolus Mar 14 '23 at 19:26
31

This will work:

>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...   title, sequence = match.groups()
...   title = title.strip()
...   sequence = rx_blanks.sub("",sequence)
...   print("Title:", title)
...   print("Sequence:", sequence)
...   print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)

  • The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
  • Then (.+?)\n\n means "match as few characters as possible (any character except a newline) until you reach two newlines". The result (without the newlines) is put in the first group.
  • [A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
  • ((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
  • You could add a final \n in the regular expression if you want to enforce a double newline at the end.
  • Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
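
The newline substitution from the last bullet could look roughly like this. It is only a sketch: the names NEWLINE, rx_sequence_any_newline and sample are invented for the example, and the sample uses Windows-style \r\n endings.

```python
import re

# Every \n in the original pattern is replaced by this alternation,
# which accepts \n, \r\n or a lone \r.
NEWLINE = r"(?:\n|\r\n?)"

rx_sequence_any_newline = re.compile(
    r"^(.+?)" + NEWLINE + NEWLINE + r"((?:[A-Z]+" + NEWLINE + r")+)",
    re.MULTILINE,
)

# Hypothetical sample with \r\n line endings.
sample = "Some varying text1\r\n\r\nAAABBB\r\nCCCDDD\r\n\r\n"

match = rx_sequence_any_newline.search(sample)
title = match.group(1)
# Remove every kind of line ending from the captured block.
sequence = re.sub(r"\s+", "", match.group(2))

print(title)
print(sequence)
```

One caveat: in MULTILINE mode, Python's ^ only treats \n as a line boundary, so text that uses bare \r line endings would probably need to be normalized before matching.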
MiniQuark
  • 1
    match() only returns one match, at the very beginning of the target text, but the OP said there would be hundreds of matches per file. I think you would want finditer() instead. – Alan Moore Feb 25 '09 at 21:24
25

The following is a regular expression matching a multiline block of text:

import re
result = re.findall(r'(startText)(.+)((?:\n.+)+)(endText)', text)  # text: the string to search
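
To make the shape of the result concrete, here is a small sketch with made-up input. startText and endText are placeholder markers taken from the pattern above, and the variable name text is an assumption for this example.

```python
import re

# Hypothetical input: the block starts with startText and ends with endText.
text = "startText first line\nsecond line\nthird line endText"

result = re.findall(r"(startText)(.+)((?:\n.+)+)(endText)", text)

# findall returns one tuple per match, with one item per capture group:
# the start marker, the rest of the first line, the following lines
# (newlines included), and the end marker.
for start, first_line, other_lines, end in result:
    print(repr(first_line))
    print(repr(other_lines))
```

Note that, as written, the pattern needs at least one character between the last newline and endText, so it would not match if endText sat alone at the start of its own line.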
Grant Miller
Punnerud
5

If each file only has one sequence of amino acids, I wouldn't use regular expressions at all. Just something like this:

def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest

    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence
MiniQuark
  • Definitely the easiest way if there was only one, and it's also workable with more, if some more logic is added. There are about 885 proteins in this specific dataset though, and I felt that a regex should be able to handle this. – Jan Mar 03 '09 at 22:17
4

find:

^>([^\n\r]+)[\n\r]([A-Z\n\r]+)

\1 = some_varying_text

\2 = lines of all CAPS

Edit (proof that this works):

text = """> some_Varying_TEXT

DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA

> some_Varying_TEXT2

DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""

import re

regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
# NOTE: can be shortened to matches = regex.findall(text)

for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))
Nam G VU
Jason Coon
4

It can sometimes be convenient to specify the flag directly inside the pattern string, as an inline flag:

"(?m)^A complete line$"

This is handy in unit tests with assertRaisesRegex, for example: you don't need to import re or compile your regex before calling the assert.
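
A minimal sketch of what that might look like in a test. The function parse_record and its error message are hypothetical and exist only to give the assertion something multi-line to match against.

```python
import unittest


def parse_record(text):
    """Hypothetical helper that always fails with a multi-line message."""
    raise ValueError("could not parse record\nA complete line\nsee input above")


class InlineFlagTest(unittest.TestCase):
    def test_message_contains_complete_line(self):
        # (?m) switches on MULTILINE inside the pattern itself, so ^ and $
        # can anchor to the middle line of the error message.
        with self.assertRaisesRegex(ValueError, "(?m)^A complete line$"):
            parse_record("anything")
```

Without the (?m) prefix, ^ would only match at the very start of the message, so the same pattern would not find the middle line.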

Eric Duminil
1

My preference.

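lineIter = iter(aFile)
# Skip ahead to the header line, which starts with ">".
for line in lineIter:
    if line.startswith(">"):
        someVaryingText = line
        break
# The line after the header must be blank.
assert len(next(lineIter).strip()) == 0
# Collect sequence lines until the next blank line.
acids = []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append(line)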

At this point you have someVaryingText as a string, and the acids as a list of strings. You can do "".join( acids ) to make a single string.

I find this less frustrating (and more flexible) than multiline regexes.
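
Since the question mentions hundreds of records per file, the same line-by-line idea could be wrapped in a generator. This is only a sketch: iter_records is a hypothetical name, and it assumes the layout from the question (title line, blank line, sequence lines, blank line) without a leading ">" on the title.

```python
def iter_records(lines):
    """Yield (title, sequence) pairs from an iterable of lines.

    Assumes each record is a title line, one blank line, then one or
    more sequence lines, with a blank line between records.
    """
    line_iter = iter(lines)
    for line in line_iter:
        title = line.strip()
        if not title:
            continue  # skip blank lines between records
        # The line right after the title is expected to be blank.
        separator = next(line_iter, "")
        if separator.strip():
            raise ValueError(f"expected a blank line after {title!r}")
        acids = []
        for seq_line in line_iter:
            if not seq_line.strip():
                break  # a blank line ends the current record
            acids.append(seq_line.strip())
        yield title, "".join(acids)


# Hypothetical sample; an open file object can be passed the same way.
sample_lines = [
    "some Varying TEXT\n",
    "\n",
    "AAAA\n",
    "BBBB\n",
    "\n",
    "other TEXT\n",
    "\n",
    "CCCC\n",
]
records = list(iter_records(sample_lines))
```

Passing an open file object instead of a list keeps memory use low, because only one record is held at a time.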

S.Lott