Regular expression matching a multiline block of text

Question

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is as follows (\n is a newline):

some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).

I'd like to capture two things:

the some Varying TEXT part
all lines of uppercase text that come two lines below it in one capture (I can strip out the newline characters later).

I've tried a few approaches:

re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines

...and a lot of variations of these, with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text. I'd like match.group(1) to be some Varying Text and group(2) to be line1+line2+line3+etc. until the empty line is encountered.

If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.

Is there something else in the file besides the first line and the uppercase text? I'm not sure why you would use a regex instead of splitting all the text at newline characters and taking the first element as "some_Varying_TEXT". — UncleZeiv, Feb 25 '09 at 19:20
Your sample text doesn't have a leading `>` character. Should it? — MiniQuark, Feb 25 '09 at 20:39

Alan Moore · Accepted Answer · 2021-01-19T23:36:28.450

158

Try this:

re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

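I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches at the start of the string and at the position immediately following each newline, and $ matches at the end of the string and at the position immediately preceding each newline. Both are zero-width: they never consume the newline itself.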
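To make the point about the anchors concrete, here is a minimal sketch. The sample string and variable names are made up for illustration; only the behaviour of ^, $ and re.MULTILINE comes from the answer above.

```python
import re

text = "first\nsecond\nthird"

# In MULTILINE mode, ^ and $ match at every line boundary...
lines = re.findall(r"^\w+$", text, re.MULTILINE)
print(lines)

# ...but they are zero-width, so "$^" cannot step over the "\n" between lines.
no_match = re.search(r"first$^second", text, re.MULTILINE)
print(no_match)

# Matching the newline explicitly is what lets a pattern cross lines.
crossing = re.search(r"first$\n^second", text, re.MULTILINE)
print(repr(crossing.group()))
```

The second search fails because nothing in the pattern consumes the linefeed, which is the same reason the patterns in the question never get past the first line.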

Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:

re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)

BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
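As a rough illustration of both points, the sketch below runs the first regex over a small made-up sample with finditer(), then repeats the search with DOTALL added. The sample text and variable names are assumptions for the demo, not part of the original data.

```python
import re

# Made-up sample in the layout described in the question.
text = (
    "Title one\n"
    "\n"
    "AAABBB\n"
    "CCCDDD\n"
    "\n"
    "Title two\n"
    "\n"
    "EEEFFF\n"
)

pattern = r"^(.+)\n((?:\n.+)+)"

# Without DOTALL: one match per record; strip the newlines from group 2.
records = [
    (m.group(1), m.group(2).replace("\n", ""))
    for m in re.finditer(pattern, text, re.MULTILINE)
]
print(records)

# With DOTALL the dot also matches newlines, so the greedy (.+) runs
# across record boundaries and the records are no longer separated.
dotall_matches = re.findall(pattern, text, re.MULTILINE | re.DOTALL)
print(len(dotall_matches))
```

With DOTALL there is only a single match here, and its first group swallows everything up to the last title.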

edited Jan 19 '21 at 23:36

answered Feb 25 '09 at 20:06

Alan Moore


You may want to replace the second dot in the regex by [A-Z] if you don't want this regular expression to match just about any text file with an empty second line. ;-) – MiniQuark Feb 25 '09 at 20:36
My impression is that the target files will conform to a definite (and repeating) pattern of empty vs. non-empty lines, so it shouldn't be necessary to specify [A-Z], but it probably won't hurt, either. – Alan Moore Feb 25 '09 at 21:13
This solution worked beautifully. As an aside, I apologize, since I obviously didn't clarify the situation enough (and also for the lateness of this reply). Thanks for your help! – Jan Mar 03 '09 at 22:18
At least newer [Python documentation](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) says that platform-specific line endings are abstracted away to `\n` both when reading and writing. So you can pretend that non-Unix platforms don't exist. – Carolus Mar 14 '23 at 19:26
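A small sketch of what that last comment describes, assuming Python 3 and a file opened in text mode. The file name and contents are invented for the demo. Note that the translation only happens with the default newline setting; text that arrives some other way (decoded bytes, a socket) is not normalised for you.

```python
import os
import tempfile

# Bytes with all three newline styles mixed together.
raw = b"Title\r\n\r\nAAAB\rCCCD\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Default text mode (newline=None): \r, \n and \r\n are all read as \n.
    with open(path, "r") as f:
        translated = f.read()

    # newline="" turns the translation off and keeps the original endings.
    with open(path, "r", newline="") as f:
        untranslated = f.read()

print(repr(translated))
print(repr(untranslated))
```

So if the text comes from open() with default arguments, the simpler \n-only regex should be enough.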

MiniQuark · Answer 2 · 2009-02-26T11:03:39.303

This will work:

>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...   title, sequence = match.groups()
...   title = title.strip()
...   sequence = rx_blanks.sub("",sequence)
...   print("Title:", title)
...   print("Sequence:", sequence)
...   print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)

The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (any character except a newline) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
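The last suggestion can be sketched as follows. The substitution is done with a plain str.replace() on the pattern source, and the sample text with \r\n endings is made up. One assumption worth stating: Python's MULTILINE ^ only recognises \n as a line boundary, so this covers \n and \r\n files, while a file using bare \r would only match at the very start of the string.

```python
import re

NEWLINE = r"(?:\n|\r\n?)"
base = r"^(.+?)\n\n((?:[A-Z]+\n)+)"

# Replace every literal "\n" in the pattern source with the tolerant group.
tolerant = base.replace(r"\n", NEWLINE)
rx_tolerant = re.compile(tolerant, re.MULTILINE)

# Made-up sample using Windows-style line endings.
text = "Some varying text1\r\n\r\nAAABBB\r\nCCCDDD\r\n\r\n"

m = rx_tolerant.search(text)
title = m.group(1)
sequence = re.sub(r"\s+", "", m.group(2))  # drop \r and \n from the sequence
print(title, sequence)
```

If the whole file fits in memory, normalising the line endings once before matching is probably simpler than widening the pattern.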

match() only returns one match, at the very beginning of the target text, but the OP said there would be hundreds of matches per file. I think you would want finditer() instead. — Alan Moore, Feb 25 '09 at 21:24

score 25 · Answer 3 · edited Sep 15 '18 at 22:25


The following is a regular expression matching a multiline block of text:

import re
result = re.findall(r'(startText)(.+)((?:\n.+)+)(endText)', text)  # text holds the string to search
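For anyone unsure what findall() hands back here, a short sketch with made-up input. startText and endText are the placeholder markers from the snippet above, not real keywords. Each match is a tuple with one entry per capture group, and a blank line between the markers stops the match because every \n.+ step needs at least one character on the line.

```python
import re

pattern = r"(startText)(.+)((?:\n.+)+)(endText)"

# Markers on different lines, no blank lines in between.
text = "startText first line\nsecond line\nthird line endText\n"
matches = re.findall(pattern, text)
print(matches)

# Same idea, but with an empty line inside the block.
with_blank = "startText first line\n\nthird line endText\n"
blank_matches = re.findall(pattern, with_blank)
print(blank_matches)
```

The first call returns a single four-element tuple; the second returns an empty list.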

edited Sep 15 '18 at 22:25

Grant Miller


answered Sep 15 '18 at 18:57

Punnerud



This is the best, most direct answer, IMHO. – pauljohn32 Mar 25 '21 at 15:43

this is a great answer- you may have to modify if you need to span multiple linebreaks in a row `\n\n` – grantr Mar 12 '22 at 16:33

score 5 · Answer 4 · answered Feb 25 '09 at 20:59

If each file only has one sequence of amino acids, I wouldn't use regular expressions at all. Just something like this:

def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest

    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence

Definitely the easiest way if there was only one, and it's also workable with more, if some more logic is added. There are about 885 proteins in this specific dataset though, and I felt that a regex should be able to handle this. — Jan, Mar 03 '09 at 22:17
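For what it's worth, the "more logic" for many records could look roughly like this. It is only a sketch: parse_records is a hypothetical helper, and it assumes the exact layout from the question (title, one blank line, sequence lines, one blank line) with \n line endings.

```python
def parse_records(text):
    """Split text into (title, sequence) pairs without regular expressions."""
    # Blank lines separate the blocks; blocks alternate title, sequence.
    blocks = [block for block in text.strip().split("\n\n") if block.strip()]
    titles = blocks[0::2]
    sequences = blocks[1::2]
    # "".join(s.split()) removes the newlines inside each sequence block.
    return [(t.strip(), "".join(s.split())) for t, s in zip(titles, sequences)]


sample = "Title one\n\nAAA\nBBB\n\nTitle two\n\nCCC\n"
print(parse_records(sample))
```

If the file ever has two blank lines in a row, or a title without a sequence, the pairing goes out of step, which is where the regex versions are more forgiving.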

score 4 · Answer 5 · edited Oct 08 '22 at 08:20


find:

^>([^\n\r]+)[\n\r]([A-Z\n\r]+)

\1 = some_varying_text

\2 = lines of all CAPS

Edit (proof that this works):

text = """> some_Varying_TEXT

DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA

> some_Varying_TEXT2

DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""

import re

regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
# NOTE: this can be shortened to matches = regex.findall(text)

for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))

edited Oct 08 '22 at 08:20

Nam G VU


answered Feb 25 '09 at 19:11

Jason Coon


Unfortunately, this regular expression will also match groups of capital letters separated by empty lines. It might not be a big deal though. – MiniQuark Feb 25 '09 at 20:27
Looks like coonj likes FASTA files. ;) – Andrew Dalke Feb 26 '09 at 13:21

score 4 · Answer 6 · answered May 31 '22 at 13:37

It can sometimes be convenient to specify the flag directly inside the pattern string, as an inline flag:

"(?m)^A complete line$".

This is handy in unit tests with assertRaisesRegex, for example: you don't need to import re or compile your regex before calling the assert.
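A minimal sketch of that unit-test use, assuming Python 3's unittest. The function parse_config and its error message are hypothetical, invented only to have something that raises a multi-line message.

```python
import unittest


def parse_config(text):
    """Hypothetical function under test; it always fails with a multi-line message."""
    raise ValueError(
        "config.ini: 2 problems\nline 3: unknown key\nline 7: missing value"
    )


class ParseConfigTest(unittest.TestCase):
    def test_reports_unknown_key(self):
        # (?m) switches on MULTILINE, so ^ and $ anchor to a single line
        # of the exception message rather than to the whole string.
        with self.assertRaisesRegex(ValueError, "(?m)^line 3: unknown key$"):
            parse_config("ignored")
```

Run it with python -m unittest. Without the (?m) prefix the same pattern would not match, because the line sits in the middle of the message.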

score 1 · Answer 7 · answered Feb 25 '09 at 20:58

My preference.

lineIter = iter(aFile)  # aFile is an open text file
for line in lineIter:
    if line.startswith(">"):
        someVaryingText = line
        break
# the line right after the header must be blank
assert len(next(lineIter).strip()) == 0
acids = []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append(line)

At this point you have someVaryingText as a string, and the acids as a list of strings. You can do "".join( acids ) to make a single string.

I find this less frustrating (and more flexible) than multiline regexes.
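Since the file in the question holds hundreds of records, the same line-by-line idea can be stretched into a generator. This is a sketch under assumptions, not the code above: iter_records is a hypothetical name, headers are assumed to start with ">", and blank lines are simply skipped instead of being asserted on.

```python
def iter_records(lines):
    """Yield (title, sequence) for every '>' header found in lines."""
    title = None
    acids = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(acids)
            title = stripped[1:].strip()
            acids = []
        elif stripped and title is not None:
            acids.append(stripped)
    # Emit the last record once the input runs out.
    if title is not None:
        yield title, "".join(acids)


sample_lines = [
    "> first\n", "\n", "AAAB\n", "CCCD\n", "\n",
    "> second\n", "\n", "EEEF\n",
]
print(list(iter_records(sample_lines)))
```

It accepts any iterable of lines, so an open file object can be passed in directly.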
