python regex find match that spans multiple lines

Question

I am trying to extract a field from a BibTeX entry using regex in Python. Here is part of my string:

a = '''title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},'''

I want to grab the string for the title, which is:

The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts

I currently have this code:

import re

pattern = re.compile(r'title\s*=\s*{(.*|\n?)},\s*\n', re.DOTALL|re.I)
pattern.findall(a)

But it gives me the title plus part of the following `year` line:

['The Origin ({S},\n            {Se}, and {Te})- {TiO$_2$} Photocatalysts},\n       year = {2010']

How can I get the whole title string without the year information? The `year` field often does not come right after `title`, so I cannot use:

pattern = re.compile(r'title\s*=\s*{(.*|\n?)},\s*\n.*year', re.DOTALL|re.I)
pattern.findall(a)

Possible duplicate of http://stackoverflow.com/questions/587345/python-regular-expression-matching-a-multiline-block-of-text — Dinesh Pundkar, Aug 19 '16 at 16:31

score 1 · Answer 1 · answered Aug 19 '16 at 16:31


A quick solution would be to modify your regex pattern:

pattern = re.compile('title\s*=\s*{(.*|\n?)},\s*\n', re.DOTALL|re.I)


mic4ael


I just found out this is wrong. It will grab the `year` line as well – Jianli Cheng Aug 19 '16 at 17:40
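To see why the pattern over-matches, here is a small check against the string from the question. The lazy variant `(.*?)` is my own addition for comparison, not something the answer above proposed. With `re.DOTALL`, the greedy `.*` runs to the end of the input and backtracks to the last `},` that is followed by a newline, which sits on the `year` line. The lazy version stops at the first such `},`, which here is inside the title itself, so neither quantifier is enough on its own.

```python
import re

a = '''title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},'''

# Greedy body: backtracks from the end to the LAST "}," followed by a newline.
greedy = re.compile(r'title\s*=\s*\{(.*)\},\s*\n', re.DOTALL | re.I)
# Lazy body (illustrative variant): stops at the FIRST "}," followed by a newline.
lazy = re.compile(r'title\s*=\s*\{(.*?)\},\s*\n', re.DOTALL | re.I)

greedy_title = greedy.search(a).group(1)
lazy_title = lazy.search(a).group(1)

print(repr(greedy_title))  # runs on into the year line
print(repr(lazy_title))    # cut off after the first brace group
```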

score 1 · Answer 2 · answered Aug 19 '16 at 16:33


It depends on how general you want your regex to be. I guess your title can itself contain { and }, so using a closing brace to mark the end of the pattern will cause issues. There can also be several brace groups in one title.

Here's an idea: look for the word year right after the title in the regex, assuming it always follows.

pattern = re.compile(r'title\s*=\s*\{(.*?)\},\s*\n\s*year', re.DOTALL|re.I)
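If `year` is not always the next field, the same idea might be generalized by looking ahead for any line that starts a new `key =` field, or for the end of the string. This is only a sketch under assumptions: `extract_title` and `TITLE_RE` are names made up for illustration, the lookahead assumes each field starts on its own line, and it would be fooled by a title that contains a line beginning with something like `word =` right after a `},`.

```python
import re

# Hypothetical pattern: lazy body, then "}" and an optional comma, then a
# lookahead for the next "key =" at the start of a line or the end of input.
# \b keeps "title" from matching inside "booktitle".
TITLE_RE = re.compile(
    r'\btitle\s*=\s*\{(.*?)\}\s*,?\s*(?=^\s*\w+\s*=|\Z)',
    re.DOTALL | re.IGNORECASE | re.MULTILINE,
)

def extract_title(text):
    """Return the raw title body, or None if no title field is found."""
    match = TITLE_RE.search(text)
    return match.group(1) if match else None

a = '''title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},'''

# Title followed by a field other than year.
b = '''author = {X},
   title = {A {B} C},
   journal = {J},
   year = {2010},'''

print(extract_title(a))
print(extract_title(b))
```

The captured text keeps the original line break and indentation; `" ".join(title.split())` would collapse it to a single line if that is what you need.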


Saad Khan


Many times the `year` is not after `title`. But you still give me a new idea of doing this :) – Jianli Cheng Aug 19 '16 at 16:37

score 1 · Answer 3 · answered Aug 19 '16 at 18:15

Use the newer regex module (a third-party package, installed with pip install regex), which supports named subroutine calls:

import regex as re

rx = re.compile(r'''
        (?(DEFINE)
            (?<part>\w+\ =\ \{)
            (?<end>\},)
            (?<title>title\ =\ \{)
        )
        (?&title)(?P<t>(?:(?!(?&part))[\s\S])+)(?&end)
    ''', re.VERBOSE)

string = '''
title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},
'''

title = rx.search(string).group('t')
print(title)
# The Origin ({S},
#    {Se}, and {Te})- {TiO$_2$} Photocatalysts

Though it is not really needed, it provides an alternative solution.
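Another alternative, using only the standard library, is to let a regex find the start of the field and then count braces to find the matching close. This is a minimal sketch, not a BibTeX parser: `read_braced_field` is a hypothetical helper name, and it assumes the value is brace-delimited and that there are no escaped braces such as `\{` in the value.

```python
import re

def read_braced_field(text, field="title"):
    """Return the brace-delimited value of `field`, or None if not found."""
    start = re.search(r'\b' + re.escape(field) + r'\s*=\s*\{', text, re.I)
    if start is None:
        return None
    depth = 1  # we are just past the opening brace of the value
    for i in range(start.end(), len(text)):
        ch = text[i]
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                # Matching close found: return everything in between.
                return text[start.end():i]
    return None  # braces never balanced

a = '''title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},'''

print(read_braced_field(a, "title"))
print(read_braced_field(a, "year"))
```

Because it tracks nesting depth, this does not depend on which field follows the title or on where the line breaks fall.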

score 0 · Answer 4 · answered Aug 19 '16 at 16:38


textwrap.dedent can be useful for stripping the common indentation first:

import textwrap

a = '''title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},'''

indent = "   "
print(textwrap.dedent(indent + a))


Laurent LAPORTE

