I am trying to extract the value of a field from a BibTeX entry using a regex in Python. Here is part of my string:

a = '''title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},'''

I want to grab the string for the title, which is:

The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts

I currently have this code:

import re

pattern = re.compile(r'title\s*=\s*{(.*|\n?)},\s*\n', re.DOTALL|re.I)
pattern.findall(a)

But it captures too much and gives me:

['The Origin ({S},\n            {Se}, and {Te})- {TiO$_2$} Photocatalysts},\n       year = {2010']

How can I get the whole title string without the year information? The year field often does not come right after the title, so I cannot use:

pattern = re.compile(r'title\s*=\s*{(.*|\n?)},\s*\n.*year', re.DOTALL|re.I)
pattern.findall(a)
Jianli Cheng

4 Answers

A quick solution would be to modify your regex pattern so that the capture is non-greedy and the match ends only where the next field begins:

pattern = re.compile(r'title\s*=\s*{(.*?)},\s*\n(?=\s*\w+\s*=)', re.DOTALL|re.I)
mic4ael

It depends on how general you want your regex to be. I assume the title itself may contain { and }, so using a closing brace on its own to mark the end of the match will cause issues. The braces may also be nested.

Here's an idea: anchor the end of the regex on the word year, assuming it always comes right after the title.

pattern = re.compile(r'title\s*=\s*{(.*?)},\s*\n\s*year', re.DOTALL|re.I)
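
Since the title may contain nested braces, another option is to count braces instead of anchoring on the following field. The snippet below is only a minimal sketch and not part of the answer above: `extract_field` is a hypothetical helper name, and it assumes the braces inside the value are balanced and not backslash-escaped.

```python
import re


def extract_field(entry, name):
    """Return the brace-delimited value of field `name`, or None if not found."""
    # Locate "name = {" case-insensitively; the match ends just past the opening brace.
    start = re.search(r'\b' + re.escape(name) + r'\s*=\s*\{', entry, re.I)
    if start is None:
        return None
    depth = 1  # we are already inside the field's opening brace
    for i in range(start.end(), len(entry)):
        ch = entry[i]
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                # This closing brace matches the field's opening brace.
                return entry[start.end():i]
    return None  # braces were unbalanced


a = '''title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},'''

print(extract_field(a, 'title'))
print(extract_field(a, 'year'))
```

This does not depend on which field follows the title, at the cost of a short loop instead of a single pattern.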
Saad Khan

Use the newer third-party regex module (installable with pip install regex):

import regex as re

rx = re.compile(r'''
        (?(DEFINE)
            (?<part>\w+\ =\ \{)
            (?<end>\},)
            (?<title>title\ =\ \{)
        )
        (?&title)(?P<t>(?:(?!(?&part))[\s\S])+)(?&end)
    ''', re.VERBOSE)

string = '''
title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},
'''

title = rx.search(string).group('t')
print(title)
# The Origin ({S},
#    {Se}, and {Te})- {TiO$_2$} Photocatalysts

Though the extra module is not really needed here, it provides an alternative solution.
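
The same idea, consuming characters until something that looks like the start of the next field, can also be sketched with the standard library `re` module by inlining the subpatterns. This is an illustrative variant, not the code above. It assumes fields look like `name = {` (it allows any whitespace around `=`) and that no such sequence appears inside the title itself.

```python
import re

rx = re.compile(r'''
    title\s*=\s*\{                # opening of the title field
    (?P<t>
        (?:(?!\w+\s*=\s*\{).)+    # any character that does not start a new "name = {" field
    )
    \},                           # closing brace and comma of the title
''', re.VERBOSE | re.DOTALL | re.IGNORECASE)

string = '''
title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},
'''

title = rx.search(string).group('t')
print(title)
```

The greedy group runs up to the next field name and then backtracks to the last `},` before it, which is the end of the title.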

Jan

textwrap can be useful for normalizing the indentation:

import textwrap

a = '''title = {The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts},
   year = {2010},
   volume = {114},'''

indent = "   "
print(textwrap.dedent(indent + a))
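
Note that `textwrap.dedent` only strips the leading whitespace common to all lines, so the continuation line of the title keeps its extra indentation. If a single-line title is wanted once it has been extracted, one minimal sketch is to collapse every run of whitespace into a single space. This assumes line breaks inside the title carry no meaning, and the variable names are illustrative.

```python
# The title as extracted from the entry, including the line break and indentation.
title = '''The Origin ({S},
        {Se}, and {Te})- {TiO$_2$} Photocatalysts'''

# str.split() with no arguments splits on any run of whitespace, newlines included.
one_line = " ".join(title.split())
print(one_line)
# The Origin ({S}, {Se}, and {Te})- {TiO$_2$} Photocatalysts
```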
Laurent LAPORTE