I am trying to extract quotes and their respective authors from a long text.

Example : Paul […] Jane says G_quoted text_R

How can I get Jane and her quoted text in two groups, but not Paul etc.?

I tried a positive lookahead like this, but I get all names, not just Jane. Many thanks for your help.

i?(Paul|Jane|Robert|John)(?=[^.]*?G_(.*)_R)

https://regex101.com/r/mx0JgV/1
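
The behaviour described above can be reproduced with a minimal sketch, assuming Python's `re` module (the engine mentioned in the comments) and assuming `re.IGNORECASE` stands in for the leading `i?`. The names `ATTEMPT` and `sample` are made up for illustration. The lookahead succeeds for every listed name that has a `G_..._R` quote somewhere after it, so Paul matches as well as Jane:

```python
import re

# The pattern from the question, minus the leading "i?" (assumed to be a
# case-insensitivity flag, expressed here as re.IGNORECASE).
ATTEMPT = re.compile(r"(Paul|Jane|Robert|John)(?=[^.]*?G_(.*)_R)", re.IGNORECASE)

sample = "Paul […] Jane says G_quoted text_R"

# Every name followed (at any distance) by a quote satisfies the lookahead.
matches = ATTEMPT.findall(sample)
print(matches)
# [('Paul', 'quoted text'), ('Jane', 'quoted text')]
```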

ivan_pozdeev
  • Why lookahead? Are you required to only consume text up to "Jane" and no further, or "Jane" must be the match of the entire regex and not of a group, or some other weird requirement? – ivan_pozdeev May 30 '17 at 15:29
  • I don't quite understand... If you just need "Jane", why do you add "Paul" and the other names? And why is your quoted text enclosed by "G_" and "_R" rather than by (")? – Sraw May 30 '17 at 15:31
  • I want to get all quotes from the listed authors. In this example it is Jane, but it will be Paul, Robert etc. in other parts of the text. "G_" and "_R" were initially HTML tags, but I converted them to text. – user3259111 May 30 '17 at 15:39
  • @ivan_pozdeev : I am not sure I understand your question. I need to get all quotes and the names of their authors. The author is always the name closest to the quote. Thanks. – user3259111 May 30 '17 at 15:45
  • Interesting. Lookbehind can't be used because Python's engine, like PCRE, [requires it to be of fixed width](https://stackoverflow.com/questions/3796436/whats-the-technical-reason-for-lookbehind-assertion-must-be-fixed-length-in-r). – ivan_pozdeev May 30 '17 at 16:53

1 Answer

What's wrong with:

import re

QUOTE_FINDER = re.compile(r"(paul|jane|robert|john).*?G_(.*?)_R", re.IGNORECASE | re.DOTALL)

data = """dfdsf Jane […] Paul […] Jane says G_quoted text_R
and Paul says G_some other text_R while Robert prefers to say G_nothing_R..."""

quotes = QUOTE_FINDER.findall(data)
# [('Jane', 'quoted text'), ('Paul', 'some other text'), ('Robert', 'nothing')]
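
One caveat: the pattern above anchors on the first listed name it finds before a quote, so for the question's own example, `Paul […] Jane says G_quoted text_R`, it would return Paul rather than Jane. The following is a hedged sketch of a variant that keeps only the name closest to the quote, by refusing to step over another listed name between the author and `G_`. The names `NAMES`, `CLOSEST_QUOTE` and `sample` are made up for illustration, and the sketch assumes author names appear as whole words:

```python
import re

# Alternation of the authors to look for (same list as above).
NAMES = r"paul|jane|robert|john"

# After the captured name, consume characters one at a time, but only while
# no other listed name starts at the current position. A name that is
# followed by another name before the quote therefore cannot match.
CLOSEST_QUOTE = re.compile(
    rf"\b({NAMES})\b(?:(?!\b(?:{NAMES})\b).)*?G_(.*?)_R",
    re.IGNORECASE | re.DOTALL,
)

sample = "Paul […] Jane says G_quoted text_R"

closest = CLOSEST_QUOTE.findall(sample)
print(closest)
# [('Jane', 'quoted text')]
```

The negative lookahead is checked before every character, so this is slower than the plain lazy match; for very long texts it may be worth benchmarking against the simpler pattern.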
zwer