Unable to fetch multiline line text with Regex

Question

I have extracted the text of a PDF with Tika. The text is in the following format, with multiple line breaks:

Some non Interview text

interview with Mr.XYZ



Question: How are you?
Answer: I am fine.

Question: What do you do?
Answer: Nothing



Some non Interview text

How do I apply a regex to this? I can match words and spaces, but the match does not extend across multiple lines. I tried the following regex:

https://regex101.com/r/sekUyT/1

All I want is the interview-related text, which starts with `interview with` and is considered to end when the text contains no more `Question:` and `Answer:` lines.

It works fine: `re.search("interview with \s?\w+.\w+", text).group()` — sushanth, Jul 16 '20 at 08:36
I'm not sure I understand. Do you want to match everything starting from "interview with" and until the end? Something like `interview with[\s\S]*` should do the job. — 41686d6564 stands w. Palestine, Jul 16 '20 at 08:42
@Sushanth I just tried the python code generated and it returned nothing — Volatil3, Jul 16 '20 at 08:44
@AhmedAbdelhameed not the end; the interview is considered to _end_ when there is no more text like _question_ and _answer_. And your regex is not working, I tried it on the Regex101 site — Volatil3, Jul 16 '20 at 08:44
So, what did you try to make your regex stop at either `question` or `answer`? See [Regex Match all characters between two strings](https://stackoverflow.com/questions/6109882/) — Wiktor Stribiżew, Jul 16 '20 at 08:45
@WiktorStribiżew Right now I am stuck because it is not fetching the multiline text after _interview with_. So far I could not figure out how to stop once it no longer finds `question` or `answer`. — Volatil3, Jul 16 '20 at 08:47
I'm still not sure what exactly you're trying to achieve but maybe try something like `interview with.+(?:\s+Question:.+\s+Answer:.+)*`. Demo: https://regex101.com/r/jNVigs/1 — 41686d6564 stands w. Palestine, Jul 16 '20 at 08:49
@AhmedAbdelhameed It is mentioned that I want to fetch all text that starts with _Interview with_ and ends when the word "Answer" no longer appears in it. — Volatil3, Jul 16 '20 at 08:51
@AhmedAbdelhameed the Regex you replied is the one I was looking for. — Volatil3, Jul 16 '20 at 08:52
@AhmedAbdelhameed can you tell me what I was doing wrong? What does `?:` mean? — Volatil3, Jul 16 '20 at 08:57
@Volatil3 It's called a non-capturing group (see [this post](https://stackoverflow.com/q/3512471/8967612)). You might also want to spend some time in the [Regex reference](https://stackoverflow.com/q/22937618/8967612) because the pattern that you used was very unrelated to the requirements that you (kind of) described and that's why people were confused. You'll find a lot of resources there. Good luck :) — 41686d6564 stands w. Palestine, Jul 16 '20 at 09:01
@AhmedAbdelhameed Thanks and JazakAllah, but what do we use for line breaks? The PDF could be messy and there could be unnecessary line breaks after _Mr.XYZ_. How can I make sure it covers both the line-break and the no-line-break case, since there could also be a line break after _interview with_? — Volatil3, Jul 16 '20 at 09:10

score 0 · Answer 1 · answered Jul 16 '20 at 08:50

Use the re.findall function to get all occurrences of a particular text.

match = re.findall(r'interview with \s*?\w+.\w+', text)

match is a list of occurrences of the matched text. If you only want the names, use r'interview with \s*?(\w+.\w+)' as the search string.
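
The pattern above only matches the `interview with ...` line itself. To capture the whole interview block, the pattern suggested in the comments can be tried. The following is a minimal sketch, assuming the sample text from the question. The names `text`, `messy`, `pattern` and `loose` are illustrative only, and the `\s+` variant is an assumption about how a line break right after `interview with` might be tolerated. It is not something confirmed in the thread.

```python
import re

# Sample text in the shape shown in the question.
text = """Some non Interview text

interview with Mr.XYZ



Question: How are you?
Answer: I am fine.

Question: What do you do?
Answer: Nothing



Some non Interview text
"""

# Pattern from the comments: the header line, then zero or more
# Question/Answer pairs. "\s+" also matches newlines, so blank lines
# between the pairs are skipped; "." does not match a newline, so each
# ".+" stops at the end of its own line.
pattern = re.compile(r"interview with.+(?:\s+Question:.+\s+Answer:.+)*")

m = pattern.search(text)
interview = m.group() if m else None
print(interview)

# Assumed variant: allow any whitespace (including a line break)
# between "interview with" and the name that follows it.
loose = re.compile(r"interview with\s+.+(?:\s+Question:.+\s+Answer:.+)*")

messy = "interview with\nMr.XYZ\n\nQuestion: How are you?\nAnswer: I am fine.\n\nfooter"
m2 = loose.search(messy)
print(m2.group() if m2 else None)
```

The `(?:...)` group is non-capturing, so `.group()` returns the full matched text. Matching stops at the last `Answer:` line because the next repetition would need another `Question:` and does not find one.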
