
I have parsed a PDF with Tika. The extracted text has the following format, with multiple line breaks:

Some non Interview text

interview with Mr.XYZ



Question: How are you?
Answer: I am fine.

Question: What do you do?
Answer: Nothing



Some non Interview text

How do I apply a regex to this? I can match words and spaces, but the match does not extend across multiple lines. I tried the following regex:

https://regex101.com/r/sekUyT/1

All I want is the interview-related text: it starts with "interview with" and is considered finished when no more Question: and Answer: lines follow.

Volatil3
  • It works fine: `re.search("interview with \s?\w+.\w+", text).group()` – sushanth Jul 16 '20 at 08:36
  • I'm not sure I understand. Do you want to match everything starting from "interview with" and until the end? Something like `interview with[\s\S]*` should do the job. – 41686d6564 stands w. Palestine Jul 16 '20 at 08:42
  • @Sushanth I just tried the python code generated and it returned nothing – Volatil3 Jul 16 '20 at 08:44
  • @AhmedAbdelhameed Not the end. The interview is considered _ended_ when there is no more text like _question_ and _answer_. And your regex is not working; I tried it on the Regex101 site – Volatil3 Jul 16 '20 at 08:44
  • So, what did you try to make your regex stop at either `question` or `answer`? See [Regex Match all characters between two strings](https://stackoverflow.com/questions/6109882/) – Wiktor Stribiżew Jul 16 '20 at 08:45
  • @WiktorStribiżew Right now I am stuck because it is not fetching the multiline text after _interview with_. So far I could not figure out how to stop once it no longer finds `question` or `answer`. – Volatil3 Jul 16 '20 at 08:47
  • I'm still not sure what exactly you're trying to achieve but maybe try something like `interview with.+(?:\s+Question:.+\s+Answer:.+)*`. Demo: https://regex101.com/r/jNVigs/1 – 41686d6564 stands w. Palestine Jul 16 '20 at 08:49
  • @AhmedAbdelhameed It is mentioned that I want to fetch all text that starts with _Interview with_ and ends when the word "Answer" no longer appears. – Volatil3 Jul 16 '20 at 08:51
  • @AhmedAbdelhameed the Regex you replied is the one I was looking for. – Volatil3 Jul 16 '20 at 08:52
  • @AhmedAbdelhameed Can you tell me what I was doing wrong? What does `?:` mean? – Volatil3 Jul 16 '20 at 08:57
  • @Volatil3 It's called a non-capturing group (see [this post](https://stackoverflow.com/q/3512471/8967612)). You might also want to spend some time in the [Regex reference](https://stackoverflow.com/q/22937618/8967612) because the pattern that you used was very unrelated to the requirements that you (kind of) described and that's why people were confused. You'll find a lot of resources there. Good luck :) – 41686d6564 stands w. Palestine Jul 16 '20 at 09:01
  • @AhmedAbdelhameed Thanks and JazakAllah, but what do we use for line breaks? The PDF could be messy and there could be unnecessary line breaks after _Mr.XYZ_. How can I make sure it covers both the line-break and the non-line-break case, since there could also be a line break after _interview with_? – Volatil3 Jul 16 '20 at 09:10
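
The pattern suggested in the comments can be tried directly in Python. The following is only a sketch, not a definitive solution: it swaps `.+` for `\s+\S.*` after "interview with" so that a line break before the name is tolerated, and it assumes a name always follows the header and that every Question: line is followed by an Answer: line. The sample `text` and the variable names are made up for illustration.

```python
import re

# Sample input modelled on the question (hypothetical data).
text = """Some non Interview text

interview with Mr.XYZ



Question: How are you?
Answer: I am fine.

Question: What do you do?
Answer: Nothing



Some non Interview text
"""

pattern = re.compile(
    # Header: \s+ also matches line breaks, so the name may sit on the next line.
    r"interview with\s+\S.*"
    # Non-capturing group (?:...) repeated once per Question/Answer pair;
    # \s+ absorbs any number of blank lines between the pieces.
    r"(?:\s+Question:.*\s+Answer:.*)*"
)

m = pattern.search(text)
interview = m.group() if m else None
print(interview)
```

Because `.` does not match a newline by default, each `.*` stops at the end of its line, and the match ends after the last Answer: line once no further Question: follows.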

1 Answer


Use the re.findall function to get all the occurrences of a particular text.

match = re.findall(r'interview with\s*\w+\.\w+', text)

match is a list of the occurrences of the matched text. If you only want the names, use r'interview with\s*(\w+\.\w+)' as the search string. Note that this matches only the "interview with ..." header, not the Question: and Answer: lines that follow it.
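
As a small illustration of the difference a capturing group makes to re.findall, here is a minimal sketch using a variant of the pattern above (raw string, escaped dot). The sample `text` is made up for the example.

```python
import re

# Hypothetical sample input with a single interview header.
text = "intro\ninterview with Mr.XYZ\nQuestion: How are you?\nAnswer: I am fine.\n"

# No capturing group: findall returns the full matched text.
full = re.findall(r"interview with\s*\w+\.\w+", text)

# One capturing group: findall returns only what the group matched.
names = re.findall(r"interview with\s*(\w+\.\w+)", text)

print(full)   # ['interview with Mr.XYZ']
print(names)  # ['Mr.XYZ']
```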

Roshin Raphel