This is a sample of the text I am working with.

6) Jake's Taxi Service is a new entrant to the taxi industry. It has achieved success by staking out a unique position in the industry. How did Jake's Taxi Service mostly likely achieve this position?

A) providing long-distance cab fares at a higher rate than competitors; servicing a larger area than competitors

B) providing long-distance cab fares at a lower rate than competitors; servicing a smaller area than competitors

C) providing long-distance cab fares at a higher rate than competitors; servicing the same area as competitors

D) providing long-distance cab fares at a lower rate than competitors; servicing the same area as competitors

Answer: D

I am trying to match the entire question, including the answer options: everything from the question number to the word Answer.

This is my current regex:

((rf'(?<={searchCounter}\) ).*?(?=Answer).*'), re.DOTALL)

`searchCounter` is a variable that holds the current question number, in this case 6. I think the issue has something to do with matching across newlines.

EDIT: Full source code

searchCounter = 1

bookDict = {}

with open ('StratMasterKey.txt', 'rt') as myfile:

    for line in myfile:
        question_pattern = re.compile((rf'(?<={searchCounter}\) ).*?(?=Answer).*'), re.DOTALL) 

        result = question_pattern.search(line)
        if result != None: 
            bookDict[searchCounter] = result[0] 
            searchCounter +=1
Clayton Horning
  • You actually get all text from the question number to the last `D`, see the [regex demo](https://regex101.com/r/qkBImf/2). How are you reading the file? `for line in file`? You need to read the file into a variable, like `contents = file.read()`. – Wiktor Stribiżew Apr 23 '20 at 14:49
  • I'm actually getting an empty dictionary when I run it in my project. Would you mind elaborating a little? I added my code. – Clayton Horning Apr 23 '20 at 15:01
  • When I try '(?<={searchCounter}\) ).*?(?=a)')' I get 'A good str' which seems to be working fine. It's only an issue when I try to span over the new line to the word Answer. @WiktorStribiżew – Clayton Horning Apr 23 '20 at 15:16
  • It is a logic problem: you have `for line in myfile:`, so you read line by line, but your pattern is written to find matches in a single multiline string. Remove `for line in myfile:` and replace it with `contents = myfile.read()`, then use `result = question_pattern.search(contents)`. – Wiktor Stribiżew Apr 23 '20 at 15:16
  • Great! However, my only question is how do I iterate through the instances if I am not searching line by line? – Clayton Horning Apr 23 '20 at 15:26
  • `re.findall(rf'^{searchCounter}\)\s*([\s\S]*?)\nAnswer:\s*(.*)', contents, re.M)`? – Wiktor Stribiżew Apr 23 '20 at 15:29
  • Or, just get them all: `re.findall(r'^(\d+)\)\s*([\s\S]*?)\nAnswer:\s*(.*)', contents, re.M)` – Wiktor Stribiżew Apr 23 '20 at 15:36
  • So what output do you need to get? – Wiktor Stribiżew Apr 23 '20 at 15:46
  • It's looking a lot better now. My end goal is to get the whole question as well as the whole answer instead of just D, i.e. _D) providing long-distance cab fares at a lower rate than competitors; servicing the same area as competitors_, into a dictionary or list so that I can convert them into a CSV. I have a file with over 1,000 questions. – Clayton Horning Apr 23 '20 at 16:00
  • So what are the specs? "A lot better" is the maximum I can do with your example. I hope you are not going to post all 1000 examples here. – Wiktor Stribiżew Apr 23 '20 at 16:02
  • You solved my question regarding the multiline issue. You asked what output do I need to get which I guess is out of the scope of this question :) – Clayton Horning Apr 23 '20 at 16:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212374/discussion-between-clayton-horning-and-wiktor-stribizew). – Clayton Horning Apr 23 '20 at 17:19
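
The `re.findall` suggestion from the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: the `contents` string is a shortened, partly made-up stand-in for the real file, `book_dict` is a hypothetical name, and the step that looks up the full text of the answer option assumes every option starts its own line as `<letter>) `.

```python
import re

# Hypothetical stand-in for the file contents (shortened; question 7 is made up).
contents = """6) How did Jake's Taxi Service most likely achieve this position?

A) higher rate; larger area

B) lower rate; smaller area

C) higher rate; same area

D) lower rate; same area

Answer: D

7) A second, made-up question?

A) first option

B) second option

Answer: A
"""

# Pattern from the comments: question number, body (question text plus
# options), and whatever follows "Answer:" on the same line.
question_pattern = re.compile(r'^(\d+)\)\s*([\s\S]*?)\nAnswer:\s*(.*)', re.M)

book_dict = {}
for number, body, letter in question_pattern.findall(contents):
    letter = letter.strip()
    # Assumption: each option begins a line as "<letter>) ".
    option = re.search(rf'^{re.escape(letter)}\)\s*(.*)', body, re.M)
    book_dict[int(number)] = {
        "question": body.strip(),
        # Fall back to the bare letter if no matching option line is found.
        "answer": option.group(0) if option else letter,
    }

for number, entry in book_dict.items():
    print(number, entry["answer"])
```

Because `findall` scans the whole string, no counter is needed to move from one question to the next; each match supplies its own question number.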

1 Answer

The reason your regex fails is that you read the file line by line with `for line in myfile:`, while your pattern searches for matches in a single multiline string.

Replace `for line in myfile:` with `contents = myfile.read()` and then use `result = question_pattern.search(contents)` to get the first match, or `result = question_pattern.findall(contents)` to get multiple matches.
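
As a minimal sketch of the difference, the snippet below runs the question's pattern once per line and once on the whole text. The two-question `SAMPLE` string and the use of `io.StringIO` in place of `StratMasterKey.txt` are assumptions made for illustration only.

```python
import io
import re

# Hypothetical sample standing in for the contents of StratMasterKey.txt.
SAMPLE = (
    "1) First question?\n"
    "\n"
    "A) yes\n"
    "\n"
    "B) no\n"
    "\n"
    "Answer: A\n"
    "\n"
    "2) Second question?\n"
    "\n"
    "A) yes\n"
    "\n"
    "B) no\n"
    "\n"
    "Answer: B\n"
)

search_counter = 1
pattern = re.compile(rf'(?<={search_counter}\) ).*?(?=Answer).*', re.DOTALL)

# Line by line: no single line contains both "1) " and "Answer",
# so the pattern never matches.
line_hits = []
for line in io.StringIO(SAMPLE):
    match = pattern.search(line)
    if match:
        line_hits.append(match.group())

# Whole text at once: the pattern can now span the line breaks.
whole_match = pattern.search(io.StringIO(SAMPLE).read())

print(len(line_hits))
print(whole_match is not None)
```

With this sample the per-line loop collects nothing, while the search over the whole text finds a match.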

A note on the regex: I am not fixing the whole pattern since, as you mentioned, it is out of the scope of this question. However, now that the input is a single multiline string, you need to remove `re.DOTALL`, use `[\s\S]` where the pattern must match any character including line breaks, and use `.` where it must match any character except a line break. Also, the lookaround construct is redundant: you may safely replace `(?=Answer)` with `Answer`. Finally, to check whether there is a match, you may simply use `if result:` and then grab the whole match value with `result.group()`.
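
To illustrate the effect of that change, here is a small sketch comparing the two variants on a made-up two-question string. The `sample` text and the variable names are hypothetical and not taken from the real file.

```python
import re

# Hypothetical sample text with two questions.
sample = (
    "1) First question?\nA) yes\nB) no\nAnswer: A\n"
    "2) Second question?\nA) yes\nB) no\nAnswer: B\n"
)

# With re.DOTALL the trailing .* also crosses line breaks, so the match
# runs on to the end of the string.
with_dotall = re.search(r'(?<=1\) ).*?Answer.*', sample, re.DOTALL)

# Without re.DOTALL, [\s\S]*? spans the line breaks up to "Answer",
# and the trailing .* stops at the end of that line.
without_dotall = re.search(r'(?<=1\) )[\s\S]*?Answer.*', sample)

if without_dotall:
    print(without_dotall.group())
```

In this sketch the second variant ends at `Answer: A`, whereas the `re.DOTALL` variant continues through the second question.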

Full code snippet:

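import re

searchCounter = 1  # number of the question to look up

with open('StratMasterKey.txt', 'rt') as myfile:
    contents = myfile.read()  # read the whole file into one string

# [\s\S]*? spans line breaks; .* stops at the end of the "Answer" line
question_pattern = re.compile(rf'(?<={searchCounter}\) )[\s\S]*?Answer.*')
result = question_pattern.search(contents)
if result:
    print(result.group())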
Wiktor Stribiżew