
I am trying to remove a specific paragraph from a text file using a regular expression in Python.

Here is the input.txt file:

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    
paragraph_a A_story(

...
some random texts adfsasdsd

...
)

paragraph_b different_story(
...
some random texts
...
)

The expected output is:

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    

paragraph_b different_story(
...
some random texts
...
)

I want to delete the entire paragraph_a block, including its parentheses. Because the contents of the paragraph to be deleted (here, paragraph_a) are random, it has to be identified by the name of the paragraph directly below it (here, paragraph_b).

I've managed to write a regular expression that selects only the paragraph located directly above paragraph_b:

https://regex101.com/r/pwGVbe/1

However, using this regular expression in my code does not delete the paragraph I want.

Here is what I've done so far:

import re

output = open ('output.txt', 'w')
input = open('input.txt', 'r')

for line in input:
#    print(line)
    t = re.sub('^(\w+ \w+\((?:(.|\n)*)\))\s*^paragraph_b','', line)
    output.write(t)

Is there a solution or a clue I could follow? Any answer or advice would be appreciated.

Thanks.

Parine
  • If your regex successfully matches paragraph_a content, then what's missing? You're not being very clear about your goal and what's lacking in your current solution. – PookyFan Aug 21 '22 at 13:56
  • please add expected output and actual output to the question – rok Aug 21 '22 at 13:59
  • @PookyFan As I mentioned in the question, even though the regex itself matches, the code doesn't work. – Parine Aug 21 '22 at 13:59
  • @rok I added the desired output. The current output from the code is blank even though the regular expression seems to match, which is why I'm asking about the code. – Parine Aug 21 '22 at 14:04
  • @Parine I understand now, see my answer. Does it help? – PookyFan Aug 21 '22 at 14:16

2 Answers


You can match the paragraph directly before paragraph_b by using a lookahead that asserts paragraph_b follows, while preventing the match from crossing into other paragraphs.

Note that input is the name of a built-in function (not a reserved keyword), so assigning to it shadows the built-in. Instead of writing input = open('input.txt', 'r') you might write input_file = open('file', 'r')

 ^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)

Regex demo

If the match also should not start with paragraph_b itself:

^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)

Regex demo
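As a rough illustration, the second pattern can be tried on a small in-memory string before touching any file. The sample text below is made up for this sketch (it only imitates the shape of the input.txt in the question), and the names BODY, removed and result are illustrative.

```python
import re

# Hypothetical sample text, shaped like the input.txt in the question.
sample = (
    "fff\n"
    "paragraph_a A_story(\n"
    "\n"
    "some random texts adfsasdsd\n"
    ")\n"
    "\n"
    "paragraph_b different_story(\n"
    "some random texts\n"
    ")\n"
)

# Block body: following lines that do not themselves start a "name name(" header.
BODY = r"(?:\n(?!^\w+ \w+\().*)*"
# The lookahead (?=...) checks that paragraph_b follows without consuming it.
pattern = r"^(?!paragraph_b)\w+ \w+\(" + BODY + r"\)(?=\s*^paragraph_b)"

# re.M lets ^ match at the start of every line, not only at the start of the string.
match = re.search(pattern, sample, flags=re.M)
removed = match.group(0)
result = re.sub(pattern, "", sample, flags=re.M)

print(repr(removed))
print(result)
```

Because the lookahead does not consume anything, the blank line and the paragraph_b header stay in the result; only the block above them is removed.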

Example, using input_file.read() to read the whole file:

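import re

# Raw string, so backslashes such as \w reach the regex engine unchanged.
pattern = r'^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)'

# Read the whole file at once so the pattern can span several lines.
with open('file', 'r') as input_file:
    text = input_file.read()

# re.M makes ^ match at the start of every line.
t = re.sub(pattern, '', text, flags=re.M)

with open('file_out', 'w') as output_file:
    output_file.write(t)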

Contents of file_out:

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    


paragraph_b different_story(
...
some random texts
...
)
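One thing to watch when copying the pattern: an accidental space in front of the leading ^ stops it from matching at all, because a line start can never come right after a space. A minimal sketch of that effect, using a made-up sample string and the illustrative names clean and with_space:

```python
import re

# Hypothetical sample text with one block directly above paragraph_b.
text = "paragraph_a A_story(\n...\n)\n\nparagraph_b different_story(\n...\n)\n"

clean = r"^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)"
with_space = " " + clean  # same pattern with an accidental leading space

found_clean = re.search(clean, text, flags=re.M) is not None
found_with_space = re.search(with_space, text, flags=re.M) is not None

print(found_clean, found_with_space)  # prints: True False
```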
The fourth bird
  • Thanks for the answer, but even after replacing the pattern in my re.sub call with your suggestion, the code didn't work. – Parine Aug 21 '22 at 14:06
  • @Parine I have added example code to the answer. – The fourth bird Aug 21 '22 at 14:07
  • Thanks, I tried yours, but the output remains the same as the input file. – Parine Aug 21 '22 at 14:12
  • @Parine Did you try testing this with the data that you shared in the question? Can you share a part of the real file? – The fourth bird Aug 21 '22 at 14:18
  • Oh, there is a one-character leading space in the regex in your first answer ( ^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)). When I deleted that space, it worked! Thank you very much, it helped me a lot. – Parine Aug 21 '22 at 14:24

Your code doesn't work because you're parsing text line by line:

for line in input:

That way the regex only ever sees one line at a time, so a pattern that spans several lines has no chance to match. You're better off reading the whole file at once, storing it in a single string variable, and then applying the regex to that string.
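A small sketch of the difference, run on an in-memory string instead of a real file. The sample text is made up, and the pattern is borrowed from the other answer rather than from the question's code, so treat both as assumptions of this example:

```python
import re

# Hypothetical stand-in for the contents of input.txt.
text = (
    "fff\n"
    "paragraph_a A_story(\n"
    "some random texts\n"
    ")\n"
    "\n"
    "paragraph_b different_story(\n"
    "...\n"
    ")\n"
)

pattern = re.compile(
    r"^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)",
    re.M,
)

# Line by line: each string holds a single line, so the multi-line pattern never matches.
per_line = "".join(pattern.sub("", line) for line in text.splitlines(keepends=True))

# Whole text at once: the pattern can span the entire paragraph_a block.
whole = pattern.sub("", text)

print(per_line == text)        # True: nothing was removed
print("paragraph_a" in whole)  # False: the block is gone
```

With a real file, the whole-text variant corresponds to calling read() once on the opened file and passing the resulting string to the substitution.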

PookyFan