Delete words between specific elements in a txt-file with python

Question

I am a python beginner and have the following problem:

I have a text file ('demofile.txt') and want to cut out everything between two specific elements ({start} and {end}) multiple times.
As an example, imagine the text file contains:

'AAAA {start} BBBB {end} CCCC {start} DDDD {end} EEEE {start} FFFF {end} GGGG'

The outcome should be:

'AAAA CCCC EEEE GGGG'

First I defined the two elements which work as the cutter

start = '{start}'  
end = '{end}'

Then I tried to cut out the first part and used this code:

text_start = text.find(start)
text_new = text[0:text_start]
print(text_new)

The outcome is: 'AAAA ', which is what I wanted.

For the next part I tried this:

text_start = text.find(end)
text_end = text.find(start, text_start)
text_new = text[text_start+len(end):text_end]
print(text_new)

The outcome is: 'CCCC ' which is again what I was looking for

Now I tried to put everything together and build a loop and failed :-)

text_start = text.find(start)
text_new = text[0:text_start]

text_end = 0

for parts in text.split("{"):
    text_start = text.find(end, text_end)
    text_end = text.find(start, text_start)
    text_new = text_new + text[text_start+len(end):text_end]
print(text_new)

The outcome is: 'AAAA CCCC EEEE GGG {start} BBBB {end} CCCC {start} DDDD {end}...' and a lot more of that. So the outcome was okay until "GGG", but one G is missing, and everything after that should have been deleted. I guess the loop kept running somehow, and starting the loop with the split statement is the culprit.

What is the solution here? I would like to understand what went wrong and fix the code. Of course, I am also interested in a shorter and more elegant way; I am sure what I did is quite terrible ;-) I found something about "regular expressions", but I was not able to get that working either. Thanks for any ideas.

(PS: any idea how I could save everything I cut out in a separate file?)

According to the duplicate target `re.sub('{start}.*?{end}', '', your_string, flags=re.DOTALL)` should do the trick. — Georgy, May 31 '20 at 15:00
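As a minimal sketch of what this comment suggests: the `re.DOTALL` flag only matters when a `{start}` ... `{end}` pair spans more than one line, because `.` does not match a newline by default. The two-line sample string below is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample where the part to cut spans two lines.
text = "AAAA {start} BBBB\nBBBB {end} CCCC"

pattern = '{start}.*?{end}'  # same pattern as in the comment above

# Without re.DOTALL, '.' stops at the newline, so nothing matches.
without_flag = re.sub(pattern, '', text)

# With re.DOTALL, '.' also matches '\n', so the whole span is removed.
with_flag = re.sub(pattern, '', text, flags=re.DOTALL)

print(repr(without_flag))
print(repr(with_flag))
```

The spaces on both sides of a removed span stay in the string, so the result contains a double space. Something like `' '.join(with_flag.split())` would collapse it, at the cost of also collapsing newlines.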

Anwarvic · Answer 1 · 2020-05-31T14:54:02.067

2

You can simply do it like so:

import re

text = "AAAA {start} BBBB {end} CCCC {start} DDDD {end} EEEE {start} FFFF {end} GGGG"

pattern = r'(\s+{start} \w+ {end})'  # raw string so \s and \w reach the regex engine unchanged
text = re.sub(pattern, '', text)

print(text)
#AAAA CCCC EEEE GGGG

Now, you can write text into a new text file named new_file.txt like so:

# you can change the filename by replacing `new_file.txt` with any other name
with open("new_file.txt", "w") as fout:
    fout.write(text)
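For the PS in the question (saving what was cut out), one possible sketch is to collect the removed pieces with `re.findall` before deleting them. The helper names `split_kept_and_removed` and `save_parts` and the file name `removed_parts.txt` are made up for this example; they are not part of the answer above.

```python
import re

# Capture group 1 holds the text between the markers.
PATTERN = re.compile(r'\{start\}(.*?)\{end\}', flags=re.DOTALL)


def split_kept_and_removed(text):
    """Return (cleaned_text, removed_parts); hypothetical helper."""
    # With one capture group, findall returns just the captured strings.
    removed_parts = [part.strip() for part in PATTERN.findall(text)]
    cleaned = PATTERN.sub('', text)
    # Collapse the leftover runs of whitespace into single spaces.
    cleaned = ' '.join(cleaned.split())
    return cleaned, removed_parts


def save_parts(parts, filename):
    """Write each removed part on its own line; hypothetical helper."""
    with open(filename, "w") as fout:
        fout.write("\n".join(parts))


text = "AAAA {start} BBBB {end} CCCC {start} DDDD {end} EEEE {start} FFFF {end} GGGG"
cleaned, removed = split_kept_and_removed(text)
print(cleaned)
print(removed)
```

Calling `save_parts(removed, "removed_parts.txt")` would then put the cut-out pieces into their own file, while `cleaned` can be written to `new_file.txt` as shown above.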

edited May 31 '20 at 14:54

answered May 31 '20 at 14:43

Anwarvic


Thank you very much, Anwarvic, for your quick response. This does indeed look much more elegant. I guess there is a typo or something? It just cut out the cutting elements, but not the text in between. The result I am looking for is 'AAAA CCCC EEEE GGGG' without BBBB, DDDD and FFFF. – Poldi May 31 '20 at 14:51
My bad, I misunderstood your question. I've edited my answer. Try it now, it should work just fine – Anwarvic May 31 '20 at 14:54
Top, thank you! :-) – Poldi May 31 '20 at 14:55

score 1 · Answer 2 · answered May 31 '20 at 14:46


You can do it like this:

text = 'AAAA {start} BBBB {end} CCCC {start} DDDD {end} EEEE {start} FFFF {end} GGGG'

start = '{start}'
end = '{end}'

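while True:
    text_start = text.find(start)
    if text_start == -1:
        break  # no start marker left
    # Search for the end marker only after the start marker.
    text_end = text.find(end, text_start) + len(end)
    # text_start - 1 also removes the space before the start marker.
    text = text[:text_start - 1] + text[text_end:]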

print(text)

Or by using regex:

import re
text = 'AAAA {start} BBBB {end} CCCC {start} DDDD {end} EEEE {start} FFFF {end} GGGG'

start = '{start}'
end = '{end}'
text = re.sub(fr"{start}.*?{end} ", "", text) # f string requires python3.6+
print(text)

Output:

AAAA CCCC EEEE GGGG
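One caveat, offered as a side note rather than as part of this answer: `{start}` and `{end}` happen to work unescaped in a pattern, but markers containing characters such as `[`, `(` or `*` would be read as regex syntax. A sketch using `re.escape`, with made-up markers `[[` and `]]`:

```python
import re

# Hypothetical markers that contain regex metacharacters.
start = '[['
end = ']]'
text = 'AAAA [[ BBBB ]] CCCC [[ DDDD ]] EEEE'

# re.escape makes the markers match literally instead of as regex syntax.
# The trailing ' ?' removes one optional space after the end marker.
pattern = re.escape(start) + r'.*?' + re.escape(end) + r' ?'
text = re.sub(pattern, '', text)
print(text)
```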

answered May 31 '20 at 14:46

Asocia


Great! Thank you very much Asocia. You saved my day :-) Do you have an idea what was going wrong in my code? – Poldi May 31 '20 at 14:54
Unfortunately, I don't. But I can say that a `for` loop is not what you want in your case, because you don't know in advance how many times you need to iterate. So `while` is a better choice. – Asocia May 31 '20 at 14:58
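A possible explanation of what went wrong in the question's loop, based on how `str.find` and slicing behave: `text.split("{")` yields seven parts for the sample string, so the loop body runs seven times although there are only three `{end}` markers. On the third pass `text.find(start, text_start)` finds no further `{start}` and returns -1, so the slice ends at index -1 and drops the last character (hence "GGG"). On every later pass both `find` calls return -1, and the slice `text[-1 + len(end):-1]` appends almost the whole original text again. A minimal sketch of a loop that stops as soon as `find` returns -1 (the names `kept`, `pos` and `result` are made up here):

```python
text = 'AAAA {start} BBBB {end} CCCC {start} DDDD {end} EEEE {start} FFFF {end} GGGG'
start = '{start}'
end = '{end}'

kept = []  # pieces of text that lie outside the markers
pos = 0    # position from which the next search begins

while True:
    text_start = text.find(start, pos)
    if text_start == -1:
        # No further start marker: keep the remainder and stop.
        kept.append(text[pos:])
        break
    kept.append(text[pos:text_start])
    text_end = text.find(end, text_start)
    if text_end == -1:
        # Unmatched start marker: this sketch simply drops the rest.
        break
    pos = text_end + len(end)

# Join the pieces and collapse the doubled spaces left by the cuts.
result = ' '.join(''.join(kept).split())
print(result)
```

Checking the return value of `find` before slicing is the key difference from the original attempt.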

score 0 · Answer 3 · answered May 31 '20 at 15:06

You can try the re package in Python, so the code will be something like:

import re
text = "AAAA {start} BBBB {end} CCCC {start} DDDD {end} EEEE {start} FFFF {end} GGGG"

To strip just the markers themselves (the words between them are kept), you could simply use:

re.sub("{start}|{end}" , "", text)

You can also use a list comprehension:

# compare with != ; `word not in "{start}"` would be a substring test
words_to_save = [word for word in text.split() if word != "{start}"]
words_to_save = [word for word in words_to_save if word != "{end}"]

clean_text = " ".join(words_to_save)
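To also drop the words between the markers, as the question asks, a word-based sketch could track whether the current word lies inside a marked region. The flag name `inside` is made up for this example, and the approach assumes the markers are separated from other words by whitespace.

```python
text = "AAAA {start} BBBB {end} CCCC {start} DDDD {end} EEEE {start} FFFF {end} GGGG"

words_to_save = []
inside = False  # True while we are between {start} and {end}

for word in text.split():
    if word == "{start}":
        inside = True
    elif word == "{end}":
        inside = False
    elif not inside:
        # Only words outside a marked region are kept.
        words_to_save.append(word)

clean_text = " ".join(words_to_save)
print(clean_text)
```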
