How to remove text between a pair of substrings in Python when the string is very long

Question

I'm parsing an HTML page and want to remove all the text between '<!DOCTYPE html>' and 'count green'. For example, if the string in the text file (TestFile.txt) is

<!DOCTYPE html>FOOBAR count green

I would like to return

<!DOCTYPE html> count green

My code is

import re

# open text file
with open ("TestFile.txt", "r") as myfile:
    data=myfile.read().replace('\n', '')

# find text at start to replace
removeStartCompile = re.compile('<!DOCTYPE html>(.*?)count green')
removeStartSearch = removeStartCompile.search(data)
removeStart = removeStartSearch.group(1)

data = re.sub(removeStart,"",data)
print (data)

This is an example and it works. However, when I expand the text file to a full HTML page (you can imagine it gets pretty large), I end up trying to parse about 300,000 characters and I get a "bad character range" error.

Anyone have any ideas?

Be careful: People have [gone mad](http://stackoverflow.com/a/1732454/103081) trying to parse HTML with regex. — Paul, Aug 12 '15 at 01:12

score 2 · Accepted Answer · answered Aug 12 '15 at 01:10

Rather than using regex, you could try Python's built-in string methods:

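starttext = "<!DOCTYPE html>"
endtext = "count green"

# index just past the start marker (str.index raises ValueError if it is missing)
start = data.index(starttext) + len(starttext)
# look for the end marker only after the start marker
end = data.index(endtext, start)

# keep everything through the start marker, then everything from the end marker on
output = data[:start] + data[end:]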
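
If you want this as a reusable piece, here is a minimal sketch of the same slicing idea wrapped in a function. The name `remove_between` is my own, not something from the standard library, and I'm assuming you want the string returned unchanged when either marker is missing (hence `str.find` instead of `str.index`).

```python
def remove_between(data, starttext, endtext):
    """Return data with the text between starttext and endtext removed.

    Both markers are kept. If either marker is missing, data is returned
    unchanged. (Hypothetical helper name, for illustration only.)
    """
    start = data.find(starttext)
    if start == -1:
        return data
    start += len(starttext)

    # Look for the end marker only after the start marker.
    end = data.find(endtext, start)
    if end == -1:
        return data

    return data[:start] + data[end:]


# The removed text contains "[z-a]", which is an invalid range in a regex
# but is harmless here because no pattern is ever compiled from it.
sample = "<!DOCTYPE html>FOO[z-a]BAR count green"
print(remove_between(sample, "<!DOCTYPE html>", "count green"))
```

Because this only does plain substring searches, characters in the page that are special in a regex cannot cause a "bad character range" error. That error most likely came from passing the captured text to `re.sub` as a pattern. Note that everything up to 'count green' is removed, including the space before it.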

maxymoo

