I'm parsing an HTML page and want to remove all the text between '<!DOCTYPE html>' and 'count green'. For example, if the string in the text file (TestFile.txt) is
<!DOCTYPE html>FOOBAR count green
I would like to return
<!DOCTYPE html> count green
My code is
import re
# open text file
with open("TestFile.txt", "r") as myfile:
    data = myfile.read().replace('\n', '')
# find text at start to replace
removeStartCompile = re.compile('<!DOCTYPE html>(.*?)count green')
removeStartSearch = removeStartCompile.search(data)
removeStart = removeStartSearch.group(1)
data = re.sub(removeStart,"",data)
print (data)
This small example works. However, when I expand the text file to a full HTML page (as you can imagine, it gets pretty large), I end up trying to parse about 300,000 characters and I get a "bad character range" error.
Anyone have any ideas?
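
One possible cause, offered as a guess rather than a diagnosis: the captured text is passed to re.sub as a pattern, so any regex metacharacters in the page (for instance a bracketed range such as [z-a]) get interpreted instead of matched literally. A minimal sketch that avoids this by cutting the captured span out by position is below. The function name remove_between and the single space left between the two markers are assumptions made to match the desired output above. It also removes only the first span, not every repeat of that text elsewhere in the page.

```python
import re

START = '<!DOCTYPE html>'
END = 'count green'

def remove_between(data, start=START, end=END):
    """Drop the text between the first `start` and the next `end`, keeping both markers."""
    # re.escape makes the markers literal; re.DOTALL lets '.' match newlines too.
    pattern = re.compile(re.escape(start) + r'(.*?)' + re.escape(end), re.DOTALL)
    match = pattern.search(data)
    if match is None:
        return data  # markers not found: leave the input unchanged
    # Cut the captured span out by position instead of reusing it as a pattern.
    # The single space is an assumption, chosen to match the desired output.
    return data[:match.start(1)] + ' ' + data[match.end(1):]

print(remove_between('<!DOCTYPE html>FOOBAR count green'))
```

If the original re.sub approach is preferred, wrapping the captured text in re.escape() before passing it as the pattern should have a similar effect.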