I'm parsing an HTML page and want to remove all the text between '<!DOCTYPE html>' and 'count green'. So for example, if the string in the text file (TestFile.txt) is

<!DOCTYPE html>FOOBAR count green

I would like to return

<!DOCTYPE html> count green

My code is

import re

# open text file
with open ("TestFile.txt", "r") as myfile:
    data=myfile.read().replace('\n', '')

# find text at start to replace
removeStartCompile = re.compile('<!DOCTYPE html>(.*?)count green')
removeStartSearch = removeStartCompile.search(data)
removeStart = removeStartSearch.group(1)

data = re.sub(removeStart,"",data)
print (data)

This example works. However, when I expand the text file to a full HTML page (you can imagine it gets pretty large), I end up parsing about 300,000 characters and get a "bad character range" error.

Anyone have any ideas?

maxymoo
Iorek
  • Be careful: People have [gone mad](http://stackoverflow.com/a/1732454/103081) trying to parse HTML with regex. – Paul Aug 12 '15 at 01:12

1 Answer

Rather than using regex, you could try using Python's stdlib string functions:

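# markers that bound the text to remove
starttext = "<!DOCTYPE html>"
endtext = "count green"

# index just past the start marker; index() raises ValueError if a marker is missing
start = data.index(starttext) + len(starttext)
# look for the end marker only after the start marker
end = data.index(endtext, start)

# keep both markers, drop everything between them
output = data[:start] + data[end:]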
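
For context on the error itself: the question's code passes the captured text (`removeStart`) to `re.sub` as a pattern, so any regex metacharacters in the HTML get interpreted, and a bracketed run such as `[z-a]` would raise "bad character range". Below is a minimal sketch of two ways around that. The function names `strip_between` and `strip_between_regex` are hypothetical, and returning the input unchanged when a marker is missing is an assumption made here, not something the question asks for.

```python
import re


def strip_between(data, starttext, endtext):
    """Remove the text between two markers using plain string search."""
    start = data.find(starttext)
    if start == -1:
        return data  # assumption: leave input untouched if the start marker is absent
    start += len(starttext)
    end = data.find(endtext, start)  # search only after the start marker
    if end == -1:
        return data  # assumption: same policy when the end marker is absent
    return data[:start] + data[end:]


def strip_between_regex(data, starttext, endtext):
    """Same result with a regex; re.escape makes the markers literal."""
    pattern = re.compile(
        re.escape(starttext) + r".*?" + re.escape(endtext),
        re.DOTALL,  # let "." match newlines, so the file need not be flattened
    )
    # A function replacement avoids backslash processing in the replacement text.
    return pattern.sub(lambda m: starttext + endtext, data, count=1)


sample = "<!DOCTYPE html><a class='[z-a]'>FOO(</a> count green tail"
result_plain = strip_between(sample, "<!DOCTYPE html>", "count green")
result_regex = strip_between_regex(sample, "<!DOCTYPE html>", "count green")
print(result_plain)
print(result_regex)
```

Neither version ever compiles the page content as a pattern, so the size and contents of the HTML no longer matter. Note that the space before 'count green' sits between the markers, so it is removed as well.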
maxymoo