The way I look at this situation is this: rather than eliminating the unwanted stuff (HTML, JavaScript, CSS, and malformed HTML tags), why not simply look for what is actually wanted?
What is wanted is the text that is visually experienced when the web page is loaded in a web browser.
The text is likely to be inside certain divs and spans, but of course there is no knowing whether it has been hidden by being commented out in the HTML:
( <!-- example --> )
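For the commented-out case, a hand-written pass with no third-party modules might look something like this. It is only a minimal sketch, assuming comments are well formed (`<!-- ... -->`); `strip_comments` is a hypothetical helper name, not something from a library.

```python
def strip_comments(html):
    """Return html with every <!-- ... --> span removed."""
    out = []
    pos = 0
    while True:
        start = html.find('<!--', pos)
        if start == -1:
            # No more comments: keep the rest of the page.
            out.append(html[pos:])
            break
        # Keep everything before the comment opener.
        out.append(html[pos:start])
        end = html.find('-->', start + 4)
        if end == -1:
            # Assumption: an unterminated comment hides the rest of the page.
            break
        pos = end + 3
    return ''.join(out)

print(strip_comments('<p>seen</p><!-- example --><p>also seen</p>'))
```

This only handles comments; text hidden with CSS or scripts would still need separate treatment.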
Perhaps we need a way to take a snapshot of the web page, convert the image to PDF, convert the PDF to text, and grab the text from the PDF file?
Of course that was simply a joke, but seriously: is there a way to do the equivalent of that while working with the parsed raw web page source?
I do not want to use Beautiful Soup because, in case of future errors, there is no knowing how to modify things.
Modules just aren't what I am looking for, because the future changes and I need to be ready to modify my code to control the future instead of letting it control me.
I saw a solution that looks for '<' and, as soon as it finds one, skips every character one by one until it hits a '>'. Unfortunately my Python script showed me an error saying 'delete' was not defined.
But something like that is what I am looking for.
Perhaps someday I can reverse it and make it keep only what I am looking for, rather than deleting everything inside the tags and keeping what is left.
This is what the code looked like:
txt = []
for i in html:
    if i == '<':
        delete = True
        continue
    if i == '>':
        delete = False
        continue
    if delete == True:
        continue
    txt.append(i)
Why does this code raise an error saying 'delete' is not defined?
Here is where I found the example above:
Removing html tags using python?
This approach is rather in line with imagination itself. Maybe other modules already work this way, but this is more open, so I can think about fixing it myself in the future in case of upgrades and modifications.
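On the 'delete is not defined' error in the first snippet: as far as I can tell, `delete` is only assigned when a '<' or '>' is seen, so if the page starts with any other character the `if delete == True` line reads a name that does not exist yet. A minimal sketch of a fix, assuming that is the cause; the sample `html` string is made up for illustration.

```python
html = "hello <b>world</b>"  # hypothetical sample input that starts with plain text

txt = []
delete = False  # assumed fix: give the flag a value before the loop reads it
for i in html:
    if i == '<':
        delete = True   # entering a tag: start skipping
        continue
    if i == '>':
        delete = False  # leaving a tag: stop skipping
        continue
    if delete:
        continue
    txt.append(i)

result = ''.join(txt)
print(result)
```

Without the `delete = False` line, the same loop fails on the very first character 'h'.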
UPDATE
data="<title>ooo</title>"
delete = False
data2 = []
for i in data:
    if i == '<':
        delete = True
    if i == '>':
        delete = False
    if delete == True:
        continue
    data2.append(i)
print data2
UPDATE
Here's a working model that prints the exact opposite of what I am looking for. This code needs to be fixed too.
data="<title>ooo</title>"
record = "yes"
recorded= []
for i in data:
    if i == '<':
        record = "yes"
    if i == '>':
        record = "no"
    if record == "yes":
        recorded.append(i)
print recorded
The result is:
>>> print recorded
['<', 't', 'i', 't', 'l', 'e', '<', '/', 't', 'i', 't', 'l', 'e']
UPDATE
Finally I fixed it somehow, but it now needs to stop recording the unwanted > character.
data="<title>ooo</title>"
record = "yes"
recorded= []
for i in data:
    if i == '<':
        record = "no"
    if i == '>':
        record = "yes"
    if record == "yes":
        recorded.append(i)
The output is:
>>> print recorded
['>', 'o', 'o', 'o', '>']
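The stray '>' seems to get recorded because the flag is switched back on and then the same character falls through to the append. One possible way around it is to chain the checks with `elif`, so a '<' or '>' only flips the flag and is never appended. This is only a sketch; `visible_text` is a name made up for illustration, and a boolean is used in place of the "yes"/"no" strings.

```python
def visible_text(data):
    """Keep only the characters that sit outside <...> tags."""
    recorded = []
    record = True  # start in recording mode, in case the text comes first
    for ch in data:
        if ch == '<':
            record = False   # tag opens: stop recording
        elif ch == '>':
            record = True    # tag closes: resume, but do not keep the '>'
        elif record:
            recorded.append(ch)
    return ''.join(recorded)

print(visible_text("<title>ooo</title>"))
```

It still treats every '<' and '>' as a tag boundary, so a '>' inside an attribute value, a script, or a comment would confuse it.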