Python - Grab only Visually Experience-able TEXT from a web page. w/o using Beautiful Soup

Question

The way I look at this situation is this: rather than eliminating the unwanted stuff (HTML, JavaScript, CSS and malformed HTML tags), why not simply look for what is actually wanted?

What is wanted is the text that is visually experienced when the web page is loaded in a web browser.

The text is likely to be inside certain divs and spans, but of course there is no knowing whether it has been hidden by commenting it out in the HTML:

 ( <!--  example --> )

Perhaps we need a way to take a snapshot of the web page, convert the image to PDF, convert the PDF to text, and grab the text from the PDF file?

Of course that was simply a joke, but seriously: is there a way to do the equivalent of that while working with the raw web page source?

I do not want to use Beautiful Soup because, in case of future errors, there is no knowing how to modify things.

Modules just aren't what I am looking for, because the future changes and I need to be ready to modify my code to control the future instead of letting it control me.

I saw a solution that looks for '<' and, as soon as it finds one, skips every character one by one until it hits a '>'. Unfortunately my Python script raised an error saying 'delete' was not defined.

But something like that is what I am looking for.

Perhaps someday I can reverse that and make it keep only what I am looking for, rather than deleting everything and keeping what is not in the tags.

This is what the code looked like..

txt = []
for i in html:
    if i == '<':
        delete = True
        continue
    if i == '>':
        delete = False
        continue
    if delete == True:
        continue

    txt.append(i)

Why does this code raise an error saying 'delete' is not defined?

Here is where I found the example above:

Removing html tags using python?

This approach is rather in line with imagination itself. Maybe other modules already work this way, but this is more open, so the mind can think about fixing it in the future in case of upgrades and modifications.

UPDATE

data="<title>ooo</title>"
delete = False
data2[] = ""
for i in data:
    if i == '<':
        delete = True
    if i == '>':
        delete = False
    if delete == True:
        continue
    data2.append(i)

print data2

UPDATE

Here's a working model that prints the exact opposite of what I am looking for. This code, too, needs to be fixed.

data="<title>ooo</title>"
record = "yes"
recorded= []
for i in data:
    if i == '<':
        record = "yes"
    if i == '>':
        record = "no"
    if record == "yes":
        recorded.append(i)

print recorded

result is

>>> print recorded
['<', 't', 'i', 't', 'l', 'e', '<', '/', 't', 'i', 't', 'l', 'e']

UPDATE

I finally fixed it somehow, but it now needs to stop recording the unwanted > character.

data="<title>ooo</title>"
record = "yes"
recorded= []
for i in data:
    if i == '<':
        record = "no"
    if i == '>':
        record = "yes"
    if record == "yes":
        recorded.append(i)

the output is

>>> print recorded
['>', 'o', 'o', 'o', '>']

**1** What is `data2[] = ""`? To define an array use `data2 = []` **2** Why did you remove the `continue`? **3** I still stand by my opinion: it's an awful way to do the job. **4** Instead of `data2` being an array, why not make it a string and use `data2 += i`? — Robin, Apr 17 '14 at 17:57
@Robin I pretty much fixed it, take a look at the final update in the question.. the only problem now is the >'s that are showing up. — Void State -- Sümer Kolçak, Apr 17 '14 at 18:27
Switch `if i == '>': record = "yes"` and `if record == "yes": recorded.append(i)`. Might I suggest you post your code to http://codereview.stackexchange.com/ to learn what's wrong with it? — Robin, Apr 17 '14 at 18:32
@Robin, i understand that it needs to record if the `i == '>'` but how do you ensure it skips the `>` so that `>` does not get recorded ? — Void State -- Sümer Kolçak, Apr 17 '14 at 18:35
Well to do *that*, one would have to follow the advice in my previous comment and switch the two statements with each other. — Robin, Apr 17 '14 at 18:41
@Robin, ok thanks. I will select your answer as valid answer because in the comments sections there has been lots of help. i will figure out the rest myself .. i should have not even asked that last question about eliminating extra `>`'s given that.. that is off topic to the question and also easy to solve — Void State -- Sümer Kolçak, Apr 17 '14 at 18:46
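The swap suggested in the comments above can be sketched roughly as follows. This is only an illustration, assuming the same sample string as in the question; it also uses a boolean flag and a string (as suggested in the first comment) instead of the "yes"/"no" values and the list:

```python
data = "<title>ooo</title>"
record = True   # True while we are outside a tag
recorded = ""   # a string instead of a list, as suggested in the comments

for ch in data:
    if ch == '<':
        record = False   # a tag starts: stop before '<' is stored
    if record:
        recorded += ch
    if ch == '>':
        record = True    # checked last, so '>' itself is never stored

print(recorded)  # prints: ooo
```

Because the `>` check comes after the append, the closing bracket switches recording back on only for the characters that follow it.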

score 0 · Accepted Answer · edited May 23 '17 at 12:28


If you run your code on an HTML file whose first character isn't a <, it will indeed raise an error (a NameError), because delete is read before it has ever been assigned:

# i = 'a'
if i == '<':        # nope
    delete = True
    continue
if i == '>':        # nope
    delete = False
    continue
if delete == True:  # what's delete? (NameError)
    continue

If you want your code to work you need to define delete = False before your loop.
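A minimal sketch of that fix, assuming the goal is only to drop everything between `<` and `>`. The function name `strip_tags` is made up for illustration and is not part of the original snippet:

```python
def strip_tags(html):
    """Return html with everything between '<' and '>' removed (naive)."""
    delete = False  # defined before the loop, so the check below cannot fail
    txt = []
    for ch in html:
        if ch == '<':
            delete = True    # entering a tag: start skipping
            continue
        if ch == '>':
            delete = False   # leaving a tag: resume keeping characters
            continue
        if delete:
            continue
        txt.append(ch)
    return ''.join(txt)


print(strip_tags("<title>ooo</title>"))  # prints: ooo
```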

However, if I understood your (rather long) text, all you want to do is delete all HTML tags? I believe there are safer ways to do so.

Your homemade code has weaknesses of its own: for example, what happens when you encounter <img alt="next->">? What happens if your text is 1<3?
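To make those two cases concrete, here is a small sketch. `naive_strip` is a hypothetical stand-in for the flag-based loop, not code from the question:

```python
def naive_strip(html):
    # Same idea as the flag-based loop: skip everything from '<' to '>'.
    keep = True
    out = []
    for ch in html:
        if ch == '<':
            keep = False
        elif ch == '>':
            keep = True
        elif keep:
            out.append(ch)
    return ''.join(out)


# The '>' inside the attribute value ends the "tag" too early,
# so a stray quote leaks into the output.
print(repr(naive_strip('<img alt="next->">after')))  # prints: '"after'

# The '<' in plain text is mistaken for the start of a tag,
# so real text is silently lost.
print(repr(naive_strip('1<3 and <b>bold</b>')))      # prints: '1bold'
```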

I would strongly recommend using a library for this task.
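If the concern is depending on a third-party package, one possible option is the parser that ships with Python itself. The following is only a sketch, assuming Python 3 and its standard `html.parser` module; the names `VisibleTextParser` and `visible_text` are invented for this example. It skips comments and the contents of `<script>` and `<style>`, but it knows nothing about CSS, so text hidden by styling would still be returned:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; they go to handle_comment,
        # which does nothing by default.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = ('<p>1 &lt; 3</p><!-- hidden --><script>var a = "<b>";</script>'
          '<img alt="next->"><span>shown</span>')
print(visible_text(sample))  # prints: 1 < 3 shown
```

Since the class is plain Python subclassing a standard-library base, it stays as open to modification as the hand-written loop.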

answered Apr 17 '14 at 17:18 by Robin; edited May 23 '17 at 12:28 by Community

The code is rather transparent thus can be modified.. in the situation of `1<3` perhaps we can say the code should be modified so it starts deleting when there is `<"` that would solve that issue. as you can see when the code is not a module there is much more freedom to modify it. this is why i am planning on sticking with this code if i can make it work first. will be testing out ` delete = False ` as recommended. thanks – Void State -- Sümer Kolçak Apr 17 '14 at 17:24
BTW i am testing it out with `delete = False` before the loop begins.. although there are no more errors.. it simply prints `[]` – Void State -- Sümer Kolçak Apr 17 '14 at 17:34
Please edit your question to add the HTML you're working on (at least part of it). The code is not as easy to modify as you seem to think it is: how do you solve the `<img alt="next->">` issue (I'm not sure I understand your other fix)? Your task isn't very original; I believe with minimal research you could find a flexible tool that suits your needs. But obviously, the choice is yours. – Robin Apr 17 '14 at 17:42
When there is `<` The code should be declaring `delete = True` and bypassing everything.. when it hits a `>` it needs to declare `delete = False` ... and for every time `delete` is `False` it is to append the character to `data2` array. – Void State -- Sümer Kolçak Apr 17 '14 at 18:02
I guess the correct way to re-write this code would be to say.. set ignore to false by default and add every character to the array. when it hits a `<` the ignore is set to true and every character is ignored until the `>` – Void State -- Sümer Kolçak Apr 17 '14 at 18:06
@VoidState--SümerKolçak: Isn't that what I suggested, among other things, in my answer? – Robin Apr 17 '14 at 18:09
i actually did not even know how this code worked until i started discussing it in this question/answer.. the only problem is i am trying to get it to print something but so far it's not printing out anything – Void State -- Sümer Kolçak Apr 17 '14 at 18:16
