Python find text with BeautifulSoup

Question

I have an HTML comment at the end of the source file.

<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(18), RENAME_IMAGE(7), MINIFY_JAVASCRIPT(25), (1), JAVASCRIPT_HTML5_CACHE(19), EMBED_JAVASCRIPT(1), RENAME_CSS(3), (1), IMAGE_COMPRESSION(7), RESPONSIVE_IMAGES(6), ASYNC_JAVASCRIPT(2);TextTransApplied:RENAME_JAVASCRIPT(18), RENAME_IMAGE(7), MINIFY_JAVASCRIPT(25), (1), JAVASCRIPT_HTML5_CACHE(19), EMBED_JAVASCRIPT(1), RENAME_CSS(3), (1), IMAGE_COMPRESSION(7), RESPONSIVE_IMAGES(6), ASYNC_JAVASCRIPT(2);TagTransAttempted:(8), ASYNC_JAVASCRIPT(61);TagTransFailed:ASYNC_JAVASCRIPT(42);TagTransApplied:(8), ASYNC_JAVASCRIPT(19); ] -->

Now I want to check whether all the values in parentheses are greater than zero. For instance, I want to get the value 18 from RENAME_JAVASCRIPT and check that it is greater than zero, and likewise for the rest. Since this is a comment and not part of any HTML tag, is there a way to achieve this in BeautifulSoup?

http://stackoverflow.com/questions/6062210/how-to-find-the-comment-tag-with-beautifulsoup — ρss, Dec 22 '14 at 12:23

Padraic Cunningham · Answer 1 · 2014-12-22T13:30:37.547


I would just use re:

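import re
from bs4 import BeautifulSoup

with open("/sample_html.txt") as f:
    soup = BeautifulSoup(f.read())
    # The comment comes after the html, so it is the next sibling of <html>.
    tag = soup.find("html").next_sibling
    # Collect every number in parentheses and check that each one is positive.
    print(all(x > 0 for x in map(int, re.findall(r"\((\d+)\)", tag))))

True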

If you want to see the names:

import re
from bs4 import BeautifulSoup

with open("/sample_html.txt") as f:
    soup = BeautifulSoup(f.read())
    tag = soup.find("html").next_sibling
    # Match NAME(count) pairs; unnamed entries such as "(1)" are skipped.
    for ele in re.findall(r"\w+\(\d+\)", tag):
        if int(ele.split("(")[1].rstrip(")")) > 0:
            print(ele)

RENAME_JAVASCRIPT(18)
RENAME_IMAGE(7)
MINIFY_JAVASCRIPT(25)
JAVASCRIPT_HTML5_CACHE(19)
EMBED_JAVASCRIPT(1)
RENAME_CSS(3)
IMAGE_COMPRESSION(7)
RESPONSIVE_IMAGES(6)
ASYNC_JAVASCRIPT(2)
RENAME_JAVASCRIPT(18)
RENAME_IMAGE(7)
MINIFY_JAVASCRIPT(25)
JAVASCRIPT_HTML5_CACHE(19)
EMBED_JAVASCRIPT(1)
RENAME_CSS(3)
IMAGE_COMPRESSION(7)
RESPONSIVE_IMAGES(6)
ASYNC_JAVASCRIPT(2)
ASYNC_JAVASCRIPT(61)
ASYNC_JAVASCRIPT(42)
ASYNC_JAVASCRIPT(19)
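
If the comment is not reliably the next sibling of the html tag, one alternative is to search for it by node type. The following is only a sketch under stated assumptions: the `html` string is a shortened, made-up stand-in for the real page, the helper names `find_feo_comment` and `counts_in` are hypothetical, and the built-in `"html.parser"` backend is my choice rather than anything from the question.

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical, shortened stand-in for the real page source.
html = """<html><body><p>page</p></body></html>
<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(18), (1);TagTransFailed:ASYNC_JAVASCRIPT(42); ] -->"""


def find_feo_comment(markup):
    """Return the text of the FEO debug comment, or None if there is none."""
    soup = BeautifulSoup(markup, "html.parser")
    # Comment is a NavigableString subclass, so a string filter can select it.
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    for comment in comments:
        if "FEO DEBUG OUTPUT" in comment:
            return str(comment)
    return None


def counts_in(comment):
    """Return every parenthesised number in the comment as an int."""
    return [int(n) for n in re.findall(r"\((\d+)\)", comment)]


comment = find_feo_comment(html)
counts = counts_in(comment)
print(counts)
print(all(n > 0 for n in counts))
```

Because the lookup goes by type and marker text, it should not matter where the parser attaches the comment in the tree.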

edited Dec 22 '14 at 13:30

answered Dec 22 '14 at 12:38

Padraic Cunningham


Throws the following error: Traceback (most recent call last): File "body_parser.py", line 119, in <module> print(all( x > 0 for x in map(int,re.findall("\((\d+)\)",feed)))) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall return _compile(pattern, flags).findall(string) TypeError: expected string or buffer – station Dec 22 '14 at 12:42
Oh I get it , my input will be the whole HTML source and that comment will be at the end after – station Dec 22 '14 at 12:45
yes, I presumed you had extracted the html you provided in your question – Padraic Cunningham Dec 22 '14 at 12:47
No .. which is why I wanted to figure out a way with BeautifulSoup – station Dec 22 '14 at 12:49
Can you add a link to the HTML? Getting the comment should be easy – Padraic Cunningham Dec 22 '14 at 12:52
You can also try hitting https://www.ubank.com.au/ with the request header "pragma: akamai-x-feo-trace" and look at the HTML source. – station Dec 22 '14 at 12:57
This again throws an error. I cannot parse the HTML as such since it does not come in the same order. Just need to use Regex to get the info out – station Dec 22 '14 at 13:15
This works but prints false print(all( x > 0 for x in map(int,re.findall("\((\d+)\)",str(feed))))) – station Dec 22 '14 at 13:18
it prints True for me so you must be doing something incorrectly, I also don't understand *I cannot parse the HTML as such since it does not come in the same order.*, you said the comment is after the html so there is no parsing of the html to be done – Padraic Cunningham Dec 22 '14 at 13:26
OK, so finally this prints it: for ele in re.findall("[A-Z]+_[A-Z]+\(\d+\)",str(feed)): print ele – station Dec 22 '14 at 13:30
you are parsing the whole html with re, why are you not just parsing using next_sibling as in the code I provided? – Padraic Cunningham Dec 22 '14 at 13:32
If you are going to parse it all as a string `tag = f.read().split("!-- FEO DEBUG OUTPUT")[1]` would be a lot easier – Padraic Cunningham Dec 22 '14 at 13:40
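
The string-splitting idea from the last comment can be sketched without BeautifulSoup at all. This is a rough illustration, not the commenter's code: the `parse_feo` helper and the sample `source` string are hypothetical, and it assumes the comment keeps the `Section:NAME(n), NAME(n);` layout shown in the question.

```python
import re

MARKER = "<!-- FEO DEBUG OUTPUT"


def parse_feo(source):
    """Return {section: [(name, count), ...]} from the FEO debug comment.

    Raises IndexError if the marker is not present in the source.
    """
    # Keep only the text between the marker and the closing "-->".
    body = source.split(MARKER, 1)[1].split("-->", 1)[0]
    sections = {}
    # Each section looks like "Name:ITEM(3), (1), OTHER(7);".
    for section, items in re.findall(r"(\w+):([^;]*);", body):
        # The name may be empty, as in the unnamed "(1)" entries.
        pairs = re.findall(r"(\w*)\((\d+)\)", items)
        sections[section] = [(name, int(count)) for name, count in pairs]
    return sections


# Hypothetical, shortened stand-in for the real page source.
source = (
    "<html><body></body></html>\n"
    "<!-- FEO DEBUG OUTPUT [TagTransAttempted:(8), ASYNC_JAVASCRIPT(61);"
    "TagTransFailed:ASYNC_JAVASCRIPT(42); ] -->"
)

result = parse_feo(source)
for section, pairs in result.items():
    print(section, pairs, all(count > 0 for _, count in pairs))
```

Grouping by section keeps the attempted, applied and failed counts apart, which a single flat `findall` over the whole comment does not.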
