BeautifulSoup extract text from comment html

Question

Apologies if this question is similar to others; I wasn't able to make any of the other solutions work. I'm scraping a website using BeautifulSoup and I am trying to get the information from a table field that's commented out:

<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">

                      <span class="views" clicks="1564058">1.56M Clicks</span>

                        <span class="interaction" likes="0"></span>

    </p>-->
</td>

How do I get the part 'views' and 'interaction'?

try `soup.select('span[class="views"]')` (and with `interaction` respectively) — MCO, Oct 06 '18 at 12:44
@DušanMaďar I tried these, but I get `AttributeError: 'Comment' object has no attribute 'decompose'` and with the `comments.extract()` it works but provides no results — Claudine, Oct 06 '18 at 12:50
@MCO `soup.select('span[class="views"]')` provides me with the following empty result `[]` , but not sure what you mean with 'and with interaction respectively'? — Claudine, Oct 06 '18 at 12:52
@Claudine try using `extract`: https://stackoverflow.com/a/33139332/4183498 — Dušan Maďar, Oct 06 '18 at 12:54
@DušanMaďar saw i tried the wrong one, but the extract is providing me no results unfortunately — Claudine, Oct 06 '18 at 12:55

Dan-Dev · Answer 1 · 2018-10-06T15:15:55.770

5

You need to extract the HTML from the comment and parse it again with BeautifulSoup like this:

from bs4 import BeautifulSoup, Comment
html = """<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">

                      <span class="views" clicks="1564058">1.56M Clicks</span>

                        <span class="interaction" likes="0"></span>

    </p>-->
</td>"""
soup = BeautifulSoup(html, 'lxml')
# The commented-out markup is a single Comment node, not parsed tags
comment = soup.find(text=lambda text: isinstance(text, Comment))
# Parse the comment's contents as HTML in their own right
commentsoup = BeautifulSoup(comment, 'lxml')
views = commentsoup.find('span', {'class': 'views'})
interaction = commentsoup.find('span', {'class': 'interaction'})
print(views.get_text(), interaction['likes'])

Outputs:

1.56M Clicks 0

If the comment is not the first on the page you would need to index it like this:

comment = soup.find_all(text=lambda text:isinstance(text, Comment))[1]

or find it from a parent element.
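
Both options can be sketched roughly as follows. This is only an illustration: the sample markup, the `stats` class and the helper `is_comment` are made up for the example, and it uses the built-in `html.parser` and the `string=` argument (the newer name for `text=`) rather than what the code above uses.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: the first comment is unrelated, the second holds the data.
html = """<table>
<tr><td><!-- layout note --></td></tr>
<tr><td class="stats"><!--<p class="statistics"><span class="views" clicks="1564058">1.56M Clicks</span></p>--></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")


def is_comment(node):
    """Return True for Comment nodes only (hypothetical helper)."""
    return isinstance(node, Comment)


# Option 1: collect every comment and pick one by position.
comments = soup.find_all(string=is_comment)
by_index = comments[1]

# Option 2: narrow the search to a parent element first.
cell = soup.find("td", {"class": "stats"})
by_parent = cell.find(string=is_comment)

# Either way, the comment text still has to be parsed again.
views = BeautifulSoup(by_parent, "html.parser").find("span", {"class": "views"})
print(views.get_text(), views["clicks"])
```

Searching from a parent element tends to be less fragile than a fixed index, since it does not depend on how many other comments the page contains.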

Updated in response to comment:

You can use the parent 'tr' element for this. The page you supplied has "shares", not "interaction", so I expect you got a NoneType object, which gave you the error you saw. You could add tests in your code for NoneType objects if you need to.

from bs4 import BeautifulSoup, Comment
import requests
url = "https://imvdb.com/calendar/2018?page=1"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')

# Look for the comment inside each table row and parse it separately
for tr in soup.find_all('tr'):
    comment = tr.find(text=lambda text: isinstance(text, Comment))
    commentsoup = BeautifulSoup(comment, 'lxml')
    views = commentsoup.find('span', {'class': 'views'})
    shares = commentsoup.find('span', {'class': 'shares'})
    print(views.get_text(), shares['data-shares'])

Outputs:

3.60K Views 0
1.56M Views 0
220.28K Views 0
6.09M Views 0
133.04K Views 0
163.62M Views 0
30.44K Views 0
2.95M Views 0
2.10M Views 0
83.21K Views 0
5.27K Views 0
...
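
The tests for NoneType objects mentioned above could look something like the sketch below. It is not run against the real page: the rows are hypothetical stand-ins (a header row without a comment, a complete row, and a row whose comment has no "shares" span), and it assumes the built-in `html.parser` so that it runs without `lxml` or a network connection.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical rows standing in for the real page.
html = """<table>
<tr><th>Stats</th></tr>
<tr><td><!--<p><span class="views">3.60K Views</span><span class="shares" data-shares="0"></span></p>--></td></tr>
<tr><td><!--<p><span class="views">1.56M Views</span></p>--></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

results = []
for tr in soup.find_all("tr"):
    comment = tr.find(string=lambda node: isinstance(node, Comment))
    if comment is None:
        # No comment in this row (e.g. a header row), so there is nothing to parse.
        continue
    commentsoup = BeautifulSoup(comment, "html.parser")
    views = commentsoup.find("span", {"class": "views"})
    shares = commentsoup.find("span", {"class": "shares"})
    # find() returns None when the span is missing, so guard each lookup.
    results.append((
        views.get_text() if views is not None else None,
        shares.get("data-shares") if shares is not None else None,
    ))

for row in results:
    print(row)
```

Skipping rows without a comment avoids passing `None` to `BeautifulSoup`, which is the likely source of the `expected string or bytes-like object` error.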

edited Oct 06 '18 at 15:15

answered Oct 06 '18 at 13:00

Dan-Dev


What do I input as `html = ` in this case? I have more than one table, and each row has a commented value. This is how I find the commented part of the code: `table = soup.find('table',{'class':'contentTable'}) for tr in table.find_all('tr'): comment_section = tr.find_all('td')[2] print(comment_section)`, which outputs the code in my first post. If I try `html = comment_section` it doesn't work: `NoneType object is not callable` – Claudine Oct 06 '18 at 13:26
Can you post a URL? – Dan-Dev Oct 06 '18 at 13:29
Found it! Many thanks for the help :) – Claudine Oct 06 '18 at 13:36
One more follow up question: it gets stuck on `commentsoup` returning a TypeError `expected string or bytes-like object` – Claudine Oct 06 '18 at 14:43
Can you post a URL? or a link to the full HTML source code? – Dan-Dev Oct 06 '18 at 14:54
https://imvdb.com/calendar/2018?page=1 it'll be looking at the table `imvdbTable` – Claudine Oct 06 '18 at 15:01
Updated in response to comment. – Dan-Dev Oct 06 '18 at 15:16
Wow, I've been looking for 2 days, trying to figure out how to extract the information I needed. This is the first response that clearly stated that once you extract the comments, you then have to parse those, to use it. Thank you! – rchap Sep 30 '21 at 19:34

score 1 · Answer 2 · answered Oct 06 '18 at 15:44

The simplest solution is to use the .replace() function. All you need to do is strip the `<!--` and `-->` markers from the HTML; the rest stays as it is. Take a look at the script below.

from bs4 import BeautifulSoup

htdoc = """
<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">
        <span class="views" clicks="1564058">1.56M Clicks</span>
        <span class="interaction" likes="0"></span>
    </p>-->
</td>
"""
elem = htdoc.replace("<!--","").replace("-->","")
soup = BeautifulSoup(elem,'lxml')
views = soup.select_one('span.views').get_text(strip=True)
likes = soup.select_one('span.interaction')['likes']
print(f'{views}\n{likes}')

Output:

1.56M Clicks
0
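
Note that calling `.replace()` on the whole document uncomments every comment in it. If that is a concern, one possible variant is to strip the markers from a single element only. The sketch below assumes hypothetical `note` and `stats` classes on the cells and uses the built-in `html.parser`; neither comes from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one comment is an unrelated draft, one holds the data.
htdoc = """
<table>
<tr><td class="note"><!-- <span class="views">draft</span> --></td></tr>
<tr><td class="stats"><!--<p class="statistics">
    <span class="views" clicks="1564058">1.56M Clicks</span>
</p>--></td></tr>
</table>
"""

soup = BeautifulSoup(htdoc, "html.parser")

# While the markup is still commented out, no span is visible to the parser.
visible_before = soup.select("span.views")

# Strip the comment markers from this one cell only, then parse it again.
cell = soup.find("td", {"class": "stats"})
uncommented = str(cell).replace("<!--", "").replace("-->", "")
cell_soup = BeautifulSoup(uncommented, "html.parser")

views = cell_soup.select_one("span.views").get_text(strip=True)
print(views)
```

The draft span in the first cell stays commented out, because only the text of the `stats` cell is rewritten.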

score 0 · Answer 3 · answered Oct 06 '18 at 13:00

If you want only the views then:

views = soup.findAll("span", {"class": "views"})

You also can get the whole paragraph with:

p = soup.findAll("p", {"class": "statistics"})

Then you can get the data from the p.

answered Oct 06 '18 at 13:00

GTDiablo


Unfortunately this doesn't work, it gives me the following output `[]` – Claudine Oct 06 '18 at 13:03
