I am trying to extract the "1 min" from the HTML below using BeautifulSoup:

<ul class="date-list infos">
 <li>
    <div class="date-list--time">1 min</div>
    <div class="date-list--extras"></div>
 </li>
 <li>
   <div class="date-list--time">30 min</div>
   <div class="date-list--extras"></div>
 </li>
</ul>

To do this, I wrote the following Python code:

# import libraries
import urllib2
from bs4 import BeautifulSoup

# specify the url
quote_page = 'http://beta.stm.info/fr/infos/reseaux/bus/reseau-local/ligne-51-est/56184'

page = urllib2.urlopen(quote_page)

# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')

# EXTRACT FIELD 1
name_titre = soup.find('div', attrs={'class': 'label not-accessible'})
name_t = name_titre.text.strip()
print name_t

# EXTRACT FIELD 2
time_passage = soup.find('div', attrs={'class': "date-list--time"})
t_passage = time_passage
print t_passage

This worked well for the other field I wanted to extract (EXTRACT FIELD 1), but for EXTRACT FIELD 2 the printed output is just "None".

Could someone please tell me what I am doing wrong? I suspect the issue is that the HTML contains a list of items for EXTRACT FIELD 2, but I am not sure.

Thanks!

  • Can you post all of your code? I was unable to reproduce your issue. Maybe it has something to do with how you are defining `soup`? – elethan Jan 22 '18 at 03:04
  • You need to add `.text` to the variable you print. Check the Beautiful Soup documentation for examples. – rawsly Jan 22 '18 at 03:11
  • just added the code as requested by elethan... thanks! – oriolsierra Jan 22 '18 at 03:18
  • Your second-to-last line should be `t_passage = time_passage.text`. If that still fails, perhaps the page retrieved by your program differs from what your browser displays (this happens with pages that rely on JavaScript and the like). – r.ook Jan 22 '18 at 03:22
  • It looks like the class `date-list--time` does not appear in the content of `page`. – elethan Jan 22 '18 at 03:24
  • When I add the .text I get: AttributeError: 'NoneType' object has no attribute 'text' – oriolsierra Jan 22 '18 at 03:25
  • Yes, in your case `time_passage` is expected to be `None` because the page you are parsing does not include the class you are targeting. – elethan Jan 22 '18 at 03:27
  • Hmm, when I visit that link directly I do indeed see the class in the page source. I am not sure why it does not come back in the response... – elethan Jan 22 '18 at 03:29
  • It seems to me that the issue is that there are several nested elements with the class "date-list--time". They are nested inside the "date-list infos" list. I guess there is a way to extract data from nested lists... – oriolsierra Jan 22 '18 at 03:39
  • See my updated answer. I think it is because those divs are probably generated dynamically when the page loads. I found a solution that works, but I am not sure if it will be suitable for your use case. – elethan Jan 22 '18 at 03:41

1 Answer

Try the following:

In [1]: from bs4 import BeautifulSoup

In [2]: html = '''<ul class="date-list infos">
   ...:  <li>
   ...:     <div class="date-list--time">1 min</div>
   ...:     <div class="date-list--extras"></div>
   ...:  </li>
   ...:  <li>
   ...:    <div class="date-list--time">30 min</div>
   ...:    <div class="date-list--extras"></div>
   ...:  </li>
   ...: </ul>'''

In [3]: soup = BeautifulSoup(html, 'html.parser')

In [4]: time_passage = soup.find('div', attrs={'class': "date-list--time"})

In [5]: print time_passage
<div class="date-list--time">1 min</div>

To get the text of the div:

In [6]: print time_passage.text
1 min

My [4] and [5] are just copied from your example code, so maybe your soup object is not what you think it is. I would try to do what you are doing interactively as in my example, and if it still doesn't work as you expect, inspect the objects that you are working with, e.g., what is soup? What string was it parsed from? etc.
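One way to do that kind of inspection, as a minimal sketch: check whether the class name occurs at all in the raw string before parsing it, and guard against find returning None. The helper name extract_first_time and the two sample strings are hypothetical and only for illustration. The code uses single-argument print(...) so it should run on both Python 2 and Python 3.

```python
from bs4 import BeautifulSoup


def extract_first_time(html, css_class="date-list--time"):
    """Return the stripped text of the first matching div, or None."""
    # If the class name is not even in the raw markup, parsing cannot find it.
    if css_class not in html:
        return None
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", attrs={"class": css_class})
    # find() returns None when nothing matches, so guard before using .text
    return tag.text.strip() if tag is not None else None


# Hypothetical inputs: one with the div present, one where the list is
# still empty (as it might be before JavaScript fills it in).
static_html = '<ul><li><div class="date-list--time">1 min</div></li></ul>'
empty_html = '<ul class="date-list infos"></ul>'

print(extract_first_time(static_html))
print(extract_first_time(empty_html))
```

If the second call is what you see for the downloaded page, the problem is in what was fetched, not in how it was parsed.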

Another caveat with BeautifulSoup: if you access an attribute that a Tag object does not have, you get None instead of an AttributeError, because BeautifulSoup treats the unknown name as a search for a child tag with that name. So if you accidentally write time_passage.txt, you get None instead of the value you expected, with no AttributeError to tell you that you made a mistake.
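A small sketch of that caveat, using a one-line HTML string made up for the demonstration. It contrasts a typo on a real Tag, which silently gives None, with attribute access on None itself, which raises.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="date-list--time">1 min</div>', "html.parser")
tag = soup.find("div")

# .text is a real attribute of Tag and returns the contained string.
print(tag.text)

# .txt is not; it is treated as a lookup for a child <txt> tag,
# which does not exist here, so the result is None.
typo_result = tag.txt
print(typo_result)

# When find() itself matched nothing, the result is None, and
# attribute access on None fails loudly.
missing = soup.find("div", attrs={"class": "no-such-class"})
try:
    missing.text
except AttributeError as exc:
    print("AttributeError: %s" % exc)
```

The last case is the error reported in the comments: the AttributeError there means find matched nothing, not that .text is the wrong attribute.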

Update:

It seems that the content you are trying to get is generated dynamically and is not rendered in the response to your request, so I don't think you will be able to get it the way you are trying to (though I could be wrong). One solution would be to use Selenium, as described in this answer:

In [7]: from selenium import webdriver

In [8]: driver = webdriver.Chrome()

In [9]: driver.get('http://beta.stm.info/fr/infos/reseaux/bus/reseau-local/ligne-51-est/56184')

In [10]: html = driver.page_source

In [11]: soup = BeautifulSoup(html, 'html.parser')


In [12]: time_passage = soup.find('div', attrs={'class': "date-list--time"})

In [13]: time_passage.text
Out[13]: u'1 min'
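Note that find only returns the first match. If you want every time in the list rather than just the first, find_all returns all matching tags. A short sketch, assuming the rendered HTML has the structure shown in the question; the static snippet stands in for driver.page_source so the example runs without a browser.

```python
from bs4 import BeautifulSoup

# Static copy of the markup from the question, standing in for the
# rendered page source.
html = '''<ul class="date-list infos">
 <li>
    <div class="date-list--time">1 min</div>
    <div class="date-list--extras"></div>
 </li>
 <li>
   <div class="date-list--time">30 min</div>
   <div class="date-list--extras"></div>
 </li>
</ul>'''

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every div with the target class, in document order.
times = [div.get_text(strip=True)
         for div in soup.find_all("div", class_="date-list--time")]
print(times)
```

The nesting inside the ul and li elements does not matter here, since find and find_all both search all descendants by default.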
elethan