
I am using urllib2 to fetch a web page, and I need to look for a specific value within the returned data.

Is it better to do this with Beautiful Soup and its find method, or with a regex search over the data?

Here is a very basic example of the text that is returned by the request:

<html>
<body>
<table> 
   <tbody> 
      <tr>
         <td>
            <div id="123" class="services">
               <table>
                  <tbody>
                     <tr>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> Example BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                     </tr>

                     <tr>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                     </tr>

                     <tr>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                        <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
                     </tr>
                  </tbody>
               </table>
            </div>
         </td>
      </tr>
   </tbody>
</table>
</body>
</html>

In this case I want to return "Example BLAB BLAB BLAB". The only part that stays constant is the word "Example", and I want to return all of the text within the tag that contains it.

Ciaran

1 Answer


Don't use regular expressions to parse HTML/XML.
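To illustrate why, here is a minimal sketch that uses only the standard library. The two cell strings, the `PATTERN` regex, and the `CellText` helper are made up for this example; they are not from the question. The same cell is written twice with different attribute order and quoting. A regex written against one form misses the other, while a parser does not care.

```python
import re
from html.parser import HTMLParser

# Two renderings of the same cell; only attribute order and quoting differ.
CELL_A = '<td class="style8" bgcolor="ffffff"> Example BLAB </td>'
CELL_B = "<td bgcolor='ffffff' class='style8'> Example BLAB </td>"

# A regex written against the first rendering only (hypothetical pattern).
PATTERN = re.compile(r'<td class="style8"[^>]*>(.*?)</td>')


class CellText(HTMLParser):
    """Collects the text of every <td> whose class attribute is 'style8'."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "style8":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def parse_cells(html):
    """Return the stripped text of each matching cell in the given HTML."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# The regex finds the cell in CELL_A but returns nothing for CELL_B.
regex_hits = [PATTERN.findall(cell) for cell in (CELL_A, CELL_B)]
# The parser finds the cell in both renderings.
parser_hits = [parse_cells(cell) for cell in (CELL_A, CELL_B)]

print(regex_hits)
print(parser_hits)
```

The regex could be patched to handle this one variation, but real pages vary in many more ways (whitespace, nesting, missing quotes), which is the reason to prefer a parser.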

With BeautifulSoup, you can use a CSS selector:

>>> from bs4 import BeautifulSoup
>>>
>>> html_str = '''
... <html>
... <body>
... <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> Example BLAB BLAB BLAB </td>
... <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
... <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
... <td style="PADDING-LEFT:  5px"bgcolor="ffffff" class="style8"> BLAB BLAB BLAB </td>
... </body>
... </html>
... '''
>>> soup = BeautifulSoup(html_str, 'html.parser')
>>> for td in soup.select('.style8'):
...     print(td.text)
...
 Example BLAB BLAB BLAB
 BLAB BLAB BLAB
 BLAB BLAB BLAB
 BLAB BLAB BLAB
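
The question asks for only the cell that contains "Example", and the edited markup nests the cells inside a `div` with id `123`. A minimal sketch of that lookup follows, assuming `bs4` is installed and the built-in `html.parser` is acceptable. The `find_cell` helper and the trimmed-down `html_str` are illustrative, not part of the original page.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (assumed structure).
html_str = '''
<div id="123" class="services">
  <table><tbody>
    <tr>
      <td class="style8"> Example BLAB BLAB BLAB </td>
      <td class="style8"> BLAB BLAB BLAB </td>
    </tr>
  </tbody></table>
</div>
'''

soup = BeautifulSoup(html_str, 'html.parser')


def find_cell(soup, marker):
    """Return the text of the first style8 cell in div#123 that starts with marker."""
    container = soup.find('div', id='123')
    if container is None:
        return None
    for td in container.find_all('td', class_='style8'):
        text = td.get_text(strip=True)
        if text.startswith(marker):
            return text
    return None


result = find_cell(soup, 'Example')
print(result)
```

Returning `None` when the container or the cell is missing makes it easier to tell "the page did not parse as expected" apart from "the value is absent".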
falsetru
  • It wouldn't work for my particular use case as I have not fully explained the structure of the data I think. I have edited the original post – Ciaran Feb 18 '14 at 16:43
  • @Ciaran, Do you want to get the text of the first `td` element inside the `div` with id `123`? – falsetru Feb 18 '14 at 16:51
  • It prints nothing when I attempt to select by the css class. I can print the entire soup and can confirm that it is being rendered correctly – Ciaran Feb 18 '14 at 16:58
  • @Ciaran, There was a typo in the code. Fixed it (`.styles8` to `.style8`). Check it out. – falsetru Feb 18 '14 at 17:01
  • @falsetru, Yea I caught that originally, forgot to point it out :) Still doesn't work.. It won't print anything by their class name... My first script using python instead of PHP. Thanks for all the help – Ciaran Feb 18 '14 at 17:07
  • @Ciaran, It works for me. See http://asciinema.org/a/7713 . By the way, did you install `bs4` (BeautifulSoup4), not `BeautifulSoup` (BeautifulSoup3) ? – falsetru Feb 18 '14 at 17:09
  • @falsetru, I have installed bs4. I think it must be that BeautifulSoup cannot parse a certain part of the page, which is why it is unable to find the style8 class. Is there any way to validate the content of the page or to skip non-parsable sections? – Ciaran Feb 19 '14 at 09:48
  • @Ciaran, The HTML given in the question works perfectly, as I show in the asciinema recording. I don't understand what you mean. – falsetru Feb 19 '14 at 09:48
  • @Ciaran, If bs4 does not parse your content, please post a separate question about it. – falsetru Feb 19 '14 at 09:50
  • @falsetru, Sorry for the misunderstanding, that is only part of the HTML content of the page. I will create another question. Thanks for the help, it is much appreciated – Ciaran Feb 19 '14 at 10:21
  • @Ciaran, You'd better include the full content of the page, or its URL, in the question so that answerers can retrieve it. – falsetru Feb 19 '14 at 10:23
  • Yea will do, the page is very lengthy which is why I did not post it here. – Ciaran Feb 19 '14 at 10:24