
I recently started working on a program in Python that lets the user conjugate any verb easily. To do this, I am using the urllib module to open the corresponding conjugation web page. For example, the verb "beber" has the web page:

"http://www.spanishdict.com/conjugate/beber"

To open the page, I use the following python code:

source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()

This source does contain the information that I want to parse. But, when I make a BeautifulSoup object out of it like this:

soup = BeautifulSoup(source)

I appear to lose all the information I want to parse. The information lost when making the BeautifulSoup object usually looks something like this:

<tr>
  <td class="verb-pronoun-row">yo</td>
  <td class="">bebo</td>
  <td class="">bebí</td>
  <td class="">bebía</td>
  <td class="">bebería</td>
  <td class="">beberé</td>
</tr>
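As a sanity check, a fragment shaped like the row above does parse in isolation when the parser is named explicitly (a minimal sketch: the class name comes from the snippet, the cell values are stand-ins, and the accent is dropped to keep the example ASCII-only):

```python
# Minimal sketch: parse a fragment like the one above with an
# explicitly chosen parser and pull out the cell texts.
from bs4 import BeautifulSoup

row_html = """
<tr>
  <td class="verb-pronoun-row">yo</td>
  <td class="">bebo</td>
  <td class="">bebi</td>
</tr>
"""

soup = BeautifulSoup(row_html, "html.parser")
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells)  # ['yo', 'bebo', 'bebi']
```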

What am I doing wrong? I am no professional at Python or Web Parsing in general, so it may be a simple problem.

Here is my complete code (I used the "++++++" to differentiate the two):

import urllib
from bs4 import BeautifulSoup

source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
soup = BeautifulSoup(source)

print source
print "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"
print str(soup)
user1594328
  • What do you do with the 'soup' variable? How did you determine information was lost? – Matt Feb 23 '13 at 19:31
  • If I try to print the prettified version of the 'soup' variable, it doesn't contain the information I want. – user1594328 Feb 23 '13 at 19:34
  • 1
    I also tested it here, and didn't notice any information loss. In particular, when calling `str(soup)` and searching it I found the exact text you pasted above (*Edit:* same with `soup.prettify()`). Maybe your problem is in the way you're trying to retrieve that info, so please post the code where you use `soup`. – mgibsonbr Feb 23 '13 at 19:34
  • How do you know you are losing it? Using `bs4`, I didn't lose any of the information. – Joel Cornett Feb 23 '13 at 19:40
  • I changed the main post, please check it to see the code I have used. – user1594328 Feb 23 '13 at 19:41
  • I cut and pasted your code into my python interpreter and it worked perfectly. (Python 2.7.2) – Matt Feb 23 '13 at 19:45
  • 1
    Ah, I see the problem. The data is being truncated. Compare: len(source) vs. len(str(soup)). This is probably an encoding issue. – Matt Feb 23 '13 at 19:50
  • Exactly. What does this mean, and can I fix it easily? – user1594328 Feb 23 '13 at 19:52
  • I don't think encoding is the problem, since it seems to have recognized the encoding correctly. However, BeautifulSoup is doing more transformations in the source (like converting `<` and `>` to their HTML Entities), so I don't expect the lengths to match. Besides, printing the results of either `str(soup)` and `soup.prettify()` to a file and opening with Firefox produces pages that look exactly like the source. I have no idea why the length decreased, but so far I couldn't identify any information loss... – mgibsonbr Feb 23 '13 at 20:08
  • I still suspect an encoding issue. The problem matches up well with [this](http://stackoverflow.com/questions/7219361/python-and-beautifulsoup-encoding-issues). Maybe the length is a red herring though. @user1594328, can you give an example of calling a method on soup that should work but doesn't? – Matt Feb 23 '13 at 20:16
  • When using `str(soup).find("bebemos")` (a conjugation I know is in the original source), it returns an index of -1. On the other hand, calling the find() method on the original source does return a real index. – user1594328 Feb 23 '13 at 20:47
  • @user1594328 Do you have a different environment to test it? I ran your code both in Windows and Linux (Python 2.7) and it worked fine. However, when running under PyPy, `BeautifulSoup` worked fine but `bs4` did not (couldn't find "bebemos", just like you - and the text was heavily truncated). You might have found a bug in the library. – mgibsonbr Feb 23 '13 at 21:31
  • I am using the built-in IDLE of Python 2.7 to test this code on Windows 8. In what other environment could I test it in? – user1594328 Feb 23 '13 at 21:39

2 Answers


When I've written parsers, I've had problems with BeautifulSoup: in some cases, because of broken HTML, it didn't find elements that lxml found, and vice versa. Try using lxml.html instead.
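The same extraction with lxml.html might look like this (a sketch: the markup is a stand-in modeled on the question's snippet, not fetched live):

```python
# Sketch: extract the cell texts with lxml.html instead of bs4.
# lxml's parser is lenient and may handle broken markup differently.
import lxml.html

source = """
<table><tr>
  <td class="verb-pronoun-row">yo</td>
  <td class="">bebo</td>
</tr></table>
"""

tree = lxml.html.fromstring(source)
cells = [td.text_content().strip() for td in tree.xpath("//td")]
print(cells)  # ['yo', 'bebo']
```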

Ellochka Cannibal

Your problem may be with encoding. I think bs4 works with UTF-8, while your machine has a different default encoding (one that contains Spanish letters). So urllib requests the page in your default encoding; the data is there in the source and even prints out fine, but when you pass it to UTF-8-based bs4, those characters are lost. Try setting a different encoding in bs4, ideally your system default. This is just a guess though, take it easy.

I recommend using regular expressions; I have used them for all my web crawlers. Whether this is usable for you depends on how dynamic the website is, but that problem exists even when you use bs4. You just write all your regular expressions manually and let them do the magic. You would have to work with bs4 in a similar way when looking for the information you want.
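A sketch of that regex approach, run against a stand-in snippet shaped like the question's markup (the pattern is tied to this exact structure and will break if the page layout changes):

```python
# Sketch: pull the cell contents straight out of the raw HTML with a
# regular expression, skipping the HTML parser entirely.
import re

source = ('<td class="verb-pronoun-row">yo</td>'
          '<td class="">bebo</td><td class="">bebi</td>')

cells = [m.strip() for m in re.findall(r"<td[^>]*>(.*?)</td>", source, re.S)]
print(cells)  # ['yo', 'bebo', 'bebi']
```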

tkp33