Scraping with Beautiful Soup: Why won't the get_text method return the text of this element?

Question

Lately I've been working on a Python project that scrapes a few websites for proxies. On one well-known proxy site, Beautiful Soup doesn't do what I expect when I ask it for the IPs in the table of proxies. When I call Beautiful Soup's .get_text() method on the element that holds each proxy's IP, I get output like this:

...

.UbZT{display:none}
.f5fa{display:inline}
.Glj2{display:none}
.cUce{display:inline}
.zjUZ{display:none}
.GzLS{display:inline}
98120169.117.186373161218218.83839393101138154165203242 

...

Here's the element that I'm trying to parse (the td tag which contains the IP):

<td><span><style>
.lLXJ{display:none}
.qRCB{display:inline}
.qC69{display:none}
.V0zO{display:inline}
</style><span style="display: inline">190</span><span class="V0zO">.</span><span 
style="display:none">2</span><div style="display:none">20</div><span 
style="display:none">51</span><span style="display:none">56</span><div 
style="display:none">56</div><span style="display:none">61</span><span 
class="lLXJ">61</span><div style="display:none">61</div><span 
class="qC69">110</span><div 
style="display:none">110</div><span style="display:none">135</span><div 
style="display:none">135</div><span class="V0zO">221</span><span 
style="display:none">234</span><div style="display:none">234</div><span class="147">.
</span><span style="display: inline">29</span><div style="display:none">44</div><span 
style="display:none">228</span><span></span><span class="qC69">248</span>.<span 
style="display:none">7</span><span></span><span style="display:none">44</span><span 
class="qC69">44</span><span class="qC69">80</span><span></span><span 
style="display:none">85</span><span class="lLXJ">85</span><div 
style="display:none">85</div><span class="qC69">100</span><div 
style="display:none">100</div><span></span><span class="qC69">130</span><div 
style="display:none">130</div><div style="display:none">168</div>212<span 
style="display:none">230</span><span class="qC69">230</span><div 
style="display:none">230</div></span></td>

The text that the browser actually displays for this element is simply the proxy's IP.

Here's the snippet of my code:

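import requests
from bs4 import BeautifulSoup as Soup

# Hide My Ass
pages = ['https://www.hidemyass.com/proxy-list']
num_found = 0  # number of proxies stored so far

for page in pages:
    # name the parser explicitly so the result doesn't depend on what is installed
    hidemyass = Soup(requests.get(page).text, 'html.parser')
    rows = hidemyass.find_all(lambda tag: tag.name == 'tr' and tag.has_attr('class'))
    for row in rows:
        fields = row.find_all('td')
        # get ip, port, and protocol for proxy
        ip = fields[1].get_text()            # <-- Here's the above td element
        port = fields[2].get_text()
        protocol = fields[6].get_text().lower()
        # store proxy in database (db is assumed to be defined elsewhere in the project)
        db.add_proxy({'ip': ip, 'port': port, 'protocol': protocol})
        num_found += 1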

Is there a correct way to parse this element so that the output won't get jumbled up like this? It seems intuitive that Beautiful Soup's .get_text() method would return exactly the text that is visible on the site, but I suppose that's not true. Thanks for any help or advice.

score 5 · Accepted Answer · edited May 23 '17 at 11:43

BeautifulSoup cannot distinguish visible text from other text in the HTML markup: get_text() simply concatenates the text nodes it finds, whether or not CSS hides them. This particular website deliberately obfuscates its markup, which makes scraping the page harder. You can try to work out which text is visible, but it is not easy: many irrelevant elements are inserted and hidden either directly through an inline style attribute or through a class defined in the embedded style block. Some parts of the IP are inside spans, and some are bare text that is not wrapped in a tag of its own.
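
For markup like the cell shown in the question, the filtering described above can be sketched with Beautiful Soup alone. This is only a minimal sketch under stated assumptions: it assumes that hidden elements are marked either by an inline display:none style or by a class that the cell's own style block sets to display:none, and that the markup is not changed further by JavaScript. The names visible_text, HIDDEN_RULE, HIDDEN_INLINE and SAMPLE_CELL are made up for this illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical helper names; none of this comes from the site or from bs4 itself.
# Matches stylesheet rules such as ".lLXJ{display:none}" and captures the class name.
HIDDEN_RULE = re.compile(r"\.([\w-]+)\s*\{\s*display\s*:\s*none\s*\}", re.I)
# Matches an inline style attribute value such as "display:none".
HIDDEN_INLINE = re.compile(r"display\s*:\s*none", re.I)


def visible_text(cell_html):
    """Return the text of an HTML fragment after dropping CSS-hidden elements."""
    soup = BeautifulSoup(cell_html, "html.parser")

    # Collect the classes that the embedded stylesheet hides, and detach the
    # <style> tags so their CSS cannot leak into the extracted text.
    hidden_classes = set()
    for style in soup.find_all("style"):
        hidden_classes.update(HIDDEN_RULE.findall(style.string or ""))
        style.extract()

    def is_hidden(tag):
        if HIDDEN_INLINE.search(tag.get("style", "")):
            return True
        return any(name in hidden_classes for name in tag.get("class", []))

    # Decide first, detach afterwards, so the tree is not modified mid-search.
    for tag in [t for t in soup.find_all(True) if is_hidden(t)]:
        tag.extract()

    # Join the remaining text nodes and remove whitespace left by the markup.
    return "".join(soup.get_text().split())


# Shortened version of the cell from the question.
SAMPLE_CELL = (
    '<td><span><style>\n.lLXJ{display:none}\n.V0zO{display:inline}\n</style>'
    '<span style="display: inline">190</span><span class="V0zO">.</span>'
    '<span style="display:none">2</span><div style="display:none">20</div>'
    '<span class="lLXJ">61</span><span class="V0zO">221</span>'
    '<span class="147">.\n</span><span style="display: inline">29</span>'
    '.<span style="display:none">7</span>212</span></td>'
)

print(visible_text(SAMPLE_CELL))  # 190.221.29.212
```

The sketch only understands single-class rules and inline styles. It would miss anything hidden through more complex selectors, external stylesheets, or scripts, which is where a real browser is the more robust choice.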

One workaround is to use Selenium, which drives a real browser and can return only the visible text of an element. For example, this code prints all the IPs in that table:

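from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get('https://www.hidemyass.com/proxy-list')

# all rows of the proxy table; rows[1:] skips the first row (presumably the header)
rows = browser.find_elements(By.XPATH, '//table[@id="listtable"]//tr')
for row in rows[1:]:
    cells = row.find_elements(By.TAG_NAME, 'td')
    # .text returns only the text that is rendered as visible
    print(cells[1].text)

# quit() ends the WebDriver session and closes the browser
browser.quit()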

See also:

BeautifulSoup Grab Visible Webpage Text

Hope that helps.

answered May 02 '14 at 15:22 by alecxe

Thanks for the info. I was really hoping I wouldn't have to use selenium, since it would be pretty clunky for just scraping. – Marco Giancarli May 02 '14 at 15:38
+1: I looked into the Javascript of this site and it's really smart. It targets the specific classes, ids, or other attributes and strips them. I tried circumventing the JS, but too much overhead. Selenium indeed is a much better approach. – WGS May 02 '14 at 15:38
@Nanashi yeah, thanks. I was surprised to see how much is involved in obfuscating the data. You don't see this every day. :) – alecxe May 02 '14 at 15:42
@Gold- note that you can use a "headless" browser with selenium, see [this](http://stackoverflow.com/questions/18539491/headless-browser-and-scraping-solutions) and [this](http://www.realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/). – alecxe May 02 '14 at 15:43
@alecxe: Well, if you're a site that provides a service to obfuscate IP tracking, then I'd be more secure in the thought that you even thought of obfuscating your site! Pretty smart. I have to consider an event like this in the future when I scrape a site. ;) – WGS May 02 '14 at 16:43

score 0 · Answer 2 · answered May 03 '14 at 01:22

Some time ago I used this code to parse the Hidemyass.com markup (it is Perl, and parsing HTML with regular expressions is a bad approach in general):

sub find_ip {

  my ($html) = @_;
  my $ip;

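  # the <style> block spans several lines, so /s lets "." match newlines
  my ($style_section) = $html =~ m{<style>(.+?)</style>}s;

  # class names that the stylesheet hides with display:none
  my (@bad_styles) = $style_section =~ m/

    \.(\w+)\s*\{display:\s*none\}
  /isxg;

  my $bad_styles = join("|", @bad_styles);

  # remove all divs, inline-hidden spans, the style block, the outer span,
  # and spans with a hidden class; then strip whatever tags remain
  $html =~ s{<div .+? </div>}{}isxg;
  $html =~ s{<span style="display:none">.+?</span>}{}g;
  $html =~ s{<style>.+?</style>}{}s;
  $html =~ s{^<span>|</span>$}{}g;
  $html =~ s{<span class="(?:$bad_styles)">.+?</span>}{}g;
  $html =~ s{</?[^>]+>}{}g;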

  $ip = $html;

  return $ip;
}
