
I have two computers, both running 64-bit Windows 7. One machine has 32-bit Python and the other has 64-bit Python. Both machines have 8GB of RAM.

I'm using BeautifulSoup to scrape a webpage, but I've been running into issues on my 64-bit Python machine. I've narrowed it down to this: `len(str(BeautifulSoup(requests.get('http://www.sampleurl.com').text)))` returns only 92520 characters on 64-bit Python, but for the same static site it returns 135000 characters on my 32-bit Python machine.

At some point in the past, the machine that now runs 64-bit Python had 32-bit Python installed. I uninstalled it and installed the 64-bit version because I was having issues installing scipy with `pip install` (it turns out the 32-bit build wasn't the problem).

Anyway, I'm unsure why my 64-bit Python machine isn't returning the entire HTML string. Can anyone help me understand what is going on and how I can fix it?

Martijn Pieters
exhoosier10

1 Answer


This is not a 32-bit / 64-bit issue. You most likely have a parser issue; one machine using `lxml` vs. `html.parser` on the other, for example.

Different parsers deal differently with broken HTML, and lxml is the default only when installed.

See for example:

etc.

Run `import lxml` on both machines to verify. When you replaced your Python installation on one machine with a 64-bit version, you likely didn't include a compatible `lxml` version.
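A quick way to run that check on both machines is sketched below. It only reports whether each parser backend can be found by the interpreter, using the standard library's `importlib.util.find_spec`; the function name `available_parsers` is made up for this illustration, and `html5lib` is included only as another common backend.

```python
import importlib.util


def available_parsers():
    """Report which common BeautifulSoup parser backends can be found."""
    # html.parser ships with the standard library, so it is always present.
    found = {"html.parser": True}
    for name in ("lxml", "html5lib"):
        # find_spec returns None when the top-level package is not installed.
        found[name] = importlib.util.find_spec(name) is not None
    return found


if __name__ == "__main__":
    for parser, present in available_parsers().items():
        print(parser, "available" if present else "missing")
```

If the two machines print different results for `lxml`, that would be consistent with the parser explanation above.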

Martijn Pieters
  • I've installed `lxml` on the 32-bit Python machine and now both machines are limiting the output to the 92520 string length. – exhoosier10 Feb 19 '15 at 20:54
  • @exhoosier10: and do you have `lxml` installed on either? You can explicitly switch between parsers; pass in `'lxml'` or `'html.parser'` as the second argument and compare the outputs. – Martijn Pieters Feb 19 '15 at 20:56
  • Using `'html.parser'` worked. Thanks. I wasn't able to find anything useful during my search, but all of the links you provided made sense. Is the root cause of this just poorly coded HTML on the website's part? – exhoosier10 Feb 19 '15 at 20:59
  • @exhoosier10: almost invariably, yes. I have seen problems on certain Ubuntu installations where `lxml`, or rather its dependency `libxml2`, is not working quite right, but without more information I cannot say if that's the case with your Windows setups. – Martijn Pieters Feb 19 '15 at 21:01
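
The comparison suggested in the comments, passing the parser name explicitly as the second argument, might look like the sketch below. The helper `parsed_lengths` and the sample markup are hypothetical; for the real page you would pass in the text of the downloaded response instead. Parsers that are not installed are reported as `None` rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parsed_lengths(html, parsers=("html.parser", "lxml")):
    """Parse the same markup with each named parser and return output lengths."""
    lengths = {}
    for name in parsers:
        try:
            # The second argument selects the parser explicitly instead of
            # letting BeautifulSoup pick whichever one it finds installed.
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # Raised when the requested parser is not available here.
            lengths[name] = None
            continue
        lengths[name] = len(str(soup))
    return lengths


if __name__ == "__main__":
    # Hypothetical broken markup: unclosed tags and a stray closing tag.
    sample = "<html><body><p>first<p>second</div><b>bold</body>"
    for parser, length in parsed_lengths(sample).items():
        print(parser, length)
```

Differing lengths for the same input would point at the parsers repairing the broken HTML differently, not at the Python build.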