I am getting familiar with Python and am struggling to do the following with BeautifulSoup.

What is expected:

If the output of the script below contains the string 5378, it should email me the line in which the string appears.

#! /usr/bin/env python

from bs4 import BeautifulSoup
from lxml import html
import urllib2,re

import codecs
import sys
streamWriter = codecs.lookup('utf-8')[-1]
sys.stdout = streamWriter(sys.stdout)

BASE_URL = "http://outlet.us.dell.com/ARBOnlineSales/Online/InventorySearch.aspx?c=us&cs=22&l=en&s=dfh&brandid=2201&fid=111162"

webpage = urllib2.urlopen(BASE_URL)
soup = BeautifulSoup(webpage.read(), "lxml")
findcolumn = soup.find("div", {"id": "itemheader-FN"})
name = findcolumn.text.strip()
print name

I tried using findall(5378, name), but it returns empty brackets, like this: [].

  • I also run into Unicode issues when I pipe the output to grep.

$ python dell.py | grep 5378
Traceback (most recent call last):
  File "dell.py", line 18, in <module>
    print name
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 817: ordinal not in range(128)

Can someone tell me what I am doing wrong in both cases?

deppfx

1 Answer

The function findall (from the re module) expects its first parameter to be a regular expression, given as a string, but you passed an integer. Try this instead:

re.findall("5378", name)

Printing the result shows [u'5378'] if there is a match, or [] if there is none.
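To make the return values concrete, here is a minimal sketch using only the re module. The sample text is made up and stands in for the scraped name value; on Python 3 the results print without the u prefix.

```python
import re

# Made-up sample text standing in for the scraped `name` value.
name = u"Inspiron 13 5378 2-in-1, 13.3\u201d touch display"

found = re.findall("5378", name)         # one list entry per match
missing = re.findall("9999", name)       # empty list when nothing matches
from_int = re.findall(str(5378), name)   # convert a number to a string first

print(found)
print(missing)
print(from_int)
```

If the number you are looking for arrives as an integer, converting it with str() before handing it to re.findall avoids the problem.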

I suspect you want to retrieve the product name that goes with the number, which means you have to iterate over the elements inside findcolumn. We can use re.search() here to check for a match within each element's text.

for input_element in findcolumn.find_all("div"):
    # .text is already a unicode string, so no unicode() conversion is needed
    name = input_element.text.strip()
    if re.search("5378", name) is not None:
        print name
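The question also asks for the matching line to be emailed. As a rough sketch only: the snippet below filters plain text for lines containing the number and builds a message with the standard library. The helper names (matching_lines, build_alert, send_alert), the example.com addresses and the "localhost" SMTP host are all assumptions for illustration, and the sample text stands in for the scraped output. It is written with Python 3 in mind; the modules used also exist on Python 2, but I have not run it there.

```python
from email.mime.text import MIMEText
import smtplib

NEEDLE = "5378"

def matching_lines(text, needle=NEEDLE):
    """Return the stripped lines of `text` that contain `needle`."""
    return [line.strip() for line in text.splitlines() if needle in line]

def build_alert(lines, sender, recipient):
    """Compose a UTF-8 plain-text message listing the matching lines."""
    msg = MIMEText(u"\n".join(lines), "plain", "utf-8")
    msg["Subject"] = "Found %s in %d line(s)" % (NEEDLE, len(lines))
    msg["From"] = sender
    msg["To"] = recipient
    return msg

def send_alert(msg, host="localhost"):
    """Send the message; `host` is an assumption about your mail setup."""
    server = smtplib.SMTP(host)
    try:
        server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    finally:
        server.quit()

# Made-up sample standing in for the scraped text.
sample = u"Inspiron 13 5378 2-in-1\nLatitude 7480\n"
hits = matching_lines(sample)
if hits:
    alert = build_alert(hits, "me@example.com", "me@example.com")
    # send_alert(alert)  # uncomment once an SMTP server is reachable
```

The sending step is left commented out because it depends on a mail server that may not exist on your machine.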

As for the Unicode error, there are several solutions, depending on your operating system and configuration: reconfigure your system locale (on Ubuntu, for example), or encode your script's output explicitly with .encode()/unicode().
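A small sketch of the explicit-encoding route, using a made-up string that contains the same u'\u201d' character as the traceback. It shows why the ascii codec fails and what encoding to UTF-8 produces instead. Another option that may work without touching the script is setting the PYTHONIOENCODING environment variable, e.g. PYTHONIOENCODING=utf-8 python dell.py | grep 5378.

```python
# Made-up sample containing the right double quotation mark (U+201D).
text = u"13.3\u201d Inspiron 5378"

try:
    text.encode("ascii")            # what an ASCII-only stdout effectively does
except UnicodeEncodeError as exc:
    print("ascii failed: " + exc.reason)

encoded = text.encode("utf-8")      # bytes that are safe to write to a pipe
lossy = text.encode("ascii", "replace")  # alternative: swap unknown chars for '?'
```

Which of the two encodings to use depends on whether the consumer of the output (grep here) can handle UTF-8; the "replace" variant loses the original character.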

chrki
  • Ty. I set the encoding at the OS level. It works! When I run it, it shows `[u'5378']` but I want the entire line to be printed instead of just `[u'5378']`. How can I do this? – deppfx Nov 28 '16 at 20:41
  • It works, thank you! Why do you have to call `findcolumn.find_all("div"):` again rather than just use `findcolumn`, given that we already have `findcolumn = soup.find("div", {"id": "itemheader-FN"})`? I tried `for input_element in findcolumn:` and am getting this error: `AttributeError: 'NavigableString' object has no attribute 'text'`. Can you please explain? – deppfx Nov 29 '16 at 23:20
  • @deppfx `findcolumn` contains further child `div` elements, each holding one laptop product name; I use `find_all` to collect those into a list. Looping over `findcolumn` directly walks its immediate children instead, and those include bare text nodes (Beautifulsoup `NavigableString` objects) as well as tags. A `NavigableString` has no `.text` attribute, which is why you get that error. – chrki Nov 30 '16 at 07:07
  • I would double upvote you if I could. Thanks for explaining it in detail. :-) I am going to read and learn about what `NavigableString` in beautifulsoup is about. And why I can't do a `soup.find_all` in line 14 itself (it didn't work). – deppfx Nov 30 '16 at 20:11