I was trying to use Selenium to scrape the titles of articles on this website: http://www.ncbi.nlm.nih.gov/pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)

#coding="utf-8"

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

domain = "http://www.ncbi.nlm.nih.gov/"
url_tail = "pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)"
url = domain + url_tail

browser = webdriver.Firefox()
browser.get(url)
time.sleep(5)

def extract_data(browser):
    titles = browser.find_elements_by_css_selector("div.rprt div.rslt p.title a")
    return [title.text for title in titles]

page_start = 1
page_end = 10

f = open('titles.txt', 'a')
for page in range(page_start, page_end):
    print "page %d" % page
    page_jump_box = browser.find_element_by_class_name("num").clear()
    page_jump_box_cleared = browser.find_element_by_class_name("num")
    page_jump_box_cleared.send_keys(str(page) + Keys.RETURN)

    time.sleep(15)

    f = open('titles.txt', 'a')
    for line in extract_data(browser):
        f.write(line + '\n')

f.close()

When I run it, I get this:

gao@gao:~/crawler$ python crawler3.0.py 
page 1
page 2
page 3
page 4
Traceback (most recent call last):
  File "crawler3.0.py", line 33, in <module>
    f.write(line + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 36: ordinal not in range(128)

When I searched on Stack Overflow, I found a similar question: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128). I learned that calling str() can cause this Unicode problem, but in my code I only use str() to convert the page number to a string. So how should I correct the code?

Here is a second question. I've learned that to use PhantomJS with Selenium, I only need to change browser = webdriver.Firefox() to browser = webdriver.PhantomJS(), but when I do this, the scraped content is repeated (only the titles from page 1 are scraped).

I'm not a native English speaker, so if there are any grammar or other mistakes, please let me know.

Thanks in advance.


1 Answer

You need to encode the line before writing to the file:

for line in extract_data(browser):
    f.write(line.encode('utf-8') + '\n')
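Some background on why this helps: in Python 2, writing a unicode string to a file opened with the built-in open() makes Python encode it implicitly with the ascii codec, which fails on characters such as u'\u03b1'. As an alternative to calling .encode('utf-8') on every line, here is a minimal sketch that opens the file with io.open() and an explicit encoding. The helper name append_titles and the sample titles are made up for illustration and are not part of the original script.

```python
# -*- coding: utf-8 -*-
import io


def append_titles(path, titles):
    """Append unicode titles to `path`, one per line, encoded as UTF-8.

    `append_titles` is a hypothetical helper, not part of the original script.
    """
    # io.open() encodes unicode text on write, so no manual .encode() is needed.
    with io.open(path, 'a', encoding='utf-8') as f:
        for title in titles:
            f.write(title + u'\n')


# Made-up sample data; the first title contains the Greek letter alpha (u'\u03b1'),
# the same character that triggered the UnicodeEncodeError in the traceback.
sample_titles = [u'Role of TNF-\u03b1 in inflammation', u'A plain ASCII title']

if __name__ == '__main__':
    append_titles('titles.txt', sample_titles)
```

With this approach the loop body would become f.write(line + u'\n'), provided f was opened with io.open(..., encoding='utf-8').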

As for your second issue, I suggest the following improvements (that would make it work):

  • use Explicit Waits instead of time.sleep() calls; this would also dramatically improve performance
  • instead of typing the page number, click the "Next" button
  • open the file in "append" mode once, before the loop, using the with context manager
  • close() the browser after you are done
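As an aside on the first bullet, the sketch below illustrates the general idea behind an explicit wait: poll a condition and return as soon as it holds, instead of always sleeping for a fixed time. The function wait_until is a hypothetical helper written for illustration; it is not Selenium's implementation, although WebDriverWait polls in a similar spirit.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Hypothetical helper for illustration only. Returns the truthy value,
    or raises RuntimeError once `timeout` seconds have passed.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            # The condition holds, so return immediately without further sleeping.
            return result
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)
```

With a helper like this, a page that is ready after one second costs about one second, whereas time.sleep(15) always costs fifteen. A call such as wait_until(lambda: browser.find_elements_by_css_selector("div.rprt p.title")) would be one way to use it, assuming browser is the driver from the question.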

The code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

domain = "http://www.ncbi.nlm.nih.gov/"
url_tail = "pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)"
url = domain + url_tail

browser = webdriver.PhantomJS()
browser.get(url)


def extract_data(browser):
    titles = browser.find_elements_by_css_selector("div.rprt div.rslt p.title a")
    return [title.text for title in titles]


page_start, page_end = 1, 10

with open('titles.txt', 'a') as f:
    for page in range(page_start, page_end):
        WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.rprt p.title"))
        )

        for line in extract_data(browser):
            f.write(line.encode('utf-8') + '\n')

        print "page %d" % page

        browser.find_element_by_css_selector("div.pagination a.next").click()

browser.close()

This produces titles.txt with titles from the result pages 1-9:

Robotic-assisted tubal anastomosis with one-stitch technique.
Effectiveness of pictorial health warnings on cigarette packs among Lebanese school and university students.
...
Importance and globalization status of good manufacturing practice (GMP) requirements for pharmaceutical excipients.
Systemic review on drug related hospital admissions - A pubmed based search.
alecxe
  • Thanks. It works. But when I change `browser = webdriver.Firefox()` to `browser = webdriver.PhantomJS()`, it still scrapes repeated content. Do you know why? –  Feb 24 '15 at 17:01
  • @TongfeiGao OK, well, it is a separate problem, let me have a look. – alecxe Feb 24 '15 at 17:03