I am trying to use Selenium to scrape the titles of the papers listed on this page: http://www.ncbi.nlm.nih.gov/pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)
#coding="utf-8"
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
domain = "http://www.ncbi.nlm.nih.gov/"
url_tail = "pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)"
url = domain + url_tail
browser = webdriver.Firefox()
browser.get(url)
time.sleep(5)
def extract_data(browser):
    titles = browser.find_elements_by_css_selector("div.rprt div.rslt p.title a")
    return [title.text for title in titles]
page_start = 1
page_end = 10
f = open('titles.txt', 'a')
for page in range(page_start, page_end):
    print "page %d" % page
    page_jump_box = browser.find_element_by_class_name("num").clear()
    page_jump_box_cleared = browser.find_element_by_class_name("num")
    page_jump_box_cleared.send_keys(str(page) + Keys.RETURN)
    time.sleep(15)
    f = open('titles.txt', 'a')
    for line in extract_data(browser):
        f.write(line + '\n')
    f.close()
When I run it, I get this:
gao@gao:~/crawler$ python crawler3.0.py
page 1
page 2
page 3
page 4
Traceback (most recent call last):
  File "crawler3.0.py", line 33, in <module>
    f.write(line + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 36: ordinal not in range(128)
When I searched Stack Overflow, I found a similar question: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128).
From it I learned that calling str() on a unicode value can cause this problem, but in my code I only use str() to turn the page
number into a string. So how should I correct the code?
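For reference, here is a minimal sketch of what seems to be happening, without Selenium. In Python 2, write() on a file opened with the built-in open() implicitly encodes unicode text with the ASCII codec, so the error can appear even where str() is never called on the title. The sample title and the append_titles helper are hypothetical and not part of my script; opening the file with io.open and an explicit UTF-8 encoding is one possible workaround, not necessarily the only one.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical title containing U+03B1 (Greek alpha), like the one in the traceback.
title = u"TNF-\u03b1 signalling"

# Encoding to ASCII is what Python 2 does implicitly when unicode text is
# written to a file opened with the built-in open(); it fails for U+03B1.
try:
    title.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True


def append_titles(path, titles):
    """Append unicode titles to path, one per line, encoded as UTF-8."""
    # io.open accepts an encoding argument on both Python 2 and Python 3.
    with io.open(path, "a", encoding="utf-8") as f:
        for t in titles:
            f.write(t + u"\n")


# Write to a temporary file so the sketch does not touch titles.txt.
path = os.path.join(tempfile.mkdtemp(), "titles.txt")
append_titles(path, [title])

with io.open(path, encoding="utf-8") as f:
    saved = f.read()
```

In the scraper, the same idea would mean replacing open('titles.txt', 'a') with io.open('titles.txt', 'a', encoding='utf-8') and writing line + u'\n', assuming title.text returns unicode.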
I also have a second question. I've learned that to use PhantomJS with Selenium, I only need to change browser = webdriver.Firefox()
into browser = webdriver.PhantomJS().
But when I do this, the scraped content is repeated (only the titles from page 1 are scraped).
I'm not a native English speaker, so if there are any grammar or other mistakes, please let me know.
Thanks in advance.