
I am trying to crawl links from a website and store them in a text file. There are about 1,000 links I need to crawl, but it gives an error on about 24 of them. I am very new to web crawling and would appreciate some help.

If I remove the try/except statement, I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

I have tried the solutions from all the similar questions, so this is not a duplicate.

This is the line that throws the error:

fobj.write(link.text + "\n")
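
For reference, the error is easy to reproduce in Python 2 (which the u'\u201c' in the message indicates). Writing a unicode string to a file opened with plain open() makes Python encode it implicitly with the default ascii codec. This is just a sketch; the file name out.txt and the sample text are made up for illustration:

text = u'\u201cEconomics\u201d'  # sample unicode string with curly quotes (U+201C/U+201D)
f = open('out.txt', 'a')         # byte-oriented file, no encoding specified
f.write(text + '\n')             # implicit ascii encode -> UnicodeEncodeError
f.close()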

import requests
from bs4 import BeautifulSoup

url = "https://tools.wmflabs.org/enwp10/cgi-bin/list2.fcgi?run=yes&projecta=Economics&namespace=&pagename=&quality=&importance=&score=&limit=1000&offset=1&sorta=Importance&sortb=Quality"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
links = soup.findAll('a', href=True)
b = "https://en.wikipedia.org/"
fobj = open("file.txt", 'a')    # article titles
fobj2 = open("links.txt", 'a')  # article URLs
for link in links:
    try:
        a = link["href"].encode('utf8')
        if b in a:
            fobj.write(link.text + "\n")  # raises UnicodeEncodeError for some titles
            fobj2.write(a + "\n")
    except:  # the bare except swallows the UnicodeEncodeError, so those links are skipped
        print("error")
fobj.close()
fobj2.close()
  • Is `fobj.write(link.text + "\n")` the line that throws the exception? – kmaork Jun 22 '16 at 09:44
  • Yes, when I removed that line the error was gone. But I need to write that text to the file. – Mr. X Jun 22 '16 at 09:49
  • It happens because `link.text` is a unicode string; when you write it to the file, Python automatically encodes it with `ascii`, which can't represent some of the characters in that string. – kmaork Jun 22 '16 at 09:52
  • Thanks a lot. It worked after I changed `link.text` to `link.text.encode('utf8')`. – Mr. X Jun 22 '16 at 09:54
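
For completeness, a minimal sketch of the loop with the fix from the comments applied; only the failing line changes, everything else is as in the question:

for link in links:
    try:
        a = link["href"].encode('utf8')
        if b in a:
            fobj.write(link.text.encode('utf8') + "\n")  # encode the unicode title before writing
            fobj2.write(a + "\n")
    except:
        print("error")

Alternatively, opening the files with io.open("file.txt", "a", encoding="utf8") would let the script write unicode strings directly, without calling .encode() on every write.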

0 Answers