
I am learning webscraping via BeautifulSoup and Python. My first project is to extract certain recipes from cookpad.hu. I was successfully able to extract but now I'm having troubles with actually writing them to a file (csv is all I know how to do), due to this error:

Traceback (most recent call last):
  File "cookpad_scrape.py", line 24, in <module>
    f.writerow(about_clean)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

My code is below. I am using Python 2.7.14 on Ubuntu. A pastebin of the webpage is here, but the webpage itself is this.

I'm assuming it can't write the Hungarian letters? I'm sure there is a terribly simple solution I am overlooking.
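The error itself has nothing to do with scraping: Python 2's default `ascii` codec simply cannot represent `á` (the `u'\xe1'` named in the traceback). A minimal reproduction of the same failure, using an explicit `ascii` encode so it triggers on Python 3 as well:

```python
# 'á' is U+00E1, the character the traceback complains about.
s = u"\xe1"

try:
    s.encode("ascii")  # what Python 2's csv module attempts implicitly
except UnicodeEncodeError as e:
    print(e)
# → 'ascii' codec can't encode character '\xe1' in position 0: ordinal not in range(128)
```

Encoding to UTF-8 (or running under Python 3, where `csv` writes text natively) avoids the problem.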

import requests
from bs4 import BeautifulSoup 
import csv 

'''
Tree of page:
    <div id="recipe main">
        <div id="editor" class="editor">
            <div id="about">
            <section id="ingredients">
            <section id="steps">
'''
#text only: soup.get_text()

page = requests.get('https://cookpad.com/hu/receptek/5040119-parazson-sult-padlizsankrem')
soup = BeautifulSoup(page.text, 'lxml')

f = csv.writer(open('recipes.csv', 'w')) #create and open file in f variable, using 'w' mode
f.writerow(['Recipe 1']) #write top row headings

about = soup.find(id='about')
about_ext = about.p.extract()
about_clean = about_ext.get_text()
f.writerow(about_clean)

ingredients = soup.find(id='ingredients')
ingredients_ext = ingredients.ol.extract()
ingredients_clean = ingredients_ext.find_all(itemprop='ingredients')
#for ingredient in ingredients_clean:

steps = soup.find(id='steps')
steps_p = steps.find_all(itemprop='recipeInstructions')
for step in steps_p:
    extracted = step.p.extract()
    print(extracted.text)
    f.writerow([extracted])

Solution: run the script with Python 3, not 2: `python3 my_script.py`

New problem: exporting the scrapes gives good results for the steps, but in the about and ingredients sections each letter is separated by commas.
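The comma-between-letters symptom comes from passing a bare string to `writerow`, which treats it as a sequence of one-character fields. A minimal sketch of both behaviours, written to an in-memory buffer rather than a file:

```python
import csv
import io

buf = io.StringIO()
w = csv.writer(buf)

w.writerow("krém")    # wrong: a string is iterated character by character
w.writerow(["krém"])  # right: a one-element list makes a one-column row

print(buf.getvalue())
# k,r,é,m
# krém
```

Wrapping each string in a list, as the code already does for the steps, fixes the about and ingredients rows too.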

sc4s2cg
  • Is this Python 2 or 3? (And, if 3, what 3.x version, what platform are you on, and what locale if Linux/what OEM codepage if Windows?) – abarnert May 29 '18 at 20:55
  • Please include the entire stack trace, not just the error message. It shows which line is in error. Also, what version of python, 2 or 3? – tdelaney May 29 '18 at 20:56
  • Also, please give us the entire exception—with traceback—rather than just the description string. I can guess that it's _probably_ one of the `writerow` calls that causes this, but the exception will tell us exactly which line. – abarnert May 29 '18 at 20:56
  • Finally, if you can give us a _complete_ (but minimal) HTML tree, instead of just a fragment of one that can't be parsed, we could actually run and debug your code ourselves. Please read [mcve] in the help for more guidelines on what to include in a question. – abarnert May 29 '18 at 20:57
  • As a side note: it looks like you're trying to write a CSV with just a single column, whose values are just simple strings that aren't going to include newlines or other control characters? If so, you really don't need a CSV; you can just write lines directly to the file. (If there might be newlines or other control characters in your data, ignore this comment.) – abarnert May 29 '18 at 20:59
  • Is there a reason you're using Python 2.7? If you're just starting out learning in 2018—and especially if you need to deal with non-ASCII text—learning 3.6 will be a whole lot easier. – abarnert May 29 '18 at 21:01
  • Is Python 2.7 a requirement? Python 3 has much better Unicode support. It's been out for nearly a decade. Only use 2.x if you have a specific requirement to do so. – tdelaney May 29 '18 at 21:01
  • @abarnert - I'll leave it to you to advocate! – tdelaney May 29 '18 at 21:02
  • I'm not sure why, but sudo apt-get in Ubuntu brought me python 2.7. I can definitely upgrade to 3.6! Also added some more info OP. – sc4s2cg May 29 '18 at 21:04
  • You can almost solve this problem in 2.7 by using `io.open` or `codecs.open` to create a Python 3-style Unicode-aware text file—but that will still have problems because Python 2.7's `csv` module doesn't do Unicode right. So, you have to go into [the examples in the docs](https://docs.python.org/2/library/csv.html#examples) to copy all that "recoder" code and then learn how to use it. Much easier to just use 3.6 (`sudo apt-get python3` should do it… or use a later Ubuntu), where `csv` already works. – abarnert May 29 '18 at 21:05
  • If you want to do it the hard way, I can find a duplicate question here that shows you how to use the recoder stuff, but it really isn't worth learning if you don't need to. – abarnert May 29 '18 at 21:06
  • Oh my goodness. I just found out that python3 has "python3" as its call function, not "python". So I had python3 all along, but installed bs4 for python2 via sudo apt-get install python. The error is fixed after using python3, however writing to file gets weird. Each letter is separated by commas like so: https://imgur.com/a/jdcX5Wm – sc4s2cg May 29 '18 at 21:20
  • See post: [How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?](https://stackoverflow.com/questions/20205455/how-to-correctly-parse-utf-8-encoded-html-to-unicode-strings-with-beautifulsoup?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa) – Sahand Aslani May 29 '18 at 21:34
  • Your new error is completely different from, and unrelated to, the one you're asking about—but it's also a much simpler one to fix. One of your `writerow` calls is trying to write a string instead of a list of strings, and a string acts like a list of one-character strings. If you want to fix that, just call `writerow([about_clean])` instead of `writerow(about_clean)`, the same as you're already doing for the other write calls. – abarnert May 29 '18 at 21:35
  • @SahandAslani No, the OP here is not having a problem parsing HTML that falsely claims to be UTF-8, he's correctly parsing and decoding it, and then having a problem writing it to a CSV file (because Python 2's `csv` module doesn't do Unicode). – abarnert May 29 '18 at 21:36
  • @abarnert thanks a ton for your help! The brackets did the trick, and now I have a comprehensible recipe scraped from the site. There are some formatting issues (quotation marks, ingredients having an extra "enter" between them), but I'm sure I can figure that out or just use python to format the document a little. – sc4s2cg May 29 '18 at 22:10
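The leftover formatting issues the last comment mentions (an extra blank line between ingredients, ragged spacing) are typical of `get_text()` output; collapsing whitespace line by line handles both. The sample string here is a hypothetical stand-in for the scraped text:

```python
# Hypothetical get_text() output: blank lines and uneven spacing.
raw = "2 db  padlizsán\n\n1 gerezd   fokhagyma\n"

# Keep one cleaned line per non-empty input line.
lines = [" ".join(part.split()) for part in raw.splitlines() if part.strip()]
print(lines)
# → ['2 db padlizsán', '1 gerezd fokhagyma']
```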

1 Answer


You're running Python 2. On line 24 (per the traceback) you're writing out the contents of the `about_clean` variable; you need to encode this value first:

f.writerow(about_clean.encode("utf-8"))
Bartek
  • This doesn't help, because he has multiple `writerow` calls, and because each one takes a _list_ of strings, not a single string. (Because that's the whole _point_ of `writerow`.) – abarnert May 29 '18 at 21:06
  • Then he might use: `f.writerow([v.encode("utf-8") for v in about_clean])` And maybe create a utility function. There was also unicodecsv module, I believe. – Bartek May 29 '18 at 21:30
  • I don't know of a `unicodecsv` module—although there probably is at least one. (I do know of a `utf8csv` module that wraps up the stuff in the docs examples and then simplifies them to only handle UTF-8, because I wrote it… but I honestly think it's better to read the examples in the docs, because you're never going to expand or debug Unicode CSV code in 2.x without understanding what those examples are doing…) – abarnert May 29 '18 at 21:33
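The per-field encode from the comment above, shown with hypothetical field values. Under Python 2 the resulting byte strings can be passed straight to `csv.writer.writerow` (Python 3's `csv` wants text instead, so there this step is unnecessary):

```python
# Hypothetical scraped fields containing non-ASCII Hungarian letters.
row = [u"padlizsán", u"fokhagyma"]

# Python 2's csv module handles byte strings, not unicode, so each
# field is encoded to UTF-8 before the row is written.
encoded = [v.encode("utf-8") for v in row]
print(encoded)
# → [b'padlizs\xc3\xa1n', b'fokhagyma']  (u'padlizs\xe1n' etc. under Python 2)
```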