I've written a BeautifulSoup script that scrapes Japanese HTML. Everything seems to work and I get no error messages. When I print the result I get:

連鎖に打ち勝たねばならない」と述べ拍手を浴び etc

But in the same script, when I save the output to a CSV file I get:

\u5ddd\u3001\u6ce2\u4f50\u5834\uff13\u7279\u6d3e\u54e1\u304c\u8a71\u3057\u5408 etc

I assume the problem is in the write-to-csv part of the code, but I can't figure out what to do.

Here's the code:

import csv
import os

from bs4 import BeautifulSoup

def processData( pageFile ):
    f = open(pageFile, "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page, 'html.parser')
    metaData = soup.find_all("div", {'class': 'detail001'})
    one = [ ]
    for html in metaData:
        text = BeautifulSoup(str(html).strip().replace("\n", ""),features="html.parser")
        text = text.get_text()
        one.append(text.strip())
    csvfile = open(dir2 + ".csv".encode("utf-8"), 'ab')
    writer = csv.writer(csvfile)
    for ones in zip(one):
        writer.writerow([one])
    csvfile.close()
dir1 = "/home/sveisa/"
dir2 = "test2"
dir = dir1 + dir2
csvFile = dir2 + ".csv"
csvfile = open(csvFile.encode("utf-8"), 'w')
writer = csv.writer(csvfile)
writer.writerow(["one"])
csvfile.close()
fileList = os.listdir(dir)
totalLen = len(fileList)
for htmlFile in fileList:
    path = os.path.join(dir, htmlFile)
    processData(path)

I'm using Ubuntu.

Kasi

1 Answer

The issue is the encoding: pass `encoding=` when you open the CSV file, as follows:

import csv

with open("data.csv", 'w', encoding="UTF-8", newline="") as f:
    writer = csv.writer(f)
    # writerow expects a sequence of fields, so wrap the string in a list
    writer.writerow(
        ["\u5ddd\u3001\u6ce2\u4f50\u5834\uff13\u7279\u6d3e\u54e1\u304c\u8a71\u3057\u5408"])

Output Content:

川、波佐場３特派員が話し合
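
For the scraping case in the question, the same idea can be sketched as below. This is only a minimal sketch, assuming Python 3. The helper names `write_rows` and `read_rows` and the temporary file path are made up for illustration, and the sample strings are taken from the question's output. Each scraped string is written as its own one-column row, and the file is read back to check that the text round-trips.

```python
import csv
import os
import tempfile


def write_rows(path, texts):
    """Write each string in `texts` as its own one-column CSV row (UTF-8)."""
    # newline="" is what the csv module documentation recommends for file objects
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["one"])  # header row, as in the question
        for text in texts:
            writer.writerow([text])  # one field per row, not the whole list


def read_rows(path):
    """Read the CSV back as a list of rows, decoding it as UTF-8."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))


# Sample strings copied from the question; the path is a throwaway temp file.
texts = ["川、波佐場３特派員が話し合", "連鎖に打ち勝たねばならない」と述べ拍手を浴び"]
path = os.path.join(tempfile.mkdtemp(), "test2.csv")
write_rows(path, texts)
rows = read_rows(path)
```

Passing a one-element list to `writerow` matters because `writerow` treats its argument as a sequence of fields; a bare string would be split into one field per character.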
  • Thank you! I get a TypeError when I try this. I use Python 2.7, is that why? – Kasi Apr 12 '20 at 11:15
  • @Isak which error are you getting? You should indeed consider moving to Python 3, as Python 2 has already reached end of life, but the error is not about the Python version. Could you show the error? – αԋɱҽԃ αмєяιcαη Apr 12 '20 at 11:17
  • Thanks! I get this error: TypeError: 'encoding' is an invalid keyword argument for this function – Kasi Apr 12 '20 at 11:20
  • @Isak [edit] your post and include the code you are currently using. As far as I can see, you inserted the parameter in the wrong place. – αԋɱҽԃ αмєяιcαη Apr 12 '20 at 11:21
  • @Isak it should look like this: `open(csvFile, 'w', encoding="UTF-8")`. Also note that `w` is different from `wb`, which means "write bytes". – αԋɱҽԃ αмєяιcαη Apr 12 '20 at 11:23
  • Thanks, I updated the code above and included the encoding, which gives me a TypeError on `csvfile = open(csvFile, 'w', encoding="UTF-8")`: TypeError: 'encoding' is an invalid keyword argument for this function – Kasi Apr 12 '20 at 11:32
  • @Isak Hmm, yes, this is down to Python 2. See [this question](https://stackoverflow.com/questions/25049962/is-encoding-is-an-invalid-keyword-error-inevitable-in-python-2-x) – αԋɱҽԃ αмєяιcαη Apr 12 '20 at 11:35
  • Ah ok, I'll try with Python 3 – Kasi Apr 12 '20 at 11:42
  • I get the same error now that I run the code with Python 3 :/ – Kasi Apr 12 '20 at 11:47
  • The code in my original post now includes UTF-8 encoding and I get no error messages, but the output in the CSV is still gibberish – Kasi Apr 12 '20 at 12:01
  • I got it to work now, after the edits you suggested, and after opening it as UTF-16 instead of UTF-8. Thank you so much for your help! – Kasi Apr 12 '20 at 12:08
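
The thread does not say which program was used to open the CSV, so the following is only a guess at a related situation: some viewers do not detect plain UTF-8 on their own. A common workaround is to write a byte order mark (BOM) with Python 3's `utf-8-sig` codec. The sketch below assumes Python 3, uses a made-up temporary file name, and checks that the BOM is present and that the text reads back intact.

```python
import csv
import os
import tempfile

# Hypothetical output path in a throwaway temp directory.
path = os.path.join(tempfile.mkdtemp(), "with_bom.csv")

# "utf-8-sig" writes the UTF-8 BOM (EF BB BF) before the first character.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["川、波佐場３特派員が話し合"])

# Inspect the raw bytes to confirm the BOM is there.
with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(b"\xef\xbb\xbf")

# Reading with "utf-8-sig" strips the BOM again.
with open(path, encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))
```

Whether a BOM helps depends on the viewer; the file content is the same UTF-8 text either way.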