
I scraped a set of web pages that contain articles in Traditional Chinese, Simplified Chinese and English. Storing the text in a variable and printing it works fine, but writing it to files in my folder raises an error. I tried several ways of specifying UTF-8 encoding in open, but it still doesn't work. By the way, I'm running the code in Jupyter on Anaconda.

Code:

import json
import requests
from bs4 import BeautifulSoup

# `all` is the list of article URLs collected earlier
for urls in all:
    re=requests.get(urls)
    soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
    title_tag = soup.select_one('.mms_article_title')
    #print(title_tag.text)
    list=[]
    for tag in soup.select('.mms_article_content'):
        list.append(tag.text)
    list=([c.replace('\n', '') for c in list])
    list=([c.replace('\r', '') for c in list])
    list=([c.replace('\t', '') for c in list])
    list=([c.replace(u'\xa0', u' ') for c in list])
    list= (', '.join(list))
    data={
        "Title" : title_tag.text,
        "Article": list
    }
    save_path= 'C:/json_n/'
    file_name=save_path+'%s.json' % title_tag.text
    with open(file_name, 'w') as f:
        print(file_name)
        file = json.dumps(data,ensure_ascii=False)
        f.write(file)

I have 1700 pages to save, but only 2 file names are printed. Both files are created in the folder "json_n", but only the first JSON file contains data. The second one is empty: its content is in Simplified Chinese and could not be written.

C:/json_n/肝動脈栓塞術.json
C:/json_n/心臟電氣生理學檢查注意事項(簡體中文).json

Error:

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-39-e73321a3e622> in <module>()
     21         print(file_name)
     22         file = json.dumps(data,ensure_ascii=False)
---> 23         f.write(file)

UnicodeEncodeError: 'cp950' codec can't encode character '\u810f' in position 67: illegal multibyte sequence
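
The failure can be reproduced without the scraper. The following is a minimal sketch, assuming only that the character reported in the traceback ('\u810f') is the one that breaks the write; the variable names are illustrative. It tries to encode that character with cp950 (the codec named in the error) and with UTF-8:

```python
# The character named in the traceback above.
ch = "\u810f"

# cp950 is the codec that open() fell back to in the failing run.
try:
    ch.encode("cp950")
    cp950_ok = True
except UnicodeEncodeError:
    cp950_ok = False

# UTF-8 can represent every Unicode code point, so this does not raise.
utf8_bytes = ch.encode("utf-8")

print("cp950 can encode it:", cp950_ok)
print("UTF-8 bytes:", utf8_bytes)
```

If cp950 rejects the character here as well, the problem lies in the default encoding used by open rather than in the scraped data.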

When I set encoding in open:

with open(file_name, 'w', encoding="utf-8") as f:

It still prints only 2 file names, and the second file is still empty.

C:/json_n/肝動脈栓塞術.json
C:/json_n/心臟電氣生理學檢查注意事項(簡體中文).json

Error:

OSError                                   Traceback (most recent call last)
<ipython-input-44-256bcf14fcbe> in <module>()
     18     save_path= 'C:/json_n/'
     19     file_name=save_path+'%s.json' % title_tag.text
---> 20     with open(file_name, 'w', encoding="utf-8") as f:
     21         print(file_name)
     22         file = json.dumps(data,ensure_ascii=False)

OSError: [Errno 22] Invalid argument: 'C:/json_n/如何使用胰島素空針抽取短效型胰島素?.json'
Makiyo
  • I assume 'cp950' means Code page 950, which is not Unicode. Perhaps that's the source of the trouble. – zindorsky Oct 19 '17 at 16:05
  • Maybe try adding the parameter `encoding="utf-8"` to the open call. – zindorsky Oct 19 '17 at 16:07
  • @zindorsky but this error happened on the second page – Makiyo Oct 19 '17 at 16:11
  • @zindorsky I did, still not working. – Makiyo Oct 19 '17 at 16:12
  • Totally off topic, but please don't use `list` as a variable name. – Mark Ransom Oct 19 '17 at 16:53
  • Opening the file with `encoding=` is the right way to go. It looks like Windows itself is complaining about the file name, you should print the name *before* you try to open the file. – Mark Ransom Oct 19 '17 at 16:54
  • @MarkRansom Hi Mark! Thanks for your advice. I did that, but it still stops at the third file. Is there a way to skip it, or to change file_name into something that can be opened? – Makiyo Oct 19 '17 at 17:17
  • Looking more closely at the filename that fails, I see there's a `?` in it. I thought maybe that was an indication of an improperly encoded character, but now I realize that it's really a literal question mark. That makes it an invalid filename. See https://stackoverflow.com/questions/1033424/how-to-remove-bad-path-characters-in-python – Mark Ransom Oct 19 '17 at 18:52
  • @MarkRansom I didn't think it was a problem, since last time I successfully saved a file with ? in its name in that folder. Yesterday I tried removing the "?" and it ran smoothly! – Makiyo Oct 20 '17 at 01:47
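
The two points raised in the comments (open the file with an explicit encoding, and keep characters such as `?` out of the file name) can be combined roughly as follows. This is a minimal sketch, not the asker's final code: the helper names `safe_filename` and `save_article` are hypothetical, and the pattern only covers the characters Windows forbids in file names, not reserved device names such as CON or NUL.

```python
import json
import re
from pathlib import Path

# Characters Windows rejects in file names: < > : " / \ | ? *
# plus the control characters 0x00-0x1f.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, replacement="_"):
    """Replace characters that Windows forbids in file names (hypothetical helper)."""
    cleaned = _INVALID_CHARS.sub(replacement, title)
    # Windows also rejects names that end in a space or a dot.
    cleaned = cleaned.rstrip(" .")
    return cleaned or "untitled"


def save_article(data, folder):
    """Write `data` as UTF-8 JSON into `folder`, named after data["Title"]."""
    path = Path(folder) / (safe_filename(data["Title"]) + ".json")
    # An explicit encoding avoids falling back to the locale codec (cp950 here).
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
    return path
```

In the loop from the question, the `open`/`write` block could then be replaced by something like `save_article(data, 'C:/json_n/')`. Replacing the offending characters, rather than skipping the page, keeps all 1700 articles; two titles that differ only in a forbidden character would overwrite each other, which this sketch does not guard against.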
