I scraped a website whose pages contain articles in Traditional Chinese, Simplified Chinese and English. Storing the articles in a dict and printing them works fine, but writing them to files in a folder raises an error. I tried several ways of specifying UTF-8 encoding in open(), but it still does not work. I am running the code in Jupyter (Anaconda).
Code:
for urls in all:
    re=requests.get(urls)
    soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
    title_tag = soup.select_one('.mms_article_title')
    #print(title_tag.text)
    list=[]
    for tag in soup.select('.mms_article_content'):
        list.append(tag.text)
    list=([c.replace('\n', '') for c in list])
    list=([c.replace('\r', '') for c in list])
    list=([c.replace('\t', '') for c in list])
    list=([c.replace(u'\xa0', u' ') for c in list])
    list= (', '.join(list))
    data={
        "Title" : title_tag.text,
        "Article": list
    }
    save_path= 'C:/json_n/'
    file_name=save_path+'%s.json' % title_tag.text
    with open(file_name, 'w') as f:
        print(file_name)
        file = json.dumps(data,ensure_ascii=False)
        f.write(file)
There are 1700 articles to save, but only 2 file names are printed. Both files are created in the folder "json_n", but only the first JSON file actually contains data. The second one is empty: its content is in Simplified Chinese and could not be written.
C:/json_n/肝動脈栓塞術.json
C:/json_n/心臟電氣生理學檢查注意事項(簡體中文).json
Error:
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-39-e73321a3e622> in <module>()
21 print(file_name)
22 file = json.dumps(data,ensure_ascii=False)
---> 23 f.write(file)
UnicodeEncodeError: 'cp950' codec can't encode character '\u810f' in position 67: illegal multibyte sequence
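For reference, the failure can be reproduced outside the scraping loop. This is only a minimal sketch, assuming that open() without an explicit encoding falls back to cp950 on this machine (as the traceback suggests). The sample dict and the temporary path are made up for the demonstration and are not part of my real code.

```python
import json
import os
import tempfile

# '\u810f' is the character reported in the traceback.
text = '\u810f'

# Encoding it with cp950 fails. The same thing happens implicitly when
# open() uses the locale's default encoding (assumed to be cp950 here).
try:
    text.encode('cp950')
    cp950_ok = True
except UnicodeEncodeError:
    cp950_ok = False

# Hypothetical sample data and a temporary path, used only for this demo.
data = {"Title": "demo", "Article": text}
path = os.path.join(tempfile.mkdtemp(), 'demo.json')

# With an explicit UTF-8 encoding the same character is written without error.
with open(path, 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

# Read the file back to check that the content round-trips.
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)

print(cp950_ok, loaded == data)
```

If this behaves the same way on the real data, the encoding of the file content and the validity of the file name are probably two separate problems.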
When I set encoding in open:
with open(file_name, 'w', encoding="utf-8") as f:
It still prints only 2 file names, and the second file is still empty.
C:/json_n/肝動脈栓塞術.json
C:/json_n/心臟電氣生理學檢查注意事項(簡體中文).json
Error:
OSError Traceback (most recent call last)
<ipython-input-44-256bcf14fcbe> in <module>()
18 save_path= 'C:/json_n/'
19 file_name=save_path+'%s.json' % title_tag.text
---> 20 with open(file_name, 'w', encoding="utf-8") as f:
21 print(file_name)
22 file = json.dumps(data,ensure_ascii=False)
OSError: [Errno 22] Invalid argument: 'C:/json_n/如何使用胰島素空針抽取短效型胰島素?.json'
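One possible cause of this OSError is the '?' in the title, since Windows does not allow that character in file names. Below is a rough sketch of how the title could be cleaned before building the path. The helper name safe_file_name and the choice of '_' as the replacement are my own assumptions, not something taken from a library.

```python
# Characters that Windows does not allow in file names.
INVALID_CHARS = '<>:"/\\|?*'

# Translation table: map each invalid character (and the ASCII control
# characters 0-31) to an underscore.
_TABLE = {ord(c): '_' for c in INVALID_CHARS}
_TABLE.update({i: '_' for i in range(32)})


def safe_file_name(title):
    """Hypothetical helper: make a scraped title usable as a Windows file name."""
    cleaned = title.translate(_TABLE)
    # Windows also rejects names that end in a space or a dot.
    return cleaned.rstrip(' .')


title = '如何使用胰島素空針抽取短效型胰島素?'
file_name = 'C:/json_n/' + safe_file_name(title) + '.json'
print(file_name)
```

In the loop this would replace title_tag.text when building file_name, while the "Title" field in the JSON data could keep the original text.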