I am having trouble extracting Chinese text from a web page and writing it to a file.

text = "全球紧张致富豪财富缩水 贝索斯丁磊分列跌幅前两位"
with open('test.txt', 'w') as f:
    f.write(text)

The code above runs fine. The code below, however, writes gibberish to the file.

import requests
from bs4 import BeautifulSoup

f = open('data.txt', 'w')

def techSinaCrawler():
    url = "http://tech.sina.com.cn/"
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for li in soup.findAll('li', {'data-sudaclick': 'yaowenlist-1'}):
        for link in li.findAll('a'):
            href = link.get('href')
            techSinaInsideLinkCrawler(href)

def techSinaInsideLinkCrawler(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for data in soup.findAll('h1', {'id': 'main_title'}):
        line = 'main_title' + ':' + data.string
        f.write(line)
        f.write('\n')

techSinaCrawler()

Thanks for the help

  • What character set are you using? – Jay Aug 11 '17 at 17:26
  • The website uses the UTF-8 character set. – Zain Danish Aug 11 '17 at 17:31
  • [This](https://stackoverflow.com/questions/20205455/how-to-correctly-parse-utf-8-encoded-html-to-unicode-strings-with-beautifulsoup) and [this](https://stackoverflow.com/questions/7219361/python-and-beautifulsoup-encoding-issues) might be useful dealing with BeautifulSoup encoding issues. – Ramon Aug 11 '17 at 19:19

2 Answers

In Python 2, it's a good idea to use codecs.open() if you're dealing with encodings other than ASCII. That way, you don't need to manually encode everything you write. Also, os.walk() should be passed a Unicode string if you're expecting non-ASCII characters in the filenames:

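import codecs
import os

# Python 2: codecs.open() encodes everything written to the file as UTF-8.
with codecs.open("c:/Users/me/filename.txt", "a", encoding="utf-8") as d:
    # A Unicode start path makes os.walk() return Unicode file names.
    for dirpath, subdirs, files in os.walk(u"c:/temp"):
        for f in files:
            fname = os.path.join(dirpath, f)
            print fname
            d.write(fname + u"\n")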

There is no need to call d.close(); the with block already takes care of that.
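
The same idea can be sketched for the file from the question. This is only a minimal sketch: write_titles is a hypothetical helper name, and the sample title is the string from the question rather than scraped output. io.open() is used because it accepts an encoding argument in Python 2 and is the same function as the built-in open() in Python 3.

```python
# -*- coding: utf-8 -*-
import io


def write_titles(path, titles):
    """Append each title to `path` as UTF-8, one per line (hypothetical helper)."""
    # The explicit encoding means the platform default is never used.
    with io.open(path, "a", encoding="utf-8") as out:
        for title in titles:
            out.write(u"main_title:" + title + u"\n")


# Sample title taken from the question, standing in for scraped data.
write_titles("data.txt", [u"全球紧张致富豪财富缩水 贝索斯丁磊分列跌幅前两位"])
```

Opening the file once and passing the titles in also avoids the module-level file handle that the question's code never closes.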

Akshay Prabhakar

Solved.

I changed .text to .content, so that

plain_text = source_code.text

became

plain_text = source_code.content

With that change, the titles are written to the file as Chinese text, which is the desired result.
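
A likely explanation, assuming the server's Content-Type header did not name a charset: requests then falls back to ISO-8859-1 when building .text, whereas .content hands the raw bytes to BeautifulSoup, which works out the encoding itself. The sketch below reproduces only the decoding step, without any network access; the headline is the sample string from the question.

```python
# -*- coding: utf-8 -*-

# Sample headline from the question, encoded the way a UTF-8 page travels over the wire.
raw = u"全球紧张致富豪财富缩水".encode("utf-8")

# Decoding with the wrong charset turns every byte into one Latin-1 character.
garbled = raw.decode("iso-8859-1")

# Decoding with the charset the page really uses restores the text.
restored = raw.decode("utf-8")

print(len(raw), len(garbled), len(restored))
```

With requests, setting source_code.encoding = 'utf-8' before reading source_code.text should have the same effect, assuming the page really is UTF-8 as stated in the comments.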