Posting again because the previous post contained the API token. I am scraping data from a website. Here is the code:

import codecs
import socket
import sys
from datetime import datetime, timedelta

import requests

# Python 2 only: change the interpreter's default encoding
reload(sys)
sys.setdefaultencoding('utf-8-sig')

def __unicode__(self):
    return unicode(self.some_field) or u''

def daterange(start_date, end_date):
    # Yield one date per day, starting the day after start_date
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n + 1)

#def is_ascii(s):
    #return all(ord(c) < 128 for c in s)

date = ''
min_date = ''
max_date = ''
if sys.argv[1] == 'today':
    min_date = datetime.today() - timedelta(1)
    max_date = datetime.today()
elif sys.argv[1] == 'yesterday':
    min_date = datetime.today() - timedelta(2)
    max_date = datetime.today() - timedelta(1)
else:
    min_date = datetime.strptime(sys.argv[1], "%Y-%m-%d") - timedelta(1)
    max_date = datetime.strptime(sys.argv[2], "%Y-%m-%d")

# token_auth must be defined before this point (left out of the post on purpose)
siteIDs = [37]
for site_id in siteIDs:
    for date in daterange(min_date, max_date):
        url = 'http://survey.modul.ac.at/piwikAnalytics/?module=API&method=Live.getLastVisitsDetails&idSite=' + str(site_id) + '&format=csv&token_auth=' + token_auth + '&period=day&date=' + date.strftime('%Y-%m-%d') + '&filter_limit=2000'
        try:
            response = requests.get(url, timeout=100)
        except (requests.exceptions.RequestException, socket.error) as e:
            # Skip this day rather than writing a file from a failed request
            print('Request failed for %s: %s' % (date.strftime('%Y-%m-%d'), e))
            continue
        # 'utf-8-sig' writes a byte order mark (BOM) at the start of the file
        with codecs.open('raw_csv/piwik_' + str(site_id) + '_' + date.strftime('%Y-%m-%d') + '.csv', 'wb', encoding='utf-8-sig') as fp:
            fp.write(response.text)

In the output, the column 'idSite' is being shown as 'idSite'. I tried to remove it with the following code:

import pandas as pd

df = pd.read_csv("piwik_37_2016-07-08.csv", dtype = "unicode", encoding="utf-8-sig")
df.to_csv("abc.csv")

But I am getting the above-mentioned Unicode error.

Diganta Bharali
  • Possible duplicate of [Is there an easy way to make unicode work in python?](http://stackoverflow.com/questions/12556839/is-there-an-easy-way-to-make-unicode-work-in-python) – l'L'l Jul 10 '16 at 01:24
  • You get unicode values when you perform the read operation. These unicode chars need to be converted into bytes before writing it to a csv file. So, you must do this: `df.to_csv("abc.csv", encoding='utf-8')` – Nickil Maveli Jul 10 '16 at 10:02
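
The point made in the comment above (text has to be encoded explicitly when it is written back out) can be sketched with the standard library alone. This is only an illustration under assumptions: it targets Python 3, the sample rows and temporary file names are made up, and it stands in for the pandas calls rather than reproducing them.

```python
import csv
import os
import tempfile

# Hypothetical sample standing in for one Piwik export (not real data).
rows = [["idSite", "visitorName"], ["37", "Zo\u00eb"]]

tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "piwik_sample.csv")  # made-up file name
dst = os.path.join(tmp_dir, "abc.csv")

# Write the sample with a BOM, as encoding='utf-8-sig' does in the question.
with open(src, "w", encoding="utf-8-sig", newline="") as fp:
    csv.writer(fp).writerows(rows)

# Reading with 'utf-8-sig' drops the BOM, so the first header is plain 'idSite'.
with open(src, "r", encoding="utf-8-sig", newline="") as fp:
    data = list(csv.reader(fp))

# Writing with an explicit encoding avoids relying on an ASCII default.
with open(dst, "w", encoding="utf-8", newline="") as fp:
    csv.writer(fp).writerows(data)

# Inspect the raw bytes of the rewritten file.
with open(dst, "rb") as fp:
    raw = fp.read()
```

After this runs, `data[0][0]` is the plain header and `raw` holds UTF-8 bytes without a leading BOM. With pandas, the equivalent would presumably be passing `encoding=` to both `read_csv` and `to_csv`, as the comment suggests.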

1 Answer

One brute-force way to remove all non-ASCII characters from a string is:

import re
# Replace each run of non-ASCII characters with a single space.
# `text` is the string to clean; naming it `str` would shadow the built-in type.
text = re.sub(r'[^\x00-\x7F]+', ' ', text)

Hope that helps in your case.
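
To see what that substitution actually does, here is a small Python 3 sketch that compares it with a narrower alternative. It assumes the stray characters in the header are a byte order mark (U+FEFF), which the question does not confirm. The helper names `strip_non_ascii` and `strip_bom` and the sample strings are made up for illustration.

```python
import re

def strip_non_ascii(text):
    # Replace each run of non-ASCII characters with a single space.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

def strip_bom(text):
    # Remove only a leading byte order mark, leaving other characters alone.
    return text[1:] if text.startswith('\ufeff') else text

header = '\ufeffidSite'        # header with an assumed leading BOM
name = 'Zo\u00eb M\u00fcller'  # made-up value with legitimate non-ASCII letters

brute_header = strip_non_ascii(header)  # ' idSite' (a leading space remains)
brute_name = strip_non_ascii(name)      # accented letters are lost
clean_header = strip_bom(header)        # 'idSite'
clean_name = strip_bom(name)            # unchanged
```

The brute-force version leaves a space where the BOM was and also destroys accented letters in the data, so removing only the BOM (or reading the file with `encoding='utf-8-sig'`) may be the safer choice if the data contains non-English text.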

Kevin