0

I am downloading a csv file from gmail attachment using below code. The output file generated looks like ^I delimited and I am not sure how to change it to some other delimiter so that I can load it into a pandas dataframe.

message = service.users().messages().get(userId=user_id, id=msg_id).execute()
for part in message['payload'].get('parts', ''):
    if part['filename']:
        file_name = part['filename']
        att_id = part['body']['attachmentId']
        att = service.users().messages().attachments().get(userId=user_id, messageId=msg_id,
                                                               id=att_id).execute()
        data = att['data']
file_data = base64.urlsafe_b64decode(data.encode('UTF-8'))
with open(path, 'wb') as f:
    f.write(file_data)

Output file content is separated like below - Location ID^IProcessed Currency Code^IExternal MID^IDBA Name

Edit 1: Now I am using tab as separator but get a different error while reading file into dataframe. Perhaps, it has got something to do with urlsafe_b64decode or data.encode('UTF-8') but I am not able to understand. The actual attachment file in gmail is a tab separated file.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

conetfun
  • 1,605
  • 4
  • 17
  • 38
  • The actual column names are Location ID, Processed Currency Code, External MID, DBA Name – conetfun May 12 '19 at 18:53
  • You can specify the separator while reading a csv with pandas: `pd.read_csv(filename, sep='^|')` – Erfan May 12 '19 at 18:58
  • Possible duplicate of [How to load a tsv file into a Pandas DataFrame?](https://stackoverflow.com/questions/9652832/how-to-load-a-tsv-file-into-a-pandas-dataframe) – snakecharmerb May 12 '19 at 18:59
  • 1
    `^I` is another way of representing a tab, so this is likely a tab-separated file. See the dupe target for how to load. – snakecharmerb May 12 '19 at 19:00
  • @Erfan - I tried that but it is not able to find ^I in file and puts everything into a single column – conetfun May 12 '19 at 19:00
  • Then see the question that is tagged by @snakecharmerb, its `sep='\t'` – Erfan May 12 '19 at 19:02
  • @snakecharmerb - Tried using df = pd.read_csv(path, header=0, sep='\t') and get following error - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte – conetfun May 12 '19 at 19:02

1 Answers1

2

There are two issues here:

  • the column delimiter '^I'
  • the encoding of the data

The delimiter

'^I' is another way of representing the tab character. Some applications may choose to represent tab characters in this way rather than using the '\t' escape sequence. For example vim displays tabs as '^I' when instructed to display non-printing characters.

Encoding

'0xff' is part of a byte-order-mark or BOM - an invisible character sequence used to tell applications about the encoding used to encode text. The UTF-16 encoding uses the sequence '0xfe0xff' to denote big-endian UTF-16 (Python codec 'utf-16-be'). '0xff0xfe' denotes little-endian UTF-16 (Python codec 'utf-16-le'). UTf-16 uses two bytes to encode characters, so the order of the bytes is important.

So, to successfully open the attachment, it's necessary to specify that the column delimiter is a tab, and the encoding is UTF-16 (Python can use the BOM to work out which version of UTF-16):

pd.read_csv(path, header=0, sep='\t', encoding='utf-16')
snakecharmerb
  • 47,570
  • 11
  • 100
  • 153