I'm trying to download and save a .csv file that I receive daily in my Outlook inbox. The code saves the file to a local folder, but when I try to read it with pandas, I get a 'utf-8' codec error.

import os
from enum import Enum
import win32com.client as win32
from datetime import datetime
import pandas as pd

class OutlookFolder(Enum):
    olFolderInbox = 6

outlook = win32.Dispatch("Outlook.Application").GetNamespace("MAPI")

inbox = outlook.GetDefaultFolder(OutlookFolder.olFolderInbox.value)

messages = inbox.Items
messages = messages.Restrict("[SenderEmailAddress] = ABCD@XYZ.com")

t = datetime.today().date()

outputDir = r"C:\Users\ABCD\Documents\ABCD\Daily Seed File"
try:
    for message in list(messages):
        if message.ReceivedTime.date() == t:
            try:
                s = message.sender
                for attachment in message.Attachments:
                    fn = attachment.FileName[:-4] + "_" + str(message.ReceivedTime.date()) + ".csv"
                    attachment.SaveASFile(os.path.join(outputDir,fn))
                    print(f"attachment {attachment.FileName} from {s} saved")
            except Exception as e:
                print("error when saving the attachment:" + str(e))
except Exception as e:
    print("error when processing email messages:" + str(e))

fp = outputDir + "\\" + fn

df = pd.read_csv(fp)
df.head()

I'm getting the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Eugene Astafiev
ask1504
  • Does this answer your question? [error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte](https://stackoverflow.com/questions/42339876/error-unicodedecodeerror-utf-8-codec-cant-decode-byte-0xff-in-position-0-in) – Paul Jun 29 '22 at 14:50
  • That means the file is *NOT* a UTF8 file. It may be UTF8 with a BOM, or UTF16. Try loading it with `encoding='utf-8-sig'` or `encoding='utf-16'` – Panagiotis Kanavos Jun 29 '22 at 14:51
  • @Paul that's not a helpful question. Pandas already tried using UTF8, so the accepted answer isn't useful. Ignoring errors isn't useful either, so the second is wrong. And this is definitely not Latin1, which means the 3rd is wrong too. It's quite likely this is a UTF16 LE file with a [Byte Order Mark](https://en.wikipedia.org/wiki/Byte_order_mark), which *does* start with 0xFF. – Panagiotis Kanavos Jun 29 '22 at 14:55
  • @PanagiotisKanavos, using `encoding='utf-16'` helped, but it is combining all the columns into one and all rows into a binary output - 0/1. Any suggestions? – ask1504 Jun 29 '22 at 15:06
  • What does `binary output` mean? UTF16 is just text. Have you tried opening that file? What's the column separator? Is it `;` or a tab perhaps? – Panagiotis Kanavos Jun 29 '22 at 15:08
  • Is that file an actual CSV file? Your code is stripping the extension and appending `.csv`. Perhaps it's an Excel file or a ZIP file. What does `attachment.FileName` contain? At the very least preserve the original extension instead of replacing it – Panagiotis Kanavos Jun 29 '22 at 15:09
  • I added another arg to `pandas.read_csv()`, namely `sep="\t"`, and it worked – ask1504 Jun 29 '22 at 15:23
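
The point made in the comments, that a leading 0xFF byte suggests a UTF-16 LE byte order mark, can be checked before calling pandas. The following is only a minimal sketch using the standard `codecs` module; the helper name `sniff_bom_encoding` and its `default` fallback are hypothetical, not part of pandas or the original code, and a file without a BOM simply falls back to the default.

```python
import codecs

# Known byte order marks. The UTF-32 marks come first because the
# UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom_encoding(head: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on the BOM at the start of `head`.

    Hypothetical helper: falls back to `default` when no BOM is found,
    which does not prove the data really is in that encoding.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

With the saved attachment, this could be used as `with open(fp, "rb") as f: encoding = sniff_bom_encoding(f.read(4))`, and the result passed to the `encoding` argument of `pd.read_csv`.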

2 Answers

You could try to use the encoding parameter:

df = pd.read_csv(fp, encoding='utf8')
svfat

The file was UTF-16 encoded and used a tab (`\t`) as the column separator. The following works:

df = pd.read_csv(fp, encoding="utf-16", sep="\t")
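
To see why both arguments matter, here is a small self-contained sketch. It assumes a made-up three-line sample (the column names `id` and `name` are invented for illustration) written to a temporary file as UTF-16 with tab separators, standing in for the real attachment. Reading it with only `encoding="utf-16"` leaves everything in one column, which matches the behaviour described in the comments.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the saved attachment.
sample = "id\tname\n1\talpha\n2\tbeta\n"

with tempfile.TemporaryDirectory() as tmp:
    fp = os.path.join(tmp, "sample.csv")

    # Python's "utf-16" codec writes a byte order mark at the start of the file.
    with open(fp, "w", encoding="utf-16", newline="") as f:
        f.write(sample)

    # Correct encoding and separator: two columns.
    df = pd.read_csv(fp, encoding="utf-16", sep="\t")

    # Correct encoding but default comma separator: one combined column.
    one_col = pd.read_csv(fp, encoding="utf-16")

print(df.shape)
print(one_col.shape)
```

If the real file uses a different delimiter, such as `;`, only the `sep` value needs to change.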
ask1504