
I have a bucket in S3 with a CSV in it.
There are no non-ASCII characters in it.
When I try to read it using Python, it will not let me.
I used: df = self.s3_input_bucket.get_file_contents_from_s3(path)
as I have on many recent occasions in the same script, and get: UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 14: invalid start byte.
To make sure it goes to the right path, I put another plain-text file in the same folder and was able to read it without a problem.

I tried many solutions I found in other questions. Just one example: I saw a suggestion to try this:

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')
from this question: UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c
But how can I use them in this case?
This did not work:

str = unicode(self.s3_input_bucket.get_file_contents_from_s3(path), errors='replace')

Zusman

  • The file is not encoded with UTF-8. You need to tell the S3 library to use a different codec. Which library are you using? A search for "AWS S3" returns multiple matches on PyPI. – lenz Feb 03 '19 at 10:58
  • This is one of the many weaknesses of the CSV format. As with all text files, you have to read it with the character encoding it was written with. If you don't know which it is, then there has been a failed communication. Can you ask the writer, refer to documentation, or check the HTTP headers? – Tom Blodget Feb 03 '19 at 14:44

2 Answers


Apparently, I was trying to open a zipped file.
After much research, I was able to read it into a data frame using this code:

import os
import zipfile

import pandas as pd
import s3fs

s3_fs = s3fs.S3FileSystem(s3_additional_kwargs={'ServerSideEncryption': 'AES256'})

# 'my-bucket' and 'path-in-bucket' are placeholders for the real bucket and key.
market_score = self._zipped_csv_from_s3_to_df(os.path.join('my-bucket', 'path-in-bucket'), s3_fs)

def _zipped_csv_from_s3_to_df(self, path, s3_fs):
    # Open the zipped object on S3, then read the first CSV inside it.
    with s3_fs.open(path) as zipped_dir:
        with zipfile.ZipFile(zipped_dir, mode='r') as zipped_content:
            for score_file in zipped_content.namelist():
                with zipped_content.open(score_file) as scores:
                    return pd.read_csv(scores)

I will always have only one CSV file inside the zip, which is why I know I can return on the first iteration.
However, the function does iterate over the files in the zip, as in the sketch below.
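
If the zip ever held more than one CSV, a small variation along the same lines (just a sketch, assuming every member of the archive is a CSV) could collect them all instead of returning on the first:

import zipfile

import pandas as pd
import s3fs

def zipped_csvs_from_s3_to_df(path, s3_fs):
    # Read every CSV inside a zipped S3 object into a single DataFrame.
    frames = []
    with s3_fs.open(path) as zipped_dir:
        with zipfile.ZipFile(zipped_dir, mode='r') as zipped_content:
            for score_file in zipped_content.namelist():
                with zipped_content.open(score_file) as scores:
                    frames.append(pd.read_csv(scores))
    return pd.concat(frames, ignore_index=True)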

Zusman

The error message in the question actually relates to a CSV encoding issue (quite separate from the title: "read zipped CSV from s3").

One possible solution to the title question is:

pd.read_csv('s3://bucket-name/path/to/zip/my_file.zip')

Pandas will open the zip and read in the CSV. This will only work if the zip contains a single CSV file. If there are multiple, another solution is required (perhaps more like OP's solution).

The encoding issue can be resolved by specifying the encoding type in the read. For example:

pd.read_csv('s3://bucket-name/path/to/zip/my_file.zip', encoding="ISO-8859-1")
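
If both issues show up at once (a zipped CSV that is not UTF-8), the two keyword arguments can be combined. A minimal sketch, using the same placeholder bucket and key and assuming s3fs is installed so pandas can resolve the s3:// URL:

import pandas as pd

# compression is normally inferred from the .zip extension, but it can be
# stated explicitly; the archive must still contain exactly one CSV file.
df = pd.read_csv(
    's3://bucket-name/path/to/zip/my_file.zip',
    compression='zip',
    encoding='ISO-8859-1',
)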

defraggled