
I am trying to read a tab-separated .txt file in Python that I fetched from AWS S3 storage (AWS credentials censored as XXX):

import io
import pandas as pd
import boto3
import csv
from bioservices import UniProt
from sqlalchemy import create_engine
s3 = boto3.resource(
    service_name='s3',
    region_name='us-east-2',
    aws_access_key_id='XXX',
    aws_secret_access_key='XXX'
)

That is simply for connecting to AWS. Next, when I run this code to read a tab-separated txt file stored in S3:

txt = s3.Bucket('compound-bioactivity-original-files').Object('helper-files/kinhub_human_kinase_list_30092021.txt').get()
txt_reader = csv.reader(txt, delimiter='\t')
for line in txt_reader:
    print(line)

I get this output, which is not what I am looking for. Using dialect='excel-tab' instead of delimiter='\t' gives the same output as well:

['ResponseMetadata']
['AcceptRanges']
['LastModified']
['ContentLength']
['ETag']
['VersionId']
['ContentType']
['Metadata']
['Body']
  • What about detecting the lines with `\n` or `\r` and then iterating in it with a `\t` delimiter before appending the given lines to each other in an array or dataframe? – Mayeul sgc Jan 04 '22 at 02:55
  • **Side-note:** It is generally bad practice to store your credentials in your source code. Instead, store AWS credentials in a credentials file using the AWS CLI `aws configure` command, and boto3 will automatically find them. – John Rotenstein Jan 04 '22 at 05:18

1 Answer


There are several issues with your code.

First, Object.get() does not return the contents of the Amazon S3 object. Instead, as per the Object.get() documentation, it returns:

{
    'Body': StreamingBody(),
    'AcceptRanges': 'string',
    'LastModified': datetime(2015, 1, 1),
    'ContentLength': 123,
    'ETag': 'string',
    'VersionId': 'string',
    'CacheControl': 'string',
    'ContentDisposition': 'string',
    ...
    'BucketKeyEnabled': True|False,
    'TagCount': 123,
}

You can see this happening by inserting print(txt) as a debugging line.

If you want to access the contents of the object, you would use the Body element. To retrieve the contents of the streaming body, you can use .read().

However, this comes back as a binary string since the object is treated as a binary file. In Python, you can convert it back to ASCII by using .decode('ascii'). See: How to convert 'binary string' to normal string in Python3?
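For illustration, a stand-alone sketch of that decode step; the byte string here is made up to stand in for what StreamingBody.read() returns:

```python
# A made-up byte string standing in for StreamingBody.read() output
raw = b'Name\tFamily\nABL1\tTK\n'

# Convert the binary string to a normal string
text = raw.decode('ascii')  # use 'utf-8' if the file may contain non-ASCII characters

print(text)
```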

Therefore, you would actually need to use:

txt = s3.Bucket('bucketname').Object('object.txt').get()['Body'].read().decode('ascii')

(If that seems too complex, then you could have simply downloaded the file to the local disk, then use the CSV Reader on the file -- it would have worked nicely without having to use get/read/decode.)
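A sketch of that simpler approach. The bucket, key, and local filename below are placeholders, and the boto3 download call is shown commented out since it needs real credentials:

```python
import csv

def read_tsv(path):
    # Parse a local tab-separated file into a list of rows
    with open(path, newline='') as f:
        return list(csv.reader(f, delimiter='\t'))

# First download the object to local disk (placeholder bucket/key/filename):
# import boto3
# s3 = boto3.resource('s3')
# s3.Bucket('bucketname').Object('object.txt').download_file('local.txt')
# rows = read_tsv('local.txt')
```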

The next issue is that the documentation for csv.reader says:

csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called

Since the decode() command returns a string, then the for loop will iterate over individual characters in the string, not lines within the string.
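If you do want to keep using the CSV Reader, you can wrap the decoded string in io.StringIO so the reader iterates line by line rather than character by character. A minimal sketch, with inline data standing in for the downloaded contents:

```python
import csv
import io

# Stand-in for the decoded S3 object contents
txt = 'Name\tFamily\nABL1\tTK\n'

# io.StringIO makes the string iterable line by line, as csv.reader expects
reader = csv.reader(io.StringIO(txt), delimiter='\t')
for row in reader:
    print(row)  # ['Name', 'Family'] then ['ABL1', 'TK']
```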

Frankly, you could process the lines without using the CSV Reader, simply by splitting on the lines and the tabs, like this:

import boto3

s3 = boto3.resource('s3')

txt = s3.Bucket('bucketname').Object('object.txt').get()['Body'].read().decode('ascii')

lines = txt.split('\n')

for line in lines:
    fields = line.split('\t')
    print(fields)

All of the above issues would have been noticeable by adding some debugging to check whether each step returned the data you expected, such as printing the contents of the variables after each step.

John Rotenstein