6

Is it possible to loop through a file/key in an Amazon S3 bucket, read its contents, and count the number of lines using Python?

For Example:

  1. My bucket: "my-bucket-name"
  2. File/Key: "test.txt"

I need to loop through the file "test.txt" and count the number of lines in the raw file.

Sample Code:

import boto

conn = boto.connect_s3()  # uses credentials from the environment/boto config

for bucket in conn.get_all_buckets():
    if bucket.name == "my-bucket-name":
        for file in bucket.list():
            # need to count the number of lines in each file and print to a log
– Renukadevi

4 Answers

7

Using boto3 you can do the following:

import boto3

# create the S3 resource
s3 = boto3.resource('s3')

# get the file object
obj = s3.Object('bucket_name', 'key')

# read the file contents into memory (as bytes)
file_contents = obj.get()["Body"].read()

# count occurrences of the newline character to get the number of lines
print(file_contents.count(b'\n'))

If you want to do this for all objects in a bucket, you can use the following code snippet:

bucket = s3.Bucket('bucket_name')
for obj in bucket.objects.all():
    file_contents = obj.get()["Body"].read()
    print(file_contents.count(b'\n'))

For more functionality, see the boto3 documentation: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#object

Update: (Using boto 2)

import boto

s3 = boto.connect_s3()  # establish a connection
bucket = s3.get_bucket('bucket_name')  # get the bucket

for key in bucket.list(prefix='key'):  # list objects under a given prefix
    file_contents = key.get_contents_as_string()  # get the file contents
    print(file_contents.count('\n'))  # count newline characters to get the number of lines
– tamjd1
  • Trouble is, I am not using boto 3.0. My version of boto is 2.38.0, hence I cannot try the s3.Object methods. Another issue is that my files are all in .gz format, and it gets even worse when I try to use Key.open_read as a fd for gzip.GzipFile. It errors with AttributeError: 'str' object has no attribute 'tell' or 'seek'. I was wondering if there is any workaround. – Renukadevi May 31 '16 at 04:05
  • @Renukadevi, I updated my post to add an example for boto 2. To decompress gzip data, you can probably use the zlib library; see the example here: http://stackoverflow.com/a/2695575/4072706 (a sketch along those lines follows these comments). Hope this helps. – tamjd1 May 31 '16 at 04:56
  • Thanks a ton, that was so simple. I am new to AWS and your solution helped a lot. – Renukadevi May 31 '16 at 05:41
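
For the gzipped files discussed in these comments, here is a minimal sketch of decompressing an object's contents in memory before counting lines. It assumes Python 3 with boto3, uses gzip.decompress as the stdlib equivalent of the zlib approach linked above, and the bucket/key names are hypothetical placeholders:

import gzip

import boto3

s3 = boto3.resource('s3')

# hypothetical bucket/key for illustration only
obj = s3.Object('my-bucket-name', 'test.txt.gz')

# download the compressed bytes, then decompress them in memory
compressed = obj.get()["Body"].read()
text = gzip.decompress(compressed)  # assumes a single-member gzip file

print(text.count(b'\n'))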
1

Reading large files into memory is sometimes far from ideal. Instead, you may find the following streaming approach more useful:

import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucketname', Key=fileKey)

# stream the body line by line instead of reading it all into memory
nlines = 0
for _ in obj['Body'].iter_lines():
    nlines += 1

print(nlines)
– user2589273
0

Amazon S3 is only a storage service. You must get the file in order to perform actions on it (e.g. counting the number of lines). A minimal sketch of that point follows below.
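
For instance, a minimal sketch that downloads the object to local disk first and then counts its lines (bucket, key, and local path are hypothetical placeholders):

import boto3

s3 = boto3.client('s3')

# hypothetical bucket/key/local path for illustration
s3.download_file('my-bucket-name', 'test.txt', '/tmp/test.txt')

# iterate the local file line by line and count
with open('/tmp/test.txt', 'rb') as f:
    print(sum(1 for _ in f))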

– ItayD
0

You can loop through a bucket using boto3's list_objects_v2. Because list_objects_v2 lists a maximum of 1,000 keys per call (even if you specify a larger MaxKeys), you must check whether NextContinuationToken exists in the response dictionary, and if so pass it as ContinuationToken to read the next page; a sketch of this pattern follows below.

I wrote sample code for this in another answer, but I can't recall which one.

Then use get_object() to read each file, and apply a simple line count to the contents.

(Update) If you only need keys under a particular prefix, add the Prefix filter to the request.
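
Here is a minimal sketch of the pagination pattern described above (the bucket name and prefix are hypothetical placeholders, not from the original answer):

import boto3

s3 = boto3.client('s3')

# hypothetical bucket and prefix for illustration
kwargs = {'Bucket': 'my-bucket-name', 'Prefix': 'logs/'}
total_lines = 0

while True:
    resp = s3.list_objects_v2(**kwargs)
    for item in resp.get('Contents', []):
        body = s3.get_object(Bucket=kwargs['Bucket'], Key=item['Key'])['Body']
        total_lines += sum(1 for _ in body.iter_lines())
    # at most 1,000 keys are returned per call; follow the continuation token
    if 'NextContinuationToken' in resp:
        kwargs['ContinuationToken'] = resp['NextContinuationToken']
    else:
        break

print(total_lines)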

– mootmoot
  • Hi, thanks. Maybe I didn't phrase my question properly. I want to iterate through specific files in S3 and count the number of rows in each. – Renukadevi May 30 '16 at 09:47
  • @Renukadevi: Please clarify the meaning of "specific". Do you mean files with a certain prefix? – mootmoot May 30 '16 at 09:52