reading tar file contents without untarring it, in python script

Question

I have a tar file which has a number of files within it. I need to write a Python script which will read the contents of the files and give the count of total characters, including the total number of letters, spaces, newline characters, everything, without untarring the tar file.

How can you count the characters/letters/spaces/everything without extracting those to somewhere else? – YOU, Jan 07 '10 at 06:17

score 146 · Accepted Answer · edited Dec 20 '19 at 16:16

You can use getmembers() to list the members of the archive:

>>> import tarfile
>>> tar = tarfile.open("test.tar")
>>> tar.getmembers()

After that, you can use extractfile() to get each member as a file object, without writing anything to disk. For example:

import os
import tarfile

os.chdir("/tmp/foo")
with tarfile.open("test.tar") as tar:
    for member in tar.getmembers():
        f = tar.extractfile(member)
        if f is None:  # directories and other non-file members have no content
            continue
        content = f.read()  # bytes; decode first if you need true character counts
        print("%s has %d newlines" % (member.name, content.count(b"\n")))
        print("%s has %d spaces" % (member.name, content.count(b" ")))
        print("%s has %d characters" % (member.name, len(content)))

With the file object f in the above example, you can use read(), readlines() etc.
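To answer the original question directly, the same idea can be extended to totals over the whole archive. The following is only a sketch: the sample archive, the file names, and the helper names `build_sample_tar` and `count_characters` are made up for illustration, and it assumes every member is UTF-8 text.

```python
import io
import tarfile


def build_sample_tar():
    """Build a small tar archive in memory (hypothetical sample data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, text in [("a.txt", "hello world\n"), ("b.txt", "one two\nthree\n")]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def count_characters(fileobj):
    """Return totals over every regular file; nothing is extracted to disk."""
    totals = {"characters": 0, "letters": 0, "spaces": 0, "newlines": 0}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, devices
            # Assumption: members are UTF-8 text; adjust the codec as needed.
            text = tar.extractfile(member).read().decode("utf-8")
            totals["characters"] += len(text)
            totals["letters"] += sum(ch.isalpha() for ch in text)
            totals["spaces"] += text.count(" ")
            totals["newlines"] += text.count("\n")
    return totals


totals = count_characters(build_sample_tar())
print(totals)
```

For a real archive on disk, `tarfile.open("test.tar")` can be used in place of the in-memory file object.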

edited Dec 20 '19 at 16:16

phoenix

answered Jan 07 '10 at 06:17

ghostdog74

27

"for member in tar.getmembers()" can be changed to "for member in tar" which is either a generator or an iterator (I'm not sure which). But it gets a member one at a time. – huggie Dec 28 '11 at 09:24
2

I just had a similar problem, but the tarfile module seems to eat my ram, even though I used the `'r|'` option. – devsnd May 21 '12 at 17:39
2

Ah. I solved it. Assuming you would write the code as hinted by huggie, you have to "clean" the list of members once in a while. So given the code example above, that would be `tar.members = []`. More Info here: http://bit.ly/JKXrg6 – devsnd May 21 '12 at 17:45
Will `tar.getmembers()` be called multiple times when it is put in a `for member in tar.getmembers()` loop? – Haifeng Zhang Mar 04 '15 at 16:31
what if the files are nested into subfolders? normal os.path.isdir() operations don't work – Jan 13 '17 at 22:43
1

After you do "f=tar.extractfile(member)", do you need to also close f? – bolei May 11 '17 at 00:06
This solution poses a security issue. [Check this bug](https://bugs.python.org/issue21109) for more information. – MeanEYE Oct 10 '18 at 11:27
Since `extractfile` doesn't provide an `encoding` attribute, if you need a text stream, you can do `f = codecs.getreader("utf-8")(f)`. – Thomas Ahle Jul 17 '19 at 21:39
This solution performs badly when the tar file has a lot of files inside, because it needs to resolve all the file names first. – Jingpeng Wu Nov 01 '19 at 15:40
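Several of the comments above (iterating over `tar` directly, the `'r|'` stream mode, resetting `tar.members`, and files nested in subfolders) can be combined into one sketch. The archive contents and the helper names `make_tar_bytes` and `stream_sizes` are hypothetical. Resetting `tar.members` follows devsnd's comment and relies on an attribute that is not part of the documented API, so treat it as an assumption.

```python
import io
import tarfile


def make_tar_bytes(files):
    """Build an in-memory archive; `files` maps hypothetical names to bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        sub = tarfile.TarInfo(name="sub")
        sub.type = tarfile.DIRTYPE  # a directory entry, to show it being skipped
        tar.addfile(sub)
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def stream_sizes(raw):
    """Read members one at a time in stream mode and record their sizes."""
    sizes = {}
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r|") as tar:
        for member in tar:  # yields one TarInfo at a time
            if member.isfile():  # use isfile()/isdir(), not os.path.isdir()
                sizes[member.name] = len(tar.extractfile(member).read())
            # Assumption (from the comment above): dropping the cached
            # TarInfo objects keeps memory flat on very large archives.
            tar.members = []
    return sizes


raw = make_tar_bytes({"sub/a.txt": b"abc", "b.txt": b"hello"})
sizes = stream_sizes(raw)
print(sizes)
```

In stream mode each member has to be read when the loop reaches it, since the stream cannot seek backwards.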

score 14 · Answer 2 · answered Jan 07 '10 at 06:01

You need to use the tarfile module. Specifically, you use an instance of the class TarFile to access the file, and then access the names with TarFile.getnames():

 |  getnames(self)
 |      Return the members of the archive as a list of their names. It has
 |      the same order as the list returned by getmembers().

If instead you want to read the content, then you use this method

 |  extractfile(self, member)
 |      Extract a member from the archive as a file object. `member' may be
 |      a filename or a TarInfo object. If `member' is a regular file, a
 |      file-like object is returned. If `member' is a link, a file-like
 |      object is constructed from the link's target. If `member' is none of
 |      the above, None is returned.
 |      The file-like object is read-only and provides the following
 |      methods: read(), readline(), readlines(), seek() and tell()
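A short illustration of the two docstrings above, using a throwaway archive built in memory. The member names `docs` and `docs/notes.txt` are invented for this sketch.

```python
import io
import tarfile

# Build a tiny archive with one directory and one regular file.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    folder = tarfile.TarInfo(name="docs")
    folder.type = tarfile.DIRTYPE
    tar.addfile(folder)
    data = b"line one\nline two\n"
    info = tarfile.TarInfo(name="docs/notes.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    names = tar.getnames()                          # list of member names
    by_name = tar.extractfile("docs/notes.txt")     # member given as a filename
    first_line = by_name.readline()
    by_info = tar.extractfile(tar.getmember("docs/notes.txt"))  # or as a TarInfo
    all_lines = by_info.readlines()
    folder_result = tar.extractfile("docs")         # not a regular file or link

print(names, first_line, all_lines, folder_result)
```

Because a directory is neither a regular file nor a link, `extractfile()` returns None for it, so it is worth checking the result before calling `read()`.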

ThorSummoner · Answer 3 · 2021-01-30T21:48:58.427

Previously, this post showed an example of `dict(zip(...))`-ing the member names and member lists together. That is silly and causes excessive reads of the archive. To accomplish the same thing, we can use a dictionary comprehension:

index = {i.name: i for i in my_tarfile.getmembers()}

More info on how to use tarfile

Extract a tarfile member

#!/usr/bin/env python3
import tarfile

my_tarfile = tarfile.open('/path/to/mytarfile.tar')

print(my_tarfile.extractfile('./path/to/file.png').read())

Index a tar file

#!/usr/bin/env python3
import tarfile
import pprint

my_tarfile = tarfile.open('/path/to/mytarfile.tar')

index = my_tarfile.getnames()  # a list of strings, each member's name
# or
# index = {i.name: i for i in my_tarfile.getmembers()}

pprint.pprint(index)

Index, read, and dynamically extract from a tar file

#!/usr/bin/env python3

import tarfile
import base64
import textwrap
import random

# note, indexing a tar file requires reading it completely once
# if we want to do anything after indexing it, it must be a file
# that can be seeked (not a stream), so here we open a file we
# can seek
my_tarfile = tarfile.open('/path/to/mytar.tar')


# tarfile.getmembers is similar to os.stat kind of, it will
# give you the member names (i.name) as well as TarInfo attributes:
#
# chksum,devmajor,devminor,gid,gname,linkname,linkpath,
# mode,mtime,name,offset,offset_data,path,pax_headers,
# size,sparse,tarfile,type,uid,uname
#
# here we use a dictionary comprehension to index all TarInfo
# members by the member name
index = {i.name: i for i in my_tarfile.getmembers()}

print(index.keys())

# pick your member
# note: if you can pick your member before indexing the tar file,
# you don't need to index it to read that file, you can directly
# my_tarfile.extractfile(name)
# or my_tarfile.getmember(name)

# pick your filename from the index dynamically
# (random.choice needs a sequence, so turn the dict keys into a list)
my_file_name = random.choice(list(index))

my_file_tarinfo = index[my_file_name]
my_file_size = my_file_tarinfo.size
my_file_buf = my_tarfile.extractfile( 
    my_file_name
    # or my_file_tarinfo
)

print('file_name: {}'.format(my_file_name))
print('file_size: {}'.format(my_file_size))
print('----- BEGIN FILE BASE64 -----')
print(
    textwrap.fill(
        base64.b64encode(
            my_file_buf.read()
        ).decode(),
        72
    )
)
print('----- END FILE BASE64 -----')

tarfile with duplicate members

In the case that we have a tar that was created strangely, in this example by appending many versions of the same file to the same tar archive, we can still work with it carefully. I've annotated which members contain what text; let's say we want the fourth (index 3) member, "capturetheflag\n".

tar -tf mybadtar.tar 
mymember.txt  # "version 1\n"
mymember.txt  # "version 1\n"
mymember.txt  # "version 2\n"
mymember.txt  # "capturetheflag\n"
mymember.txt  # "version 3\n"

#!/usr/bin/env python3

import tarfile
my_tarfile = tarfile.open('mybadtar.tar')

# >>> my_tarfile.getnames()
# ['mymember.txt', 'mymember.txt', 'mymember.txt', 'mymember.txt', 'mymember.txt']

# if we use extractfile on a name, we get the last entry: tarfile looks the
# name up with getmember(), which treats the last occurrence of a duplicated
# name as the most up-to-date version

# >>> my_tarfile.extractfile('mymember.txt').read()
# b'version 3\n'

# >>> my_tarfile.extractfile(my_tarfile.getmembers()[3]).read()
# b'capturetheflag\n'
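The behaviour described above can be reproduced without a shell by building a similar archive in memory. This is a sketch; the archive is a stand-in for mybadtar.tar and is assumed to contain the same five payloads.

```python
import io
import tarfile

versions = [
    b"version 1\n",
    b"version 1\n",
    b"version 2\n",
    b"capturetheflag\n",
    b"version 3\n",
]

# Append five members that all share the same name.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for data in versions:
        info = tarfile.TarInfo(name="mymember.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    members = tar.getmembers()                        # all five TarInfo objects
    last = tar.extractfile("mymember.txt").read()     # lookup by name: last entry
    fourth = tar.extractfile(members[3]).read()       # lookup by position

print(len(members), last, fourth)
```

Passing the TarInfo object instead of the name is what makes it possible to reach an earlier duplicate.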

Alternatively, we can iterate over the tar file:

#!/usr/bin/env python3

import tarfile
my_tarfile = tarfile.open('mybadtar.tar')
# note, if we do anything to the tarfile object that will
# cause a full read, the tarfile.next() method will return None,
# so call next in a loop as the first thing you do if you want to
# iterate

while True:
    my_member = my_tarfile.next()
    if not my_member:
        break
    print((my_member.offset, my_tarfile.extractfile(my_member).read()))

# (0, b'version 1\n')
# (1024, b'version 1\n')
# (2048, b'version 2\n')
# (3072, b'capturetheflag\n')
# (4096, b'version 3\n')

@KIC in my example above we have to read the file twice: once to index it (list all the files it contains) and a second time to extract the file we want by name. This is a consequence/feature of how tars are structured. If you need to extract a file from a tar that you can only read once (like from a stream), then you must know the file name in advance. If you know the member name and only want to extract that one member, you can extract it on the first pass by using `myArchive.extractfile('my/member/name.png')` directly – ThorSummoner, Jan 30 '21 at 21:06
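One way to sketch that single pass over a stream that cannot seek is to open the archive in `'r|'` mode, iterate, and read the wanted member as soon as the loop reaches it. The helper names `make_stream` and `read_member_from_stream` and the sample data are hypothetical. Looking a name up directly may need to scan the headers first, so on a true stream this version reads while iterating instead.

```python
import io
import tarfile


def make_stream():
    """Build a sample archive in memory (stand-in for a real stream)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in [
            ("first.txt", b"skip me\n"),
            ("my/member/name.png", b"\x89PNG fake bytes"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def read_member_from_stream(fileobj, wanted_name):
    """Return the bytes of the first regular file named `wanted_name`, or None."""
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if member.name == wanted_name and member.isfile():
                # Read now: a stream cannot be rewound to this member later.
                return tar.extractfile(member).read()
    return None


found = read_member_from_stream(make_stream(), "my/member/name.png")
missing = read_member_from_stream(make_stream(), "no/such/file")
print(found, missing)
```

Note that with duplicate names this returns the first occurrence, unlike a lookup by name on a seekable archive.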

score 0 · Answer 4 · answered Apr 10 '20 at 10:12

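You can use TarFile.list(), for example:

import tarfile

filename = "abc.tar.bz2"
with tarfile.open(filename, mode='r:bz2') as f1:
    f1.list()  # prints a table of contents to stdout and returns None

list() only prints the listing. If you want to manipulate the names or write them to a file, use getnames() or getmembers() instead and do whatever your requirement needs.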

answered Apr 10 '20 at 10:12

ChandraShekhar Mahto


Chamara · Answer 5 · 2023-06-05T21:58:12.547

0

import tarfile

targzfile = "path to the file"

tar = tarfile.open(targzfile)

for item in tar.getnames():
    if "README.txt" in item:
        file_content = tar.extractfile(item).read()
        fileout = open("output file path", 'wb')
        fileout.write(file_content)
        fileout.close()
        break

edited Jun 05 '23 at 21:58

answered Jun 05 '23 at 21:53

Chamara


Thank you for contributing to the Stack Overflow community. This may be a correct answer, but it’d be really useful to provide additional explanation of your code so developers can understand your reasoning. This is especially useful for new developers who aren’t as familiar with the syntax or struggling to understand the concepts. **Would you kindly [edit] your answer to include additional details for the benefit of the community?** – Jeremy Caney Jun 06 '23 at 00:22
