1

I have seen this question but I need something else.

My tar file contains a very large number of text files (hundreds of thousands) organized by variable name. Something like

filename/maxvalue/IDXstation.txt     (with X that goes from 100000 to 200000)
filename/minvalue/IDXstation.txt  
filename/meanvalue/IDXstation.txt 

and so on. The problem is that I don't have a readme.txt file that tells me how many folders are in the tar file or how they are named (I made up the names above), or how many stations are in each folder. For now, all I care about is reading the structure of filename.tar.gz and printing something like

filename/maxvalue/  
filename/minvalue/  
filename/meanvalue/

I need to read its structure before I start extracting files, because I am interested in only some of the folders, not all of them.

If I use

for tarinfo in tar:
    print tarinfo.name

it will print all the files, and there are hundreds of thousands of them. I don't want that, but I am not sure how to set it up.

Community
claude
  • Do you want to print all directories names in the archive? What folders you are interested in? – jfs Feb 06 '15 at 21:15
  • Yes, the directories names up to the second level (makes sense?) filename/variablename/ – claude Feb 06 '15 at 21:29
  • If it's just about finding the structure, I suggest you use standard command line tools. In any case, you need to unzip the data stream; there is no way around that. After doing this, the `tar` command provides plenty of options to have a "peek" into the archive. – Dr. Jan-Philip Gehrcke Feb 06 '15 at 21:34
  • thanks - that seems reasonable I hadn't thought about it. – claude Feb 06 '15 at 21:48
  • @chiara what did you mean by "meanvalue" in your example? Is it just some name located in the middle of a long list of names? – artemdevel Feb 06 '15 at 21:49
  • no the meanvalue was just a made-up name for the folder. – claude Feb 07 '15 at 00:34

2 Answers

2

The Wikipedia page on tar says that to list the names of the files in an archive, one must read through the entire archive and look for the places where files start. So you will have to decompress the data stream to get the file names. One simple way to print only the expected names is to use a regex that keeps only the relevant directory names. If you are sure that the directories themselves are registered in the tar file, something like this should be enough:

import re

# Matches exactly two path components, e.g. "filename/maxvalue" or "filename/maxvalue/"
rx = re.compile(r'[^/]+/[^/]+/?$')
...
for tarinfo in tar:
    if rx.match(tarinfo.name):
        print(tarinfo.name)

If you are not sure that the expected directories are registered in the tar file, you can use a less strict match and put the directory part in a set. Something like:

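import re

# Captures the first two path components of any deeper member name
rx = re.compile(r'([^/]+/[^/]+)/')
...
names = set()
for tarinfo in tar:
    m = rx.match(tarinfo.name)
    if m:
        # Keep only the directory part, not the full file name
        names.add(m.group(1))
for name in names:
    print(name)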
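
The set-based approach above can also be written without a regex, by splitting each member name on `/`. This is only a minimal sketch, assuming member names look like `filename/maxvalue/ID100000station.txt` with no leading `./`; the helper names `second_level_dirs` and `list_second_level_dirs` are made up for illustration. A second-level directory that contains no files would not be reported.

```python
import tarfile


def second_level_dirs(names):
    """Return the sorted 'top/second/' prefixes found in an iterable of names."""
    dirs = set()
    for name in names:
        parts = name.strip('/').split('/')
        # Only names with at least three components show that
        # 'top/second' is a directory rather than a plain file.
        if len(parts) >= 3:
            dirs.add('/'.join(parts[:2]) + '/')
    return sorted(dirs)


def list_second_level_dirs(archive_path):
    """Read through the archive once and return its second-level directories."""
    with tarfile.open(archive_path) as tar:
        return second_level_dirs(member.name for member in tar)
```

The whole archive is still read, but only one short string per directory is kept in memory.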
Serge Ballesta
1

To print the top-level directories in the tar archive, e.g. up to the second level:

#!/usr/bin/env python
import sys
import tarfile

with tarfile.open(sys.argv[1]) as archive:
    for member in archive:
        if member.isdir() and member.name.count('/') < 2:
            print(member.name)

Usage:

$ print-top-level-dirs <tar-archive>
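
Once the directory names are known, only the folders of interest need to be extracted. Here is a rough sketch, assuming the wanted folder is `filename/maxvalue/` (a made-up name from the question); the helpers `wanted_members` and `extract_dirs` are hypothetical. `TarFile.extractall` accepts a `members` iterable, so the archive is still read through, but only the matching files are written to disk.

```python
import tarfile


def wanted_members(archive, prefixes):
    """Yield only the members whose name starts with one of the prefixes."""
    for member in archive:
        # str.startswith accepts a tuple of candidate prefixes
        if member.name.startswith(prefixes):
            yield member


def extract_dirs(archive_path, prefixes, dest='.'):
    """Extract the members under the given directory prefixes into dest.

    prefixes is an iterable of strings such as ['filename/maxvalue/'].
    """
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=dest,
                           members=wanted_members(archive, tuple(prefixes)))
```

For example, `extract_dirs('filename.tar.gz', ['filename/maxvalue/'])` would write only the files under that folder into the current directory. Only extract archives you trust, since member names can point outside the destination.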
jfs