
Via an SSH server I have access to a data set. The data set is split into several files named File1.xml.gz, File2.xml.gz, and so on. The naming of these files is a bit misleading in two ways:

  1. Since each file contains a folder, I assume that it is strictly speaking a .tar.gz file, but this is not obvious from the name (it only says .gz).

  2. When you unzip them, you don't get File1.xml etc. directly. Each archive contains a single top-level folder (and nothing else), which contains a second subfolder (and nothing else), which contains a third (and nothing else), which finally contains a fourth subfolder holding File1.xml (and nothing else).

    I have sketched this in a picture of the folder structure:

    [Image: visualization of the folder structure]

    It is exactly this file in the lowest level that I want to access.

My problem: I am not allowed to delete the (apparently superfluous) folders. There is also hardly any space left on the server, and the files are extremely large, so I can't just unpack them. Therefore I want to read the contents of the files line by line.

I think I know how to find a file that is embedded in several subfolders:

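import os

directory = '.'  # placeholder: the folder to search
for root, dirs, files in os.walk(directory, topdown=False):
    for file in files:
        # compare case-insensitively, since the files are named File1.xml, File2.xml, ...
        if file.lower().startswith('file') and file.endswith('.xml'):
            print(os.path.join(root, file))  # do something with file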

And I know how to read a zipped file without explicitly unzipping it:

import gzip

with gzip.open('path to file1.xml.gz', 'rt', encoding='utf-8') as file1:
    for line in file1:
        print(line, end='')  # each line already ends with a newline

But accessing a file that's in the sub-sub-sub-folder of a zipped folder? Is that possible?

martineau
Gjanetta
  • You probably want the tarfile module (which can transparently support the gzipping). I'm not familiar enough with either of those things to know whether it needs to entirely decompress in memory to accommodate those operations, so hopefully somebody else can chime in with a full answer. – Hans Musgrave Jul 24 '20 at 00:03
  • I think your question is similar to [reading tar file contents without untarring it, in python script](https://stackoverflow.com/questions/2018512/reading-tar-file-contents-without-untarring-it-in-python-script) – 정도유 Jul 24 '20 at 01:55
  • Thanks for the hint, @정도유. Yeah, it seems similar. But with my level of knowledge, I can't apply the solutions there to my problem. – Gjanetta Jul 24 '20 at 09:38
  • Thank you for editing my question @martineau – Gjanetta Jul 24 '20 at 09:44

1 Answer


Use tarfile, opening with mode "r|gz". Use next() until you get to what you want, then extractfile() on that member to return a buffered stream you can read from.

>>> import tarfile
>>> t = tarfile.open("file.gz","r|gz")
>>> t.next()
<TarInfo 'a' at 0x1044d3b38>
>>> t.next()
<TarInfo 'a/b' at 0x1044d39a8>
>>> t.next()
<TarInfo 'a/b/c' at 0x1044d38e0>
>>> t.next()
<TarInfo 'a/b/c/d' at 0x1044d3a70>
>>> m = t.next()
>>> m.name
'a/b/c/d/file'
>>> f = t.extractfile(m)
>>> f.readline()
b'this\n'
>>> f.readline()
b'is\n'
>>> f.readline()
b'a\n'
>>> f.readline()
b'test\n'
>>> f.readline()
b''
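
The same approach can be written as a script rather than an interactive session. This is a minimal sketch under stated assumptions: the helper name `iter_xml_lines`, the `.xml` suffix check, and the UTF-8 encoding are illustrative choices, not something fixed by the data set. It iterates over the archive (equivalent to calling `next()` repeatedly), skips directory entries, and yields the lines of the first matching file without unpacking anything to disk.

```python
import tarfile


def iter_xml_lines(archive_path, suffix=".xml", encoding="utf-8"):
    """Yield decoded lines of the first regular file whose name ends with suffix.

    The archive is opened as a forward-only stream ("r|gz"), so nothing is
    written to disk and no backward seek is required.
    """
    with tarfile.open(archive_path, mode="r|gz") as archive:
        # Iterating over a TarFile visits the members in archive order.
        for member in archive:
            # Skip directories, links, and files with a different suffix.
            if not (member.isfile() and member.name.endswith(suffix)):
                continue
            stream = archive.extractfile(member)  # file-like object yielding bytes
            for raw_line in stream:
                yield raw_line.decode(encoding)
            return  # stop after the first matching file
```

Usage would look like `for line in iter_xml_lines("File1.xml.gz"): ...`. Because the member is read as soon as it is found, the stream never has to move backwards.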
Mark Adler
  • Thank you very much for your suggestion, I just tried it. But it looks like I have done something wrong: I keep getting the error `tarfile.StreamError: seeking backwards is not allowed` at the third `>>> t.next()`. – Gjanetta Jul 24 '20 at 09:30
  • @Gjanetta: Try using `'r:gz'` instead (just a guess). – martineau Jul 24 '20 at 10:21
  • Thanks @martineau. That's just the way I used to know it (`'r:gz'` instead of `'r|gz'`), too. But that | symbol seems to have a point: If I replace that, I get an `"AttributeError: 'NoneType' object has no attribute 'isreg'"` – Gjanetta Jul 24 '20 at 11:10
  • Perhaps the file and folder structure is not what you think. Try using `t.getmembers()` right after opening the file and check its contents. – martineau Jul 24 '20 at 12:30
  • You do not have to and you do not want to seek backwards. Just use `next()` until you get to the entry you want. A tar file is a flat, sequential structure. You must be doing something beyond what is in my example that is causing it to go forwards more than necessary. – Mark Adler Jul 24 '20 at 16:03
  • @martineau I read the Python docs for tarfile (https://docs.python.org/3/library/tarfile.html) and found the difference between `'r:gz'` and `'r|gz'`. `'r:gz'`: "Open for reading with gzip compression." `'r|gz'`: "Open a gzip compressed stream for reading." According to this, `'r|gz'` is certainly right for my purposes. – Gjanetta Jul 24 '20 at 18:09
  • @MarkAdler It's not like I actively initiated any backward seek; rather, this error message came up when I included your suggestion in my script. According to [this answer](https://stackoverflow.com/a/18624269/12899648) to someone else's question, it seems that I had already read through the entire file at some point in my code and was then trying to read it again (without closing and reopening the tar archive first). – Gjanetta Jul 24 '20 at 18:16
  • I think the repeated `t.next()` is what is "reading" through the tar archive. When I executed `t.getmembers()` I got exactly one element: `[]` - **folders are obviously no `members`!** I looked up what `TarFile.next()` does: it returns the next member of the archive as a TarInfo object when the TarFile is opened for reading, and returns None if there are no more available. – Gjanetta Jul 24 '20 at 18:17
  • And I saw in the debugger that `t.next()` indeed returned `None` for all of the empty folders. **The first call of `t.next()` already returns the file I want to access.** As it is the only file in the folder structure, any subsequent call of `t.next()` doesn't make sense. Together with these findings, @MarkAdler's proposal is my solution. I have reduced `t.next()` to one call. Thank you! – Gjanetta Jul 24 '20 at 18:18
  • What I _said_ is to use `next()` until you get to what you want. – Mark Adler Jul 24 '20 at 20:07
  • That's right. It's just that your example looks to me like `next()` also returns `TarInfo objects` *for folders* (a, b, c, d in the example). But I can't reproduce that. With `next()` I only get back `TarInfo objects` *for files*. And that's why I added the comment above - I thought it might be helpful for future readers of this question, who might be as inexperienced as I am. I'm sorry if I misunderstood anything. Your answer has helped a good deal, thanks! – Gjanetta Jul 24 '20 at 21:23
  • Tar files very commonly have entries for directories as well as files, symbolic links, and sometimes other more esoteric objects, as shown by my example, which was made by tar with default options. So to make your code general you should indeed loop on `next()` until you get a name that matches what you are looking for, as opposed to assuming that it is always the first entry. – Mark Adler Jul 24 '20 at 21:54
  • It's strange that tar behaves differently with me. But thanks for this tip - this way my code will be more robust and still does what I want. I marked your answer as accepted. – Gjanetta Jul 26 '20 at 15:14
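
Whether a given archive contains directory entries, as discussed in the comments above, can be checked without unpacking it. The following is a minimal sketch; the helper name `list_members` and the three kind labels are illustrative assumptions. It uses only `tarfile.open`, `TarInfo.isdir()`, and `TarInfo.isfile()`.

```python
import tarfile


def list_members(archive_path):
    """Return (name, kind) pairs for every entry in a gzip-compressed tar archive."""
    entries = []
    # "r:gz" permits random access; a single forward pass like this one
    # would also work with the stream mode "r|gz".
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if member.isdir():
                kind = "directory"
            elif member.isfile():
                kind = "file"
            else:
                kind = "other"  # symbolic links, devices, and so on
            entries.append((member.name, kind))
    return entries
```

An archive created with directory entries would list them before the file, while an archive that stores only the file would produce a single pair. That would account for the different behavior the two commenters observed.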