Via an SSH server I have access to a data set. This data set is divided into several files, named File1.xml.gz, File2.xml.gz, etc. The naming of these files is a bit misleading in two ways:
1. Since each one is a folder, I assume that it is strictly speaking a .tar.gz file, but this is not obvious from the name (it only says .gz).
2. When you unzip them, you don't get File1.xml etc. directly. Instead, each archive contains a first (sub)folder (and nothing else), which in turn contains a second subfolder (and nothing else), which contains a third subfolder (and nothing else), which finally contains the fourth subfolder, in which File1.xml (and nothing else) is located.

I have sketched this in a picture of the folder structure:
It is exactly this file in the lowest level that I want to access.
My problem: I am not allowed to delete the (apparently superfluous) folders, there is hardly any space left on the server, and the files are extremely large, so I can't just unpack them. Therefore I want to read the contents of the files line by line.
I think I know how to find a file that is embedded in several subfolders:
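import os

for root, dirs, files in os.walk(directory, topdown=False):
    for file in files:
        # Compare case-insensitively, since the files are named File1.xml etc.
        if file.lower().startswith('file') and file.endswith('.xml'):
            print(os.path.join(root, file))  # do something with file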
And I know how to read a zipped file without explicitly unzipping it:
import gzip

with gzip.open('path to file1.xml.gz', 'rt', encoding='utf-8') as file1:
    for line in file1:
        print(line, end='')  # each line already ends with a newline
But accessing a file that's in the sub-sub-sub-folder of a zipped folder? Is that possible?
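To make concrete what I am hoping for, here is a rough sketch of the kind of thing I imagine. It assumes that my guess is right and the files really are gzip-compressed tar archives, and that the standard-library tarfile module can stream them. The function name iter_xml_lines is just something I made up for illustration.

```python
import tarfile


def iter_xml_lines(archive_path, encoding="utf-8"):
    """Yield the lines of every .xml member in a gzip-compressed tar archive."""
    # Mode "r|gz" reads the archive as a forward-only stream, so nothing is
    # unpacked to disk and the archive is never loaded into memory as a whole.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            # Skip the nested directories; only regular .xml files are wanted.
            if not (member.isfile() and member.name.endswith(".xml")):
                continue
            # extractfile() returns a binary file-like object for this member.
            extracted = archive.extractfile(member)
            for raw_line in extracted:
                yield raw_line.decode(encoding)
```

If this works the way I think it does, I could then write `for line in iter_xml_lines('path to file1.xml.gz'): print(line, end='')` and never touch the subfolders at all. If the files turn out to be plain gzip after all, tarfile.open should raise tarfile.ReadError, which would at least settle that question.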