Extract specific file extensions from multiple 7-zip files

Question

I have a RAR file and a ZIP file. Within these two there is a folder. Inside the folder there are several 7-zip (.7z) files. Inside every 7z there are multiple files with the same extension, but whose names vary.

RAR or ZIP file
  |___folder
        |_____Multiple 7z
                  |_____Multiple files with same extension and different name

I want to extract just the ones I need from thousands of files... I need those files whose names include a certain substring. For example, if the name of a compressed file includes '[!]' in the name or '(U)' or '(J)' that's the criteria to determine the file to be extracted.

I can extract the folder without problem so I have this structure:

folder
   |_____Multiple 7z
                |_____Multiple files with same extension and different name

I'm in a Windows environment but I have Cygwin installed. How can I extract the files I need painlessly? Maybe with a single command line.

Update

There are some improvements to the question:

The inner 7z files and their respective files inside them can have spaces in their names.
There are 7z files with just one file inside of them that doesn't meet the given criteria. Thus, being the only possible file, they have to be extracted too.

Solution

Thanks to everyone. The bash solution was the one that helped me out. I wasn't able to test Python3 solutions because I had problems trying to install libraries using pip. I don't use Python so I'll have to study and overcome the errors I face with these solutions. For now, I've found a suitable answer. Thanks to everyone.

"I want to extract just the ones I need ..." How do you determine the ones you need? — Tony, Jan 26 '17 at 18:57
@Tony My bad... I've updated the question with the criteria. Basically a substring in the name of the compressed file. Thanks for your interest. — Metafaniel, Jan 26 '17 at 19:09

Borys Serebrov · Accepted Answer · 2017-11-14T11:00:37.463

This solution is based on bash, grep and awk; it works on Cygwin and on Ubuntu.

Since you have the requirement to search for (X) [!].ext files first and if there are no such files then look for (X).ext files, I don't think it is possible to write some single expression to handle this logic.

The solution should have some if/else conditional logic to test the list of files inside the archive and decide which files to extract.
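To make the decision order concrete, here is a minimal Python sketch of the same three-step priority. It is only an illustration, not part of the bash script; the function name select_files is hypothetical, and the patterns are an assumption that mirrors the criteria described in the question.

```python
import re

def select_files(names):
    """Pick the names to extract from one archive's file list."""
    # 1. Prefer files tagged with the literal substring "[!]".
    verified = [n for n in names if "[!]" in n]
    if verified:
        return verified
    # 2. Otherwise take files ending in a one-letter code, e.g. "(J).txt".
    plain = [n for n in names if re.search(r"\([A-Z]\)\.", n)]
    if plain:
        return plain
    # 3. Otherwise, a lone file is extracted even if it matches nothing.
    return list(names) if len(names) == 1 else []

print(select_files(["(J) [b1].txt", "(J).txt"]))
```

Each step runs only when the previous one found nothing, which is why a single wildcard expression cannot express it.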

Here is the initial structure inside the zip/rar archive I tested my script on (I made a script to prepare this structure):

folder
├── 7z_1.7z
│   ├── (E).txt
│   ├── (J) [!].txt
│   ├── (J).txt
│   ├── (U) [!].txt
│   └── (U).txt
├── 7z_2.7z
│   ├── (J) [b1].txt
│   ├── (J) [b2].txt
│   ├── (J) [o1].txt
│   └── (J).txt
├── 7z_3.7z
│   ├── (E) [!].txt
│   ├── (J).txt
│   └── (U).txt
└── 7z 4.7z
    └── test.txt

The output is this:

output
├── 7z_1.7z           # This is a folder, not an archive
│   ├── (J) [!].txt   # Here we extracted only files with [!]
│   └── (U) [!].txt
├── 7z_2.7z
│   └── (J).txt       # Here there are no [!] files, so we extracted (J)
├── 7z_3.7z
│   └── (E) [!].txt   # We had here both [!] and (J), extracted only file with [!]
└── 7z 4.7z
    └── test.txt      # We had only one file here, extracted it

And this is the script to do the extraction:

#!/bin/bash

# Remove the output (if it's left from previous runs).
rm -rf output
mkdir -p output

# Unzip the zip archive.
unzip data.zip -d output
# For rar use
#  unrar x data.rar output
# OR
#  7z x -ooutput data.rar

for archive in output/folder/*.7z
do
  # See https://stackoverflow.com/questions/7148604
  # Get the list of file names, remove the extra output of "7z l"
  list=$(7z l "$archive" | awk '
      /----/ {p = ++p % 2; next}
      $NF == "Name" {pos = index($0,"Name")}
      p {print substr($0,pos)}
  ')
  # Get the list of files with [!] (-F matches the literal string,
  # not a bracket expression).
  extract_list=$(echo "$list" | grep -F "[!]")
  if [[ -z $extract_list ]]; then
    # If we don't have files with [!], then look for ([A-Z]) pattern
    # to get files with single letter in brackets.
    extract_list=$(echo "$list" | grep "([A-Z])\.")
  fi
  if [[ -z $extract_list ]]; then
    # If we only have one file - extract it.
    # $list is a plain string, so count its non-empty lines.
    if [[ $(grep -c . <<< "$list") -eq 1 ]]; then
      extract_list=$list
    fi
  fi
  if [[ -n $extract_list ]]; then
    # If we have files to extract, then do the extraction.
    # Output path is output/7zip_archive_name/
    out_path=output/$(basename "$archive")
    mkdir -p "$out_path"
    echo "$extract_list" | xargs -I {} 7z x -o"$out_path" "$archive" {}
  fi
done

The basic idea here is to go over 7zip archives and get the list of files for each of them using 7z l command (list of files).

The output of the command is quite verbose, so we use awk to clean it up and get the list of file names.

After that we filter this list using grep to get either a list of [!] files or a list of (X) files. Then we just pass this list to 7zip to extract the files we need.
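For readers who find the awk part hard to follow, here is a rough Python equivalent of the listing cleanup. It is a sketch under assumptions: the function name names_from_listing and the ROW layout are hypothetical, and the sample text only imitates the general shape of "7z l" output (a header ending in "Name", two dashed separator lines, then a summary line).

```python
# Hypothetical column layout used only to build aligned sample text.
ROW = "{:<19} {:<5} {:>12} {:>12}  {}"

def names_from_listing(listing):
    """Return the file names found between the two dashed separator lines."""
    names, inside, pos = [], False, None
    for line in listing.splitlines():
        if "----" in line:
            # Each separator toggles whether we are inside the file table.
            inside = not inside
            continue
        fields = line.split()
        if fields and fields[-1] == "Name":
            # Remember the column where the Name header starts.
            pos = line.index("Name")
        elif inside and pos is not None:
            # Everything from that column on is the name, spaces included.
            names.append(line[pos:])
    return names

dashes = ROW.format("-" * 19, "-" * 5, "-" * 12, "-" * 12, "-" * 24)
sample = "\n".join([
    "Listing archive: demo.7z",
    "",
    ROW.format("   Date      Time", "Attr", "Size", "Compressed", "Name"),
    dashes,
    ROW.format("2017-11-14 09:00:00", "....A", 5, 20, "(J) [!].txt"),
    ROW.format("2017-11-14 09:00:00", "....A", 5, "", "(J).txt"),
    dashes,
    ROW.format("2017-11-14 09:00:00", "", 10, 20, "2 files"),
])
print(names_from_listing(sample))
```

Slicing from the header column, rather than splitting on whitespace, is what keeps names with spaces intact.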

HI! I liked your approach. I tested the code making the necessary changes to the file name and path. In my real life file there are some issues: it seems to be your script doesn't take into consideration files with spaces in the file name, as all the output thrown is from spaceless file names. Also there are cases where the 7z file has just one file inside it that doesn't meet the criteria. Then, this file has to be extracted as it's the only one available... If you could help me improve the answer, that'd be awesome. I'll update the question with this too. Thanks for helping me out. — Metafaniel, Nov 13 '17 at 23:32
A side note: About `cygwin` it seems to be `unrar` is unavailable. I read `7z` works fine with RAR files. I made the necessary changes to the script, yet a `System ERROR: Unknown error -2147024872` appears. I have no time to debug it now to understand why. — Metafaniel, Nov 13 '17 at 23:40
I added the case to handle single file archives and also found and fixed a problem with spaces in archive names. Did you also have problems with spaces in file names (I've tested it on files with spaces, see my examples of input/output). I also updated the [scripts on github](https://github.com/serebrov/so-questions/tree/master/bash_extract). — Borys Serebrov, Nov 14 '17 at 07:24
If I am able to access a windows machine soon, I'll try to test it on cygwin too. — Borys Serebrov, Nov 14 '17 at 07:31
I've just tested it and it works on Cygwin too, to extract rar archive use `7z x -ooutput data.rar`. — Borys Serebrov, Nov 14 '17 at 11:01
THANKS your solution has already solved my problem. Greetings and good day! — Metafaniel, Nov 14 '17 at 18:26

score 1 · Answer 2 · answered Nov 09 '17 at 08:36


What about using this command line :

7z -e c:\myDir\*.7z -oc:\outDir "*(U)*.ext" "*(J)*.ext" "*[!]*.ext" -y

Where :

myDir is your unzip folder
outDir is your output directory
ext is your file extension

The -y option is for forcing overwriting in case you have the same filename in different archives.

Answered by jBravo.

Thanks Johnny. The basic idea works. I tested this in Cygwin, the only difference is I need to remove `-` from `-e` for it to work. However I would have liked a way to implement the logic about `[!]` priority over other code combinations. I got many more results than expected this way. Maybe a REGEX is needed to narrow the results? Thanks for your interest in helping me! – Metafaniel Nov 13 '17 at 22:08

Michał Zaborowski · Answer 3 · 2017-11-15T11:19:11.603

This is the final version after several attempts. The previous one was not useful, so I'm replacing it instead of appending. Read to the end, since not every step may be needed for the final solution.

On to the topic. I would use Python. For a one-time task it can be overkill, but in any other case it pays off: you can log every step for later investigation, use regular expressions, and orchestrate other commands by feeding them input and processing their output. All of that is quite easy in Python, provided you have it installed.

Below I describe how I configured the environment. Not every step is mandatory, but I tried several installs along the way, and a description of the process may be useful in itself.

I have MinGW, the 32-bit version. It is not required for extracting 7zip archives, though. Once it is installed, go to C:\MinGW\bin and run mingw-get.exe:

In Basic Setup I have msys-base installed (right click, mark for installation, then Apply Changes from the Installation menu). That gives me bash, sed, grep, and many more.
In All Packages there is mingw32-libarchive with dll as its class. Since the Python libarchive package is just a wrapper, you need this dll to have an actual binary to wrap.

Examples are for Python 3. I'm using the 32-bit version. You can fetch it from the Python home page. I installed it in the default directory, which is a strange location, so my advice is to install it in the root of your disk, as with mingw.

Another tip: conemu is much better than the default console.

Installing packages in Python is done with pip. From your console go to the Python home directory, which has a Scripts subdirectory. For me it is: c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\Scripts. You can search with, for instance, pip search archive, and install with pip install libarchive-c:

> pip.exe install libarchive-c
Collecting libarchive-c
  Downloading libarchive_c-2.7-py2.py3-none-any.whl
Installing collected packages: libarchive-c
Successfully installed libarchive-c-2.7

After cd .., start python and try to import the new library:

>>> import libarchive
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 27, in <module>
    libarchive = ctypes.cdll.LoadLibrary(libarchive_path)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 426, in LoadLibrary
   return self._dlltype(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None

So it fails. I tried to fix that, but then it failed in a different way:

>>> import libarchive
read format "cab" is not supported
read format "7zip" is not supported
read format "rar" is not supported
read format "lha" is not supported
read filter "uu" is not supported
read filter "lzop" is not supported
read filter "grzip" is not supported
read filter "bzip2" is not supported
read filter "rpm" is not supported
read filter "xz" is not supported
read filter "none" is not supported
read filter "compress" is not supported
read filter "all" is not supported
read filter "lzma" is not supported
read filter "lzip" is not supported
read filter "lrzip" is not supported
read filter "gzip" is not supported
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 167, in <module>
    c_int, check_int)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 92, in ffi
    f = getattr(libarchive, 'archive_'+name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 361, in __getattr__
    func = self.__getitem__(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 366, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: function 'archive_read_open_filename_w' not found

I tried providing the information directly with the set command, but that failed too... So I moved on to pylzma, for which mingw is not needed. pip install failed:

> pip.exe install pylzma
Collecting pylzma
  Downloading pylzma-0.4.9.tar.gz (115kB)
    100% |--------------------------------| 122kB 1.3MB/s
Installing collected packages: pylzma
  Running setup.py install for pylzma ... error
    Complete output from command c:\users\texxas\appdata\local\programs\python\python36-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\texxas\\AppData\\Local\\Temp\\pip-build-99t_zgmz\\pylzma\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\texxas\AppData\Local\Temp\pip-ffe3nbwk-record\install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build\lib.win32-3.6
    copying py7zlib.py -> build\lib.win32-3.6
    running build_ext
    adding support for multithreaded compression
    building 'pylzma' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

That failed again, but this one is easy: I installed Visual Studio Build Tools 2015, and then it worked. I have 7-Zip installed, so I created a sample archive. Finally I can start python and do:

from py7zlib import Archive7z
f = open(r"C:\Users\texxas\Desktop\try.7z", 'rb')
a = Archive7z(f)
a.filenames

And I got an empty list. Looking closer gives a better understanding: pylzma does not consider empty files, just so you are aware of that. After putting one character into each of my sample files, the last line gives:

>>> a.filenames
['try/a/test.txt', 'try/a/test1.txt', 'try/a/test2.txt', 'try/a/test3.txt', 'try/a/test4.txt', 'try/a/test5.txt', 'try/a/test6.txt', 'try/a/test7.txt', 'try/b/test.txt', 'try/b/test1.txt', 'try/b/test2.txt', 'try/b/test3.txt', 'try/b/test4.txt', 'try/b/test5.txt', 'try/b/test6.txt', 'try/b/test7.txt', 'try/c/test.txt', 'try/c/test1.txt', 'try/c/test11.txt', 'try/c/test2.txt', 'try/c/test3.txt', 'try/c/test4.txt', 'try/c/test5.txt', 'try/c/test6.txt', 'try/c/test7.txt']

So... rest is a piece of cake. And actually that is a part of original post:

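import os
import py7zlib

DEST = './dest'

for folder, subfolders, files in os.walk('.'):
    for file in files:
        if not file.endswith('.7z'):
            continue
        # A 7z archive: open it relative to the folder being walked.
        archive_path = os.path.join(folder, file)
        try:
            with open(archive_path, 'rb') as f:
                z = py7zlib.Archive7z(f)
                for member in z.getmembers():
                    # Example filter; replace with your own criteria.
                    if member.filename.endswith('.py'):
                        target = os.path.join(DEST, member.filename)
                        os.makedirs(os.path.dirname(target), exist_ok=True)
                        # Members are read into memory and written out by hand.
                        with open(target, 'wb') as out:
                            out.write(member.read())
        except py7zlib.FormatError as e:
            print('file ' + archive_path)
            print(str(e))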

As a side note: Anaconda is a great tool, but a full install takes 500+ MB, which is way too much here.

Also let me share wmctrl.py tool, from my github:

from subprocess import getoutput  # Python 3; Python 2 had it in the commands module

# active and stored are assumed to be defined elsewhere in wmctrl.py.
cmd = 'wmctrl -ir ' + str(active.window) + \
      ' -e 0,' + str(stored.left) + ',' + str(stored.top) + ',' + str(stored.width) + ',' + str(stored.height)
print(cmd)
res = getoutput(cmd)

That way you can orchestrate different commands (here it is wmctrl) and process their output in Python.
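As a small, hedged illustration of that idea in Python 3, the sketch below runs a command without a shell and filters its output. The helper name run_and_capture is hypothetical, and sys.executable is only a stand-in for an external tool such as wmctrl or 7z so that the example runs anywhere Python does.

```python
import subprocess
import sys

def run_and_capture(args):
    """Run a command without a shell and return its stdout as a list of lines."""
    result = subprocess.run(args, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout.splitlines()

# The child process just prints two made-up file names.
lines = run_and_capture(
    [sys.executable, "-c", "print('a.txt'); print('b [!].txt')"])

# Process the captured output like any other Python data.
flagged = [line for line in lines if "[!]" in line]
print(flagged)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces or brackets.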

HI! I tried your solution in an AWS EC2 Ubuntu environment. I tried to install `pylzma` in order to use the `py7zlib` library, however I got the following error: `UnsupportedPlatformWarning: Multithreading is not supported on the platform "linux2"` I tried to do my homework and I tried some solutions shown here without success https://unix.stackexchange.com/questions/175231/how-to-install-the-pylzma-python-library-on-linux Thanks for your answer. — Metafaniel, Nov 14 '17 at 18:02
@Metafaniel I'll edit my comment... instead of Cygwin use mingw - you'll have bash.exe with all nice stuff. Download Python from their home - there is pip.exe there, so you can use it to install - from cmd.exe, or bash... Windows on my VirtualBox failed, that is why it takes so much time - sorry for that — Michał Zaborowski, Nov 14 '17 at 22:00
@Metafaniel I'll perform all steps on my VirtualBox and describe step by step. — Michał Zaborowski, Nov 14 '17 at 22:00
Thanks again for your interest. I'm sure I'll try your way once I have some spare time. I'll give you some feedback when I have the time. Thanks again — Metafaniel, Nov 15 '17 at 19:29
That would be great. I'm going to help a friend, so here I had a chance to practice :) — Michał Zaborowski, Nov 15 '17 at 19:40

nipunasudha · Answer 4 · 2017-11-14T18:43:16.663

You state in the question's bounty footer that it is OK to use Linux. Also, I don't use Windows, sorry about that. I am using Python 3, and you have to be in a Linux environment (I will try to test this on Windows as soon as I can).

Archive structure

datadir.rar
          |
          datadir/
                 |
                 zip1.7z
                 zip2.7z
                 zip3.7z
                 zip4.7z
                 zip5.7z

Extracted structure

extracted/
├── zip1
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip2
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip3
│   ├── (J) [!].txt
│   └── (U) [!].txt
└── zip5
    ├── (J).txt
    └── (U).txt

Here is how I did it.

import libarchive.public
import os, os.path
from os.path import basename
import errno
import rarfile

#========== FILE UTILS =================

#Make directories
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else: raise

#Open "path" for writing, creating any parent directories as needed.
def safe_open_w(path):
    mkdir_p(os.path.dirname(path))
    return open(path, 'wb')

#========== RAR TOOLS ==================

# List
def rar_list(rar_archive):
    with rarfile.RarFile(rar_archive) as rf:
        return rf.namelist()

# extract
def rar_extract(rar_archive, filename, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extract(filename,path)

# extract-all
def rar_extract_all(rar_archive, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extractall(path)

#========= 7ZIP TOOLS ==================

# List
def zip7_list(zip7file):
    filelist = []
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            filelist.append(entry.pathname.decode("utf-8"))
    return filelist

# extract
def zip7_extract(zip7file, filename, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if entry.pathname.decode("utf-8") == filename:
                with safe_open_w(os.path.join(path, filename)) as q:
                    for block in entry.get_blocks():
                        q.write(block)
                break

# extract-all
def zip7_extract_all(zip7file, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if os.path.isdir(entry.pathname.decode("utf-8")):
                continue
            with safe_open_w(os.path.join(path, entry.pathname.decode("utf-8"))) as q:
                for block in entry.get_blocks():
                    q.write(block)

#============ FILE FILTER =================

def exclamation_filter(filename):
    return ("[!]" in filename)

def optional_code_filter(filename):
    return not ("[" in filename)

def has_exclamation_files(filelist):
    for singlefile in filelist:
        if(exclamation_filter(singlefile)):
            return True
    return False

#============ MAIN PROGRAM ================

print("-------------------------")
print("Program Started")
print("-------------------------")

BIG_RAR = 'datadir.rar'
TEMP_DIR = 'temp'
EXTRACT_DIR = 'extracted'
newzip7filelist = []

#Extract big rar and get new file list
for zipfilepath in rar_list(BIG_RAR):
    rar_extract(BIG_RAR, zipfilepath, TEMP_DIR)
    newzip7filelist.append(os.path.join(TEMP_DIR, zipfilepath))

print("7z Files Extracted")
print("-------------------------")

for newzip7file in newzip7filelist:
    innerFiles = zip7_list(newzip7file)
    # Decide once per archive: prefer [!] files, else files without any [..] code.
    if has_exclamation_files(innerFiles):
        selectedFilter = exclamation_filter
    else:
        selectedFilter = optional_code_filter
    outputDir = os.path.join(EXTRACT_DIR, os.path.splitext(basename(newzip7file))[0])
    for singleFile in innerFiles:
        if selectedFilter(singleFile):
            print(singleFile)
            zip7_extract(newzip7file, singleFile, outputDir)

print("-------------------------")
print("Extraction Complete")
print("-------------------------")

Above the main program, I've got all the required functions ready. I didn't use all of them, but I kept them in case you need them.

I used several python libraries with python3, but you only have to install libarchive and rarfile using pip, others are built-in libraries.

And here is a copy of my source tree

Console output

This is the console output when you run this python file:

-------------------------
Program Started
-------------------------
7z Files Extracted
-------------------------
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(J).txt
(U).txt
-------------------------
Extraction Complete
-------------------------

Issues

The only issue I have faced so far is that some temporary files are generated at the program root. It doesn't affect the program in any way, but I'll try to fix that.

edit

You have to run

sudo apt-get install libarchive-dev

to install the actual libarchive program. The Python library is just a wrapper around it. Take a look at the official documentation.

@MichałZaborowski File path delimiters etc are unix specific, so this won't be working on windows. OP says it's ok to use linux, please read the question completely before down voting. And I don't use windows, so I can't test this on windows. quoting OP `maybe in command line not necessailly to be done in Windows,it can be done with Linux/bash, etc`. And hate commenting on the fellow users answers won't do any good for your answer. sorry. — nipunasudha, Nov 09 '17 at 08:06
OP wrote, he is going to use Cygwin for that. Libarchive is wrapper, and there are many of them, everyone with a bit different interface. And you are sure that using `pip` OP will install anything to python in version 3? In Cygwin? — Michał Zaborowski, Nov 09 '17 at 08:40
Yup, I originally asked the question regarding Cygwin but because of the lack of answers and my need to find an answer, it's OK for me if a Linux only answer helps. Thanks to both — Metafaniel, Nov 14 '17 at 18:22
About the answer, I did my homework and tried to install these libraries to test your answer. However I face this error: `error: libarchive.so: cannot open shared object file: No such file or directory` I ensured I'm using python3 pip and not python2 as I have both versions. I even upgraded pip version but that didn't help, so I haven't tested your idea yet. I don't use Python. Sorry about that... — Metafaniel, Nov 14 '17 at 18:24
You have to run `apt-get install libarchive-dev` to install the actual program. Python library is just a wrapper around it. Take a look at the website I linked to the `libarchive` link — nipunasudha, Nov 14 '17 at 18:37