I need to uncompress all the files in a directory, and for this I need to find the first file in each set. I'm currently doing this with a bunch of if statements and loops. Can I do this using a regex?

Here's a list of files that I need to match:

yes.rar
yes.part1.rar
yes.part01.rar
yes.part001.rar
yes.r01
yes.r001

These should NOT be matched:

no.part2.rar
no.part02.rar
no.part002.rar
no.part011.rar
no.r002
no.r02

I found a similar regex in this thread, but it seems that Python doesn't support variable-length lookarounds. A single-line regex would be complicated, but I'll document it well, so that's not a problem. It's just one of those problems you beat your head over.

Thanks in advance guys.

:)

Mridang Agarwalla
  • It only doesn't support variable length look-behinds. Look-aheads are fine. – reko_t Mar 29 '10 at 13:03
  • Looking at the filenames is a fundamentally incorrect approach to dealing with files. Filenames are just metadata; an annotation. Whenever possible, you should rely on the actual content of the files rather than their names. The actual content of the files will be correct as long as the file is not corrupt. In fact, that is the definition of corruption. For example, a first-volume file named `file.part8.rar` is not corrupt, but a filename-based approach will fail to recognize it as the first volume. – Welbog Mar 29 '10 at 13:21
  • You *have* to look at filenames to determine which files are in a RAR set in the first place; that's how RAR volumes are associated, and any other approach would require opening every file in the directory, which would be much slower in large directories. – Glenn Maynard Mar 29 '10 at 17:21
  • @Glenn: What if the files have no extensions and have randomly-assigned names? I have to deal with files like that on a regular basis and figure out what types they are by their headers. You don't always have the luxury of sane filenames. – Welbog Mar 30 '10 at 13:03
  • Then they're not split RARs, which is the file format in question. Filenames are part of the RAR file format; if you break the association between split RARs by renaming them, you'll need to define your own mechanism external to the RAR file format to reestablish it later on. WinRAR itself won't "discover" associated parts except by filenames. You'll notice that "New volume naming scheme" is even explicitly mentioned in your link (http://www.win-rar.com/index.php?id=24&kb_article_id=162), which is referring to the expected filename layout. – Glenn Maynard Mar 30 '10 at 22:14

3 Answers

Don't rely on the names of the files to determine which one is first. You're going to end up finding an edge case where you get the wrong file.

RAR's headers will tell you which file is the first one in the set, assuming the volumes were created with a somewhat recent version of RAR.

HEAD_FLAGS Bit flags:
2 bytes

0x0100 - First volume (set only by RAR 3.0 and later)

So open up each file and examine the RAR headers, looking specifically for the flag that indicates which file is the first volume. This will never fail, as long as the archive isn't corrupt.


Update: I've just confirmed this by taking a look at some spanning archives in a hex editor. The file headers are constructed exactly as the link above indicates. It's just a matter of opening the files and reading the header for that flag. The file with that flag is the first volume.
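
For anyone who wants to try this from Python, here is a minimal sketch of the header check. It assumes the older RAR 3.x/4.x layout: a 7-byte `Rar!` marker followed by the main archive header, with a 2-byte CRC, a 1-byte type (0x73) and then the 2-byte little-endian HEAD_FLAGS. Those offsets are my reading of the format description rather than something spelled out in this answer, and RAR5 archives use a different layout that this sketch does not handle. The function and constant names are made up for illustration.

```python
import struct

# Assumed layout of a pre-RAR5 archive: 7-byte marker, then the main header.
RAR4_MARKER = b"Rar!\x1a\x07\x00"
MAIN_HEAD_TYPE = 0x73        # assumed type byte of the main archive header
FIRST_VOLUME_FLAG = 0x0100   # "First volume (set only by RAR 3.0 and later)"


def has_first_volume_flag(data: bytes) -> bool:
    """Return True if the leading bytes look like a first-volume RAR header."""
    if len(data) < 14 or not data.startswith(RAR4_MARKER):
        return False
    # HEAD_CRC (2 bytes), HEAD_TYPE (1), HEAD_FLAGS (2), HEAD_SIZE (2)
    _crc, head_type, flags, _size = struct.unpack_from("<HBHH", data, 7)
    return head_type == MAIN_HEAD_TYPE and bool(flags & FIRST_VOLUME_FLAG)


def is_first_volume(path: str) -> bool:
    """Read only the first 14 bytes of the file and test the flag."""
    with open(path, "rb") as f:
        return has_first_volume_flag(f.read(14))
```

Only 14 bytes per file are read, so scanning a large directory this way should stay cheap. Archives made by versions older than RAR 3.0 will never report the flag, as noted above.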

Welbog
  • Hi Welbog. I understand what you mean. I've been working with Python for a while, but I've never had to work with something similar to this. Could I also use a similar approach for ZIP files? Do you know a good tutorial for reading file headers? Is there a library of some sort? Your solution sounds very robust. I wish my Python skills were equally robust. :( Thank you. – Mridang Agarwalla Mar 29 '10 at 13:43
  • Do you know how to read files in Python? Because that's all you have to do. I don't know any Python but I can't imagine file IO is difficult in it. – Welbog Mar 29 '10 at 13:50
  • Makes sense now. File IO is pretty easy and I've worked with it. I'll do some digging and look for something similar. I saw some examples of reading JPEG headers. I'll have a look at those to understand how it works. Thanks a ton! – Mridang Agarwalla Mar 29 '10 at 13:52
  • Hi Welbog. I ended up using your solution after all. Not in Python but in Java. Here: http://stackoverflow.com/a/11327369/304151 – Mridang Agarwalla Jul 04 '12 at 12:49
  • I came here because I wanted a Linux shell script that can identify (and unpack) RAR archives. At first I thought I would need a regex, but your answer pointed me in the right direction, thanks! If anyone ever needs that too: [here you go](http://pastebin.com/pRm95Aj0) – Gerrit-K Mar 20 '13 at 20:11

There's no need to use look-behind assertions for this. Since you start matching from the beginning of the string, you can do everything with look-aheads that you could with look-behinds. This should work:

^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$

To capture the first part of the filename as you requested, you could do this:

^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$
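
A quick way to check both patterns from Python against the lists in the question. This is only a sketch: the helper names `is_first` and `basename` are made up for illustration, and the patterns are copied unchanged from above.

```python
import re

# The two patterns from this answer, unchanged.
FIRST_VOLUME = re.compile(r"^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$")
BASENAME = re.compile(r"^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$")

should_match = ["yes.rar", "yes.part1.rar", "yes.part01.rar",
                "yes.part001.rar", "yes.r01", "yes.r001"]
should_not_match = ["no.part2.rar", "no.part02.rar", "no.part002.rar",
                    "no.part011.rar", "no.r002", "no.r02"]


def is_first(name: str) -> bool:
    """True if the first pattern accepts the filename."""
    return FIRST_VOLUME.match(name) is not None


def basename(name: str):
    """Return the captured base name, or None if the name is rejected."""
    m = BASENAME.match(name)
    return m.group(1) if m else None


for name in should_match + should_not_match:
    print(name, is_first(name), basename(name))
```

Every name in the first list should be accepted and every name in the second rejected, and `basename("new.file.part01.rar")` should give `new.file`.
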
reko_t
  • Hi Reko. I have two issues and I might need to trouble you again. 1. I couldn't match these two: yes.r01 and yes.r001. 2. Would it be possible to capture the first part of the filename into a capturing group? Like this: yes.part01.rar >> yes, testfile.rar >> testfile, new.file.part01.rar >> new.file. It seems that my regex skills are either pathetic or terribly rusty. Many thanks. Mridang. – Mridang Agarwalla Mar 29 '10 at 13:10
  • I edited the regexp so that it'll match the cases you specified. The second regexp will also capture the basename of the filename. – reko_t Mar 29 '10 at 13:23
  • Hi again Reko, I tried the first regex and it worked as expected, matching even the r001 and r01 types. The second regex that you wrote seems to capture the file name in cases where the file name ends in .r01, .r001 or is something.rar, but it doesn't seem to match the *part* cases. Some more help, please? Thank you for the help. – Mridang Agarwalla Mar 29 '10 at 13:40
  • @mridang: See my answer. Don't use this approach because it is wrong. – Welbog Mar 29 '10 at 13:43
  • @mridang: Sorry, I had a little error there; I've fixed the 2nd regexp now. @Welbog: It's not wrong per se. It answers the original question just fine, although I agree with you that inspecting the actual header of the file is the right way to approach this problem. – reko_t Mar 29 '10 at 13:50

Are you sure you want to match these cases?

yes.r01

They are not the first archives: .rar always is.

The order is bla.rar, then bla.r00, and only then bla.r01. You'll probably extract the files twice if you match both .r01 and .rar as the first archive.

yes.r001

.r001 doesn't exist. Do you mean the .001 files that WinRAR supports? After .r99, it's .s00. If it does exist, then somebody manually renamed the files.
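
To make the old-style sequence concrete, here is a small sketch that maps a volume index to its extension. The function name is hypothetical, and carrying on past .s99 to .t00 and so on is my assumption; the only steps stated above are .rar, .r00, .r01 and the jump from .r99 to .s00.

```python
def old_style_extension(index: int) -> str:
    """Extension of the volume at position `index` (0 = first volume)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    if index == 0:
        return "rar"  # the first volume is always plain .rar
    # Volumes 1..100 are r00..r99, volumes 101..200 are s00..s99, and so on
    # (continuing past the s-series is an assumption).
    letter_offset, number = divmod(index - 1, 100)
    letter = chr(ord("r") + letter_offset)
    if letter > "z":
        raise ValueError("index too large for this naming scheme")
    return f"{letter}{number:02d}"


for i in (0, 1, 2, 100, 101):
    print(i, old_style_extension(i))
```

Under this scheme bla.r01 is the third file of the set, which is why treating it as a first volume would unpack the same set twice.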

In theory, matching on filename should be as reliable as matching on the 0x0100 flag to find the first archive.

Gfy