
I'm trying to use Perl to reorder the content of an md5 file. For each line, I want the filename without the path then the hash. The best command I've come up with is:

$ perl -pe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5

The input file (DCIM.md5) is produced by md5sum on Linux. It looks like this:

e26ff03dc1bac80226e200c0c63d17a2  ./Path1/IMG_20150201_160548.jpg
01f92572e4c6f2ea42bd904497e4f939  ./Path 2/IMG_20150204_190528.jpg
afce027c977944188b4f97c5dd1bd101  ./Path3/Path 4/IMG_20151011_193008.jpg

  1. The hash is matched by the first group ([[:alnum:]]+) in the regular expression.
  2. Then the spaces and the path to the file are matched by .*?.
  3. Then the filename is matched by ([^/]+).
  4. The expression is anchored with ^ (apparently unnecessary here) and $. Without the $, the expression does not output what I expect.
  5. I use | rather than / as a separator to avoid escaping it in file paths.

That command returns:

IMG_20150201_160548.jpg
 e26ff03dc1bac80226e200c0c63d17a2IMG_20150204_190528.jpg
 01f92572e4c6f2ea42bd904497e4f939IMG_20151011_193008.jpg
 afce027c977944188b4f97c5dd1bd101IMG_20151011_195133.jpg

The matching is correct, the output sequence is correct (filename without path then hash) but the spacing is not: there's a newline after the filename. I expect it after the hash, like this:

IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101

It seems to me that my command outputs the newline character, but I don't know how to change this behavior. Or possibly the problem comes from the shell, not the command?

Finally, some version information:

$ perl -version
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-linux-gnu-thread-multi-64int
(with 69 registered patches, see perl -V for more detail)
ikegami
raph82
  • `[^/]` will match non-slash characters (including newlines). Maybe try `[^\n/]` – Chris Charley Sep 15 '18 at 17:58
  • If you're matching a hex string then it would be better to use `[[:xdigit:]]` rather than `[[:alnum:]]`. Alternatively, the Unicode property `\p{Hex}` is more concise and more modern. – Borodin Sep 15 '18 at 19:59
  • I think your work ("your" referring to the OP's) on solving this problem, especially your strategy for getting the filename, is ingenious! I will combine it with the answers given. It will be better than my usual: `s|^.*/(.*)$|$1|` approach (which must be used with the `-l` switch, as discussed by @Shawn, as well as the `-p` switch you've used). It'll be `perl -lpe 's|([^/])$|$1|` for me to find filenames being the last entry on a line. I love it! – bballdave025 Sep 15 '18 at 20:39
  • One other suggestion: Use `[:xdigit:]` instead of `[:alnum:]`, to help with the (admittedly unlikely) case of getting a line that doesn't start with a valid hash. – Shawn Sep 15 '18 at 21:54
  • @ChrisCharley, it's very correct that the `[^/]` will match newlines. Am I correct in surmising that it's actually the plus after it, i.e. `[^/]+`, that causes this behavior to be a problem? – bballdave025 Sep 15 '18 at 22:38

4 Answers


[^/]+ matches newlines, so the newline at the end of each input line becomes part of $2, which gets put first in your transformed $_. (And there's no newline in $1, so there's no newline at the end of $_.)

Solution: Read up on the -l option from perlrun. In particular:

-l[octnum] enables automatic line-ending processing. It has two separate effects. First, it automatically chomps $/ (the input record separator) when used with -n or -p. Second, it assigns $\ (the output record separator) to have the value of octnum so that any print statements will have that separator added back on. If octnum is omitted, sets $\ to the current value of $/ .

Shawn
  • So, basically, you're saying that all @raph82 was missing was an `-l` flag to go along with the `-pe`, then the goal would be accomplished? It worked for me when I tried it. `$ perl -lpe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5` gave me the desired output. – bballdave025 Sep 15 '18 at 20:45
  • @bballdave025 Yup. No need to muck with the regular expression and make it even more cryptic and linenoisey. Even if it is just for a one-liner. – Shawn Sep 15 '18 at 21:18
  • I like that approach. Good 'on ya. – bballdave025 Sep 15 '18 at 21:33

Alternate solution, which uses lots of concepts from other answers, and comments ...

$ perl -pe 's|(\p{hex}+).*?([^/]+?)$|$2 $1|' DCIM.md5

... and explanation.

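After investigating all the answers and trying to figure them out, I've decided that the root of the problem is that the [^/]+ is greedy. Its greediness causes it to capture the newline, and the $ anchor doesn't stop it: in Perl, $ matches either at the very end of the string or just before a newline at the end, so both positions are acceptable.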

This was hard for me to figure out, since I did a lot of parsing using sed before using Perl, and even a greedy wildcard won't capture a newline in sed. Hopefully this post will help those who (being used to sed as I am) are also wondering (as I did) why the $ isn't acting "as I expect it to."

We can see the "greedy" issue by trying the following.

Write the file:

$ cat > DCIM.md5<<EOF
> e26ff03dc1bac80226e200c0c63d17a2  ./Path1/IMG_20150201_160548.jpg
> 01f92572e4c6f2ea42bd904497e4f939  ./Path 2/IMG_20150204_190528.jpg
> afce027c977944188b4f97c5dd1bd101  ./Path3/Path 4/IMG_20151011_193008.jpg
> EOF

Get rid of the greedy [^/]+ by changing it to [^/]+?. Parse.

$ perl -pe 's|([[:alnum:]]+).*?([^/]+?)$|$2 $1|' DCIM.md5
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101

Desired output accomplished.

The accepted answer, by @Shawn,

$ perl -lpe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5

chomps the newline off each line before the substitution and adds it back on output, so the $ anchor behaves the way a sed person would expect it to.

The answer by @CrafterKolyan takes care of the greedy [^/]+ capturing the newline by saying the filename can't contain a forward slash or a newline. This answer still needs the $ anchor to prevent the following situation:

1) .*? captures the empty string (as few characters as possible)

2) [^/\n]+ captures everything up to the first slash, i.e. the two spaces and the leading .

The answer by @Borodin takes a quite different approach, but it's a great concept.

@Borodin, in addition, made a great comment that allows a more-precise/more-exact version of this answer, which is the version I put at the top of this post.

Finally, if one wants to follow the Perl programming model, here's another alternative.

$ perl -pe 's|([[:xdigit:]]+).*?([^/]+?)(\n\|\Z)|$2 $1$3|' DCIM.md5

P.S. Because sed isn't quite like perl (it has no non-greedy wildcards), here's a sed example that shows the behavior I discuss.

$ sed 's|^\([[:alnum:]]\+\).*/\([^/]\+\)$|\2 \1|' DCIM.md5

This is basically a "direct translation" of the perl expression except for the extra '/' before the [^/] stuff. I hope it will help those comparing sed and perl.

bballdave025

Use [^/\n] instead of [^/], so that the filename group cannot capture the newline:

perl -pe 's|^([[:alnum:]]+).*?([^/\n]+)$|$2 $1|' DCIM.md5

CrafterKolyan
  • Why was this down-voted? It gives precisely the output desired by the OP. In addition, it gives executable code, which is always good to have in an answer. – bballdave025 Sep 15 '18 at 21:53

Doing a substitution leaves you having to write a regex pattern that matches everything you don't want as well as everything you do. It's usually much better to match just the parts you need and build another string from them

Like this

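while ( <> ) {
    # $1 is the hash, $2 the last path component; \s* \z absorbs the trailing newline
    die "Unexpected line: $_" unless m< (\w++) .*? ([^/\s]+) \s* \z >x;
    print "$2 $1\n";
}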

or if you must have a one-liner

perl -ne 'die unless m< (\w++) .*? ([^/\s]+) \s*\z >x; print "$2 $1\n";' myfile.md5

output

IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
Borodin