I'm trying to write a regex which matches everything except a specified pattern. I've been trying to use a negative lookahead, but whenever I test my expression, it doesn't work.

I have files that are of this form:

(garbage info) filename (other garbage).extension or [garbage info] filename [other garbage].extension

For example, one of the files is [O2CXDR] report january [77012].pdf or (XEW7CK) sales commissions (99723).xls

I'm using the regex.h library in C so I believe that it is a POSIX library.

I'm hoping to extract "filename" and ".extension" so that I can write a script which will rename the files to filename.extension

So far, I have an expression that selects the garbage info with its brackets and the spaces around it, but I'm unable to select the rest.

\s*(\[|\().*?(\]|\))+\s*

and the negative lookahead I tried was:

.*(?!(\s*(\[|\().*?(\]|\))+\s*)).*

but it's just selecting everything in a single match.

I'm sure that I'm not understanding the lookaheads and lookbehind correctly. What do I have to do to fix my expression? Could somebody explain how they work since I'm a bit lost. Thanks!

3 Answers

Maybe with an expression as simple as

^(?:\(([^)]*)\)\s*([^(\r\n]*?)\s*\(([^)]*)\)|\[([^\]]*)\]\s*([^(\r\n]*?)\s*\[([^\]]*)\])\.(.*)$

we could extract those values.

Demo 1

If you don't need all of those capturing groups, simply remove the ones you don't want:

^(?:\([^)]*\)\s*([^(\r\n]*?)\s*\([^)]*\)|\[[^\]]*\]\s*([^(\r\n]*?)\s*\[[^\]]*\])\.(.*)$

Demo 2

Emma
$ cat input_file
(garbage info) filename (other garbage).extension
 (garbage info)filename(other garbage).extension
(garbage info)file name(other garbage).extension
[garbage info] filename [other garbage].extension
 [garbage info]filename[other garbage].extension
[garbage info]file name[other garbage].extension
$ sed -re 's/^\s*(\([^\)]*\)|\[[^]]*\])\s*(.*\S)\s*(\([^\)]*\)|\[[^]]*\])(\..*)$/\2\4/' input_file
filename.extension
filename.extension
file name.extension
filename.extension
filename.extension
file name.extension
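
For the C side of the question, the same idea can be sketched with regex.h, since sed -r and regcomp() with REG_EXTENDED both take POSIX extended syntax. This is only a minimal sketch under assumptions: strip_garbage is a hypothetical helper name, and \s and \S are written as [[:space:]] and [^[:space:]] because they are GNU extensions rather than POSIX.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of regex.h): writes "filename.extension"
   into out. Returns 0 on success, -1 if name does not match the expected
   layout or out is too small. */
int strip_garbage(const char *name, char *out, size_t outsz)
{
    /* Same structure as the sed expression above:
       group 2 is the file name, group 4 is the extension. */
    static const char *pattern =
        "^[[:space:]]*(\\([^)]*\\)|\\[[^]]*\\])[[:space:]]*"
        "(.*[^[:space:]])[[:space:]]*"
        "(\\([^)]*\\)|\\[[^]]*\\])(\\..*)$";
    regex_t re;
    regmatch_t m[5];
    int rc = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, name, 5, m, 0) == 0) {
        int stem_len = (int)(m[2].rm_eo - m[2].rm_so);
        int ext_len = (int)(m[4].rm_eo - m[4].rm_so);
        /* Concatenate group 2 and group 4, like \2\4 in sed. */
        int n = snprintf(out, outsz, "%.*s%.*s",
                         stem_len, name + m[2].rm_so,
                         ext_len, name + m[4].rm_so);
        if (n >= 0 && (size_t)n < outsz)
            rc = 0;
    }
    regfree(&re);
    return rc;
}
```

With a buffer such as char buf[64], calling strip_garbage("[O2CXDR] report january [77012].pdf", buf, sizeof buf) should leave "report january.pdf" in buf, which could then be passed to rename().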
AlexP

Since you haven't specified a regex engine, I'll target engines that support the tokens \K, \G, and \A (like PCRE).

The following uses a combination of a match reset (\K), a tempered greedy token, and \G(?!\A) (continue from the previous match, but not at the start of the string), all explained below:

See regex in use here

Note: remove empty matches

\s*[[(].*?[])]\s*\K|\G(?!\A)(?:(?!\s*[[(].*?[])]\s*).)+
  • Match one of the following:
    • Option 1:
      • \s* Match any whitespace any number of times
      • [[(] Match either [ or (
      • .*? Match any character any number of times, but as few as possible (lazy matching)
      • [])] Match either ] or )
      • \s* Match any whitespace any number of times
      • \K Reset match - sets the given position in the regex as the new start of the match. This means that nothing preceding this tag will be captured in the overall match.
    • Option 2:
      • \G(?!\A) Match only at the starting point of the search or at the position where the previous successful match ended, but not at the start of the string.
      • (?:(?!\s*[[(].*?[])]\s*).)+ Tempered greedy token: matches any character one or more times, as long as the negative lookahead pattern (the same as the first option) does not match at that position.
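
POSIX regex.h has no \K, \G, or lookarounds, so the pattern above cannot be handed to regcomp() as written. A rough equivalent, swapped in here as a different technique rather than a translation, is to match only the garbage blocks in a loop and keep whatever lies between them. The sketch below assumes blocks are not nested; remove_blocks is a hypothetical helper name, and the lazy .*? is replaced by the negated class [^])]* because POSIX has no lazy quantifiers.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies name into out with every bracketed or
   parenthesized block (and the whitespace around it) removed.
   Returns 0 on success, -1 on error or if out is too small. */
int remove_blocks(const char *name, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m;
    const char *p = name;
    size_t used = 0;
    size_t tail;

    if (regcomp(&re, "[[:space:]]*[[(][^])]*[])][[:space:]]*",
                REG_EXTENDED) != 0)
        return -1;

    /* Each match is one garbage block; keep the text in front of it. */
    while (regexec(&re, p, 1, &m, 0) == 0) {
        size_t keep = (size_t)m.rm_so;
        if (used + keep >= outsz) {
            regfree(&re);
            return -1;
        }
        memcpy(out + used, p, keep);
        used += keep;
        p += m.rm_eo; /* resume right after the block */
    }
    regfree(&re);

    /* Copy whatever follows the last block, for example ".pdf". */
    tail = strlen(p);
    if (used + tail >= outsz)
        return -1;
    memcpy(out + used, p, tail);
    out[used + tail] = '\0';
    return 0;
}
```

Because each pass copies only the text before a block and then skips the block itself, no empty matches need to be filtered out afterwards.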
ctwheels