
I have some filenames that contain redundant words that I want to get rid of, such as VIS, THE, etc.

I was using a regex, but the problem is that the words to be removed can appear at the front or at the back of the filename. To make it clearer, some sample filenames are:

filenames = ['a_VIS-MarnehNew_24RGB_1110.jpg',
             'Marne_04_Vis.jpg',
             'VIS_jeep_smoke.jpg',
             'IR_fk_ref_01_005.jpg',
             'c_LWIR-MarnehNew_24RGB_1110.jpg',
             'LWIR-MarnehNew_15RGB_603.jpg',
             'Movie_01_IR.jpg',
             'THE_fk_ge_03_005.jpg']

The redundant words are VIS, Vis, IR, LWIR and THE, together with every character before them if they appear at the front, or every character after them if they appear at the back.

The correct results would be:

filenames = ['MarnehNew_24RGB_1110',
             'Marne_04',
             'jeep_smoke',
             'fk_ref_01_005',
             'MarnehNew_24RGB_1110',
             'MarnehNew_15RGB_603',
             'Movie_01',
             'fk_ge_03_005']

I tried this code, but it is obviously insufficient for the cases at the back:

import re
pattern = re.compile(r'(?:VIS|Vis|IR|LWIR)(?:-|_)(\w+)')  # raw string so \w is not treated as a string escape

for i, filename in enumerate(filenames):
    matches = re.search(pattern, filename)
    if matches:
        print(i, matches.group(1))

0 MarnehNew_24RGB_1110
2 jeep_smoke
3 fk_ref_01_005
4 MarnehNew_24RGB_1110
5 MarnehNew_15RGB_603

So, how do I also get rid of the words at the back?

Eypros
  • Why is `a_VIS-` removed from the first example? That doesn't match the pattern. – Martijn Pieters Oct 17 '18 at 11:58
  • What do you mean? The desired pattern or the pattern I provided? As for the latter, I am not sure why, to be honest (I am no regex expert). – Eypros Oct 17 '18 at 12:11
  • Something like [`^(?:(?:(?!VIS|IR|LWIR|THE).){0,4}(VIS|IR|LWIR|THE)[-_])?((?:(?!_(?:VIS|IR|LWIR|THE))\w)*)`](https://regex101.com/r/zqpdKt/2/) – Sebastian Proske Oct 17 '18 at 12:14
  • Another approach would be removing the file extension, splitting on `(?:^|[_-])(?:|VIS|IR|LWIR|THE)(?:[_-]|$)` and taking the longest element of the split result (which would somewhat fail if the name were split into 3 items and the middle one weren't the longest, though that could be accounted for as well). – Sebastian Proske Oct 17 '18 at 12:19
  • Right, I see that `(?:VIS|Vis|IR|LWIR)(?:-|_)(\w+)` isn't anchored, so you effectively allow the substrings to appear *anywhere* in the pattern and just take the remainder. You could use `(?:(?:VIS|Vis|IR|LWIR)(?:-|_))?(\w+)(?:(?:-|_)(?:VIS|Vis|IR|LWIR))?` to allow the pattern both before and after. – Martijn Pieters Oct 17 '18 at 12:23

1 Answer

Based on your examples, you could use

(?:^(?:\w_)?(?:VIS|Vis|IR|LWIR|THE)[-_]?)
|
(?:_?(?:VIS|Vis|IR|LWIR))?\.jpg$

Every match needs to be replaced with the empty string; see a demo on regex101.com.

Broken down, this says:
(?:                          # non-capturing group
    ^                        # anchor at the beginning of a string
    (?:\w_)?                 # \w_ optional
    (?:VIS|Vis|IR|LWIR|THE)  # one of ...
    [-_]?                    # - or _ optional
)
|                            # OR
(?:                          # optional non-capturing group
    _?                       # _ optional
    (?:VIS|Vis|IR|LWIR)      # one of ...
)?
\.jpg$                       # literal .jpg at the end of the string
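
As a rough sketch of how this pattern might be applied in Python, the snippet below compiles it in verbose mode and replaces every match with the empty string using `re.sub`. The names `STRIP_PATTERN` and `cleaned` are made up for illustration, and the filenames are the ones from the question. It assumes, as the pattern does, that every file ends in `.jpg` and that at most one character plus an underscore precedes a leading keyword.

```python
import re

# Verbose form of the pattern above (STRIP_PATTERN is an illustrative name).
STRIP_PATTERN = re.compile(r"""
    (?:
        ^                        # start of the filename
        (?:\w_)?                 # optional single character plus underscore
        (?:VIS|Vis|IR|LWIR|THE)  # leading keyword
        [-_]?                    # optional separator
    )
    |
    (?:
        _?                       # optional underscore
        (?:VIS|Vis|IR|LWIR)      # trailing keyword
    )?
    \.jpg$                       # the extension itself is always removed
""", re.VERBOSE)

filenames = ['a_VIS-MarnehNew_24RGB_1110.jpg',
             'Marne_04_Vis.jpg',
             'VIS_jeep_smoke.jpg',
             'IR_fk_ref_01_005.jpg',
             'c_LWIR-MarnehNew_24RGB_1110.jpg',
             'LWIR-MarnehNew_15RGB_603.jpg',
             'Movie_01_IR.jpg',
             'THE_fk_ge_03_005.jpg']

# Replace every match (leading prefix and trailing suffix) with nothing.
cleaned = [STRIP_PATTERN.sub('', name) for name in filenames]

for name in cleaned:
    print(name)
```

Because both alternatives live in one pattern, a single `re.sub` call strips the leading and the trailing part in one pass. Names with extra characters after a trailing keyword (for example `abcdef_VIS_a.jpg`) are not handled by this sketch.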
Jan
  • OP's test cases lack a real case for _every character after them if they appear at the back_, e.g. `abcdef_VIS_a.jpg`. I'm also not quite sure whether _every character before_ can be read as only 1 character, but again the test cases seem to imply so (or aren't good enough). But I might as well be overthinking this. – Sebastian Proske Oct 17 '18 at 12:32
  • @SebastianProske: Let's wait and see - otherwise I'll delete this answer. – Jan Oct 17 '18 at 12:38
  • My real cases do not have any characters after the keywords at present, so your solution suits my needs. I did indeed post that the solution should also handle all trailing characters, though. Would it be difficult to include this case as well? – Eypros Oct 17 '18 at 12:51
  • @Eypros: No, but it would be easier to pose a newer question, I guess. – Jan Oct 17 '18 at 14:02