1

I need to collect all file names with a media extension (.png, .jpg, .mp4, .avi, .flv) into a list using a regex. What I have tried is below:

import re
st = '''
/mnt/data/Content:
ManifestFile.txt                               kompass-tags_e2d5dac5ba548a1206b5d40f58e448e4  tmp_content
default_55a655f340908dce55d10a191b6a0140       price-tags_b3c756dda783ad0691163a900fb5fe15

/mnt/data/Content/default_55a655f340908dce55d10a191b6a0140:
LayoutFile_34450b33c8b44af409abb057ddedfdfe.txt  blank_decommissioned.jpeg                         tmp_content
ManifestFile.txt                                 blank_unregistered.png

/mnt/data/Content/default_55a655f340908dce55d10a191b6a0140/tmp_content:

/mnt/data/Content/kompass-tags_e2d5dac5ba548a1206b5d40f58e448e4:
0001111084948-kompass-LARGE.avi                  0076738703404-kompass-LARGE.png                  LayoutFile_7c1b3793e49204982e0e41923303c17b.txt
0001111087321-kompass-LARGE.jpg                  0076738703419-kompass-LARGE.mp4                  ManifestFile.txt
0001111087325-kompass-LARGE.png                  0076738703420-kompass-LARGE.png                  tmp_content

/mnt/data/Content/kompass-tags_e2d5dac5ba548a1206b5d40f58e448e4/tmp_content:

/mnt/data/Content/price-tags_b3c756dda783ad0691163a900fb5fe15:
0001111084948-consumer-large.png                 0076738703404-consumer-large.png                 LayoutFile_a694b1e05d08705aaf4dd589ac61d493.txt
0001111087321-consumer-large.png                 0076738703419-consumer-large.avi                 ManifestFile.txt
0001111087325-consumer-large.mp4                 0076738703420-consumer-large.png                 tmp_content

/mnt/data/Content/price-tags_b3c756dda783ad0691163a900fb5fe15/tmp_content:

/mnt/data/Content/tmp_content:

'''
patt = '^.*(.png|.jpg|.gif|.bmp|.jpeg|.mp4|.avi|.flv)'
patt = '^.*$.png'

fList = re.findall(patt, st)
print(fList)

I know very little about regex, so please help.

Sachhya
  • Try `patt = r'\S+\.(?:png|jpe?g|gif|bmp|mp4|avi|flv)\b'`. Check [this Python demo](https://ideone.com/68eyG0) for the result - is it the expected result? – Wiktor Stribiżew Mar 23 '18 at 09:49
  • Do you need a regex for homework or something? `str.endswith()` would seem a much simpler way to go, like `[s for s in st.split() if s.endswith(('.png', '.jpg', '.mp4', '.avi', '.flv'))]` – Chris_Rands Mar 23 '18 at 09:54

3 Answers

3

The ^.*(.png|.jpg|.gif|.bmp|.jpeg|.mp4|.avi|.flv) pattern matches the start of the string, then any 0+ chars other than line break chars, as many as possible, and then one of the extensions preceded by any single char (an unescaped . matches any char but a line break char). This can't work for you: . matches too much here, ^ only matches at the start of the string, so you get at most one match, and since the alternatives sit inside a capturing group, re.findall returns only the captured extension rather than the file name.

The ^.*$.png pattern matches the start of the string, any 0+ chars other than line break chars, then the end of the string, and then any char followed by png. Since the only char that can follow $ is a line break, and . does not match line breaks, this pattern will never match any string.
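To make the two failure modes concrete, here is a minimal sketch. The `sample` string is a made-up stand-in, not the listing from the question, and the variable names are only for illustration.

```python
import re

# Hypothetical two-line sample (an assumption, not the question's data).
sample = "a.png  notes.txt\nb.jpg  c.mp4\n"

# First attempt: '^' anchors at the start of the string only, so there is
# at most one match, and the capturing group makes findall return just
# the text captured by the group.
first = re.findall(r'^.*(.png|.jpg|.gif|.bmp|.jpeg|.mp4|.avi|.flv)', sample)

# Second attempt: '$' asserts the end of the string, so '.png' has
# nothing left to match after it.
second = re.findall(r'^.*$.png', sample)

print(first)   # only the extension of the first file
print(second)  # empty list
```

On this sample the first pattern finds a single extension and the second finds nothing, which matches the explanation above.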

Judging by your description you need

patt = r'\S+\.(?:png|jpe?g|gif|bmp|mp4|avi|flv)\b'

See the regex demo.

Details

  • \S+ - 1+ non-whitespace chars
  • \. - a literal dot
  • (?:png|jpe?g|gif|bmp|mp4|avi|flv) - a non-capturing group (it groups the alternatives without capturing anything, so re.findall returns the whole match instead of just the group contents) matching any of the listed extensions
  • \b - a word boundary (actually, it is optional, but it will make sure you match an extension above as a whole word).
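The difference between the group types and the effect of \b can be sketched as follows. The `sample` string and the shortened extension list are assumptions chosen for illustration; `clip.mp4x` is included only to show what \b prevents.

```python
import re

# Hypothetical sample (an assumption, not the question's data).
sample = "photo.jpeg  notes.txt  clip.mp4x  movie.avi"

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'\S+\.(?:png|jpe?g|mp4|avi)\b', sample)

# Capturing group: findall returns only what the group captured.
ext_only = re.findall(r'\S+\.(png|jpe?g|mp4|avi)\b', sample)

# Without \b, 'clip.mp4x' is partially matched as 'clip.mp4'.
no_boundary = re.findall(r'\S+\.(?:png|jpe?g|mp4|avi)', sample)

print(whole)
print(ext_only)
print(no_boundary)
```

If your listing never contains names like `clip.mp4x`, the pattern behaves the same with or without \b.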

See the Python demo:

import re
st = '<YOUR_STRING_HERE>'
patt = r'\S+\.(?:png|jpe?g|gif|bmp|mp4|avi|flv)\b'
fList = re.findall(patt, st)
for s in fList:
    print(s)

yielding

blank_decommissioned.jpeg
blank_unregistered.png
0001111084948-kompass-LARGE.avi
0076738703404-kompass-LARGE.png
0001111087321-kompass-LARGE.jpg
0076738703419-kompass-LARGE.mp4
0001111087325-kompass-LARGE.png
0076738703420-kompass-LARGE.png
0001111084948-consumer-large.png
0076738703404-consumer-large.png
0001111087321-consumer-large.png
0076738703419-consumer-large.avi
0001111087325-consumer-large.mp4
0076738703420-consumer-large.png
Wiktor Stribiżew
  • What does the 'r' at the beginning of the pattern mean? – Sachhya Mar 23 '18 at 10:06
  • @Sachhya It is a [raw string literal prefix](https://stackoverflow.com/questions/4780088/what-does-preceding-a-string-literal-with-r-mean). [More details here](https://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-flags-do-and-what-are-raw-string-literals). If you do not add it, you will have to escape ``\`` before regex escapes (`r"\b"` = `"\\b"`). BTW, `jpe?g` = `jpg|jpeg` but is more efficient (`e?` matches 1 or 0 `e` chars). – Wiktor Stribiżew Mar 23 '18 at 10:09
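A small sketch of what the `r` prefix changes, assuming a made-up `text` value: in a normal string literal `"\b"` is a single backspace character, so the regex engine never sees a word boundary.

```python
import re

# In a normal string literal, "\b" is one character (backspace).
plain_len = len("\b")
# In a raw string literal, r"\b" is two characters: a backslash and 'b'.
raw_len = len(r"\b")
same = (r"\b" == "\\b")

# Hypothetical input (an assumption for illustration).
text = "a.png b.pngx"

# Raw string: \b reaches the regex engine as a word boundary.
with_raw = re.findall(r'\S+\.png\b', text)

# Normal string: the trailing "\b" is a literal backspace, which the
# text does not contain, so nothing matches.
with_plain = re.findall('\\S+\\.png\b', text)

print(plain_len, raw_len, same)
print(with_raw)
print(with_plain)
```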
2

You can use the RegEx \S+\.(?:png|jpg|gif|bmp|jpeg|mp4|avi|flv)

  • \S+ matches one or more non-whitespace chars

  • \. matches a dot

  • (?: ... ) is a non capturing group

  • png|jpg|gif|bmp|jpeg|mp4|avi|flv matches any one of your defined extensions

Demo.

Zenoo
  • Thanks for your help, but this returns a list of tuples whose first element is the file name and whose second element is its extension, like: [('blank_decommissioned', '.jpeg'), ('blank_unregistered', '.png'), ('0001111084948-kompass-LARGE', '.avi'), ('0076738703404-kompass-LARGE', '.png'), ('0001111087321-kompass-LARGE', )] – Sachhya Mar 23 '18 at 09:59
  • @Zenoo You do not need the outer capturing and the first non-capturing groupings. Besides, `[^\s]` = `\S`. If you put `\.` outside the capturing group, you won't have to repeat it before each alternative. – Wiktor Stribiżew Mar 23 '18 at 10:01
1

Try this:

patt = r'[^ \n]+?\.(?:png|jpg|gif|bmp|jpeg|mp4|avi|flv)'

[^ \n] is a negated character class that matches any char except a space or a newline.

The dot (.) is a special character and needs to be escaped with a backslash.
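One thing to keep in mind: [^ \n] still allows other whitespace such as tabs, while \S does not. Here is a minimal sketch of the difference, assuming a made-up tab-separated `sample` (the listing in the question appears to use spaces, in which case both forms give the same result).

```python
import re

# Hypothetical sample with a tab between the first two names.
sample = "a.png\tb.jpg c.mp4"

exts = r'\.(?:png|jpg|gif|bmp|jpeg|mp4|avi|flv)'

# [^ \n] excludes only spaces and newlines, so the tab becomes
# part of the second match.
negated = re.findall(r'[^ \n]+?' + exts, sample)

# \S excludes every whitespace char, including the tab.
non_space = re.findall(r'\S+' + exts, sample)

print(negated)
print(non_space)
```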

Try it online here.

O.O.Balance