11

I am looking for a way to extract the filename and extension from a URL using Python.

Let's say the URL looks as follows:

picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"

How would I go about getting the following?

filename = "da4ca3509a7b11e19e4a12313813ffc0_7"
file_ext = ".jpg"
ApPeL

7 Answers

33
try:
    # Python 3
    from urllib.parse import urlparse
except ImportError:
    # Python 2
    from urlparse import urlparse
from os.path import splitext, basename

picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
disassembled = urlparse(picture_page)
filename, file_ext = splitext(basename(disassembled.path))

Because basename keeps only the last path component, the filename carries no preceding / and no subdirectories.

Charles L.
Christian Witts
  • 2
    the preceding '/' is not the only problem, if the url contains other subdirectories, they will be kept in the filename, maybe OP wants them, maybe not ;) – Cédric Julien May 11 '12 at 13:38
  • @Cédric Julien - Thanks for the reminder about .basename to get just the last portion, edited the post to reflect so. :) – Christian Witts May 11 '12 at 13:47
  • 6
    This code also works with files that have no extension and with URLs like `http://server.com/common/image.jpg?xx=345&yy=qwerty`. BTW, in 3.x one needs to use `from urllib.parse import urlparse` – El Ruso Nov 11 '15 at 19:12
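
The two cases mentioned in the comment above (a query string, and a path with no extension) can be checked with a minimal sketch. The helper name `split_url_filename` is made up for illustration, and the second URL is an assumed example that does not appear in the thread.

```python
from os.path import basename, splitext
from urllib.parse import urlparse


def split_url_filename(url):
    """Return (filename, extension) for the last path segment of a URL."""
    # urlparse separates the path from the query string and fragment,
    # so "?xx=345&yy=qwerty" never reaches splitext.
    path = urlparse(url).path
    return splitext(basename(path))


print(split_url_filename("http://server.com/common/image.jpg?xx=345&yy=qwerty"))
# -> ('image', '.jpg')
print(split_url_filename("http://server.com/common/image"))
# -> ('image', '')
```

When the last segment has no dot, splitext returns an empty string for the extension instead of raising.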
12

Try urlsplit (in urllib.parse on Python 3, in the urlparse module on Python 2) to split the URL, and then os.path.splitext to retrieve the filename and extension (use os.path.basename to keep only the last path component):

from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit
import os.path

picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"

print(os.path.splitext(os.path.basename(urlsplit(picture_page).path)))

>>> ('da4ca3509a7b11e19e4a12313813ffc0_7', '.jpg')
Cédric Julien
10
filename = picture_page.split('/')[-1].split('.')[0]
file_ext = '.'+picture_page.split('.')[-1]
Niek de Klein
6
# Here's your link:
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"

#Here's your filename and ext:
filename, ext = (picture_page.split('/')[-1].split('.'))

picture_page.split('/') returns a list of the strings in your URL that were separated by /. With Python list indexing, -1 gives you the last element of the list. In your case, that is the filename: da4ca3509a7b11e19e4a12313813ffc0_7.jpg

Splitting that by the delimiter . gives two values: da4ca3509a7b11e19e4a12313813ffc0_7 and jpg, because they are separated by the period you passed to split().

Since the last split returns exactly two values, you can unpack them into two names. Note that ext comes back without the leading dot. The result is equivalent to:

filename,ext = ('da4ca3509a7b11e19e4a12313813ffc0_7', 'jpg')

bad_keypoints
  • 1
    While your code might (or not) work it would be great if you add a brief explanation about the problem and how does your code solve it. As is it does not provide a full answer according to [help center](http://stackoverflow.com/help/how-to-answer) – dic19 Sep 18 '14 at 15:19
  • It will always work, provided the file in each URL always has an extension. He could add a simple if statement to handle files with no extension (`if len(url.split('/')[-1].split('.'))==1: #No extension; else: #Get filename,ext`) – bad_keypoints Sep 22 '14 at 07:57
  • Please note the point of my comment is not if your code actually works or it doesn't. It's about answer's quality. Note that your answer is better now since you have added a brief explanation as suggested. +1 for your edit :) – dic19 Sep 22 '14 at 11:29
  • Thank you anyways, it made me make my answer better. – bad_keypoints Sep 22 '14 at 13:12
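
The `if` check suggested in the comments above could be sketched as follows. This is only one possible shape: the helper name `split_last_segment` and the example.com URLs are assumptions for illustration, and `rpartition` is swapped in for `split('.')` so that names containing several dots do not break the two-value unpacking.

```python
def split_last_segment(url):
    """Split the last '/'-separated segment of a URL at its final dot."""
    last_segment = url.split('/')[-1]
    # rpartition returns ('', '', last_segment) when there is no dot,
    # so an empty separator means the segment has no extension.
    name, dot, ext = last_segment.rpartition('.')
    if not dot:
        return last_segment, ''
    return name, ext


print(split_last_segment("http://example.com/files/README"))
# -> ('README', '')
print(split_last_segment("http://example.com/files/archive.tar.gz"))
# -> ('archive.tar', 'gz')
```

Like the answer it builds on, this sketch does not strip query strings, and it assumes the URL actually ends in a file segment.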
3

os.path.splitext will help you extract the filename and extension once you have extracted the relevant string from the URL using urlparse:

   fName, ext = os.path.splitext('yourImage.jpg')
Levon
0

This is an easy way to find the image name and extension using a regular expression.

import re

picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"

# Raw string so the backslashes reach the regex engine unchanged.
regex = re.compile(r'(.*/(?P<name>\w+)\.(?P<ext>\w+))')

# Search once and reuse the match object for both named groups.
match = regex.search(picture_page)
print(match.group('name'))
print(match.group('ext'))
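
The regular expression approach returns no match object when the pattern does not match, so calling `.group()` directly would raise an error. A hedged sketch of a guarded version follows. The helper name `match_name_and_ext` and the example.com URL are assumptions for illustration, and the pattern is the one above without its outer group.

```python
import re

# Same named groups as the pattern above, written as a raw string.
FILENAME_RE = re.compile(r'.*/(?P<name>\w+)\.(?P<ext>\w+)')


def match_name_and_ext(url):
    """Return (name, ext) from the regex match, or None if nothing matches."""
    match = FILENAME_RE.search(url)
    if match is None:
        return None
    return match.group('name'), match.group('ext')


print(match_name_and_ext("http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"))
# -> ('da4ca3509a7b11e19e4a12313813ffc0_7', 'jpg')
print(match_name_and_ext("no slash here"))
# -> None
```

One caveat with this pattern: for a URL whose last segment has no dot, such as `http://example.com/files/README`, the regex backtracks and matches the hostname instead, returning `('example', 'com')`.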
-2
>>> import re
>>> s = 'picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"'
>>> re.findall(r'\/([a-zA-Z0-9_]*)\.[a-zA-Z]*\"$',s)[0]
'da4ca3509a7b11e19e4a12313813ffc0_7'
>>> re.findall(r'([a-zA-Z]*)\"$',s)[0]
'jpg'
theharshest