I want to match some substrings in a URL.

Example URLs:

www.google.com/images

www.google.com/images.pdf

Currently I have the regex r"([^.]*$)"

This works as intended for case 2 but fails for case 1.

www.google.com/images.pdf matches .pdf -> Working as intended

www.google.com/images matches .com/images -> Failing

I want the regex not to match at all when, scanning from the end, it reaches a / before it finds a .

Please don't suggest doing this with .endswith. I don't have the list of all possible extensions that I need to match.

skyfail
  • I don't fully understand what you want but how about `r"(\.[^./]*$)"`? [Try it](https://regex101.com/r/Fs6QYh/1) – Michael Butscher Nov 26 '18 at 21:32
  • @MichaelButscher That's exactly what I want !! Can you please post it as an answer so that I can mark it as correct. And if possible a small write on what's happening ? Thanks a ton ! – skyfail Nov 26 '18 at 21:35
  • @anubhava It doesn't sadly. It matches /images.pdf instead of just .pdf. – skyfail Nov 26 '18 at 21:38
  • Maybe I'm not understanding your question, but it seems you want to do something like this https://stackoverflow.com/questions/4776924/how-to-safely-get-the-file-extension-from-a-url If you just look at strings of URLs then never mind – DerMolly Nov 26 '18 at 21:40
  • @DerMolly Interesting find. I should've found that. That might be something I can use. But for now Michael's answer works for me. Thanks a ton ! – skyfail Nov 26 '18 at 21:44

2 Answers

Use the expression r"(\.[^./]*$)"

It's best to look at it from end to beginning:

Starting from the end of the line, take as many characters as possible that are neither a / (so the whole match stays within the last path element) nor a . (so the match cannot extend past the suffix). Immediately before those characters there must be a ., so the whole match is the suffix of the last path element (usually a file name), if one is present.
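As a quick check of the explanation above, here is a minimal Python sketch applied to the two URLs from the question. The helper name `url_suffix` is made up for illustration and is not part of the answer. Note that a bare host name such as www.google.com would still yield ".com", and a trailing query string would be included in the match, so this only works under the assumption that the input is a plain path-like URL string.

```python
import re

# Pattern from the answer: a literal dot, then any characters that are
# neither '.' nor '/', anchored at the end of the string.
SUFFIX_RE = re.compile(r"(\.[^./]*$)")


def url_suffix(url):
    """Return the trailing suffix (e.g. '.pdf'), or None if there is none.

    Hypothetical helper for illustration only.
    """
    match = SUFFIX_RE.search(url)
    return match.group(1) if match else None


print(url_suffix("www.google.com/images.pdf"))  # .pdf
print(url_suffix("www.google.com/images"))      # None
```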

Michael Butscher

Try this:

/[^\.]*(\..*)$

From left to right, this says: look for a forward slash, followed by any number ("*") of characters other than a period ("[^\.]"), then capture a period ("\.") followed by all remaining characters up to the end of the string ("$"). The suffix ends up in capture group 1; the overall match also includes the slash and the file name. Inside a character class a period is already literal, so "[^.]" works in place of "[^\.]" and the backslash is redundant there.
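A small Python sketch of how this pattern behaves, assuming the suffix is read from capture group 1 (the helper name `suffix_via_group` is hypothetical). It also shows one limitation: because the pattern accepts the first slash from which a match is possible, a URL with a scheme such as http:// captures far more than the extension.

```python
import re

# Pattern from this answer, with the redundant backslash inside the
# character class dropped.
PATTERN = re.compile(r"/[^.]*(\..*)$")


def suffix_via_group(url):
    """Return capture group 1, or None when the pattern does not match.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.search(url)
    return match.group(1) if match else None


print(suffix_via_group("www.google.com/images.pdf"))     # .pdf
print(suffix_via_group("www.google.com/images"))         # None
# With a scheme, the match starts at the first '/' of 'http://':
print(suffix_via_group("http://www.google.com/images"))  # .google.com/images
```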

Bill M.