37

In my bash script I need to extract just the path from a given URL. For example, from a variable containing the string:

http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth

I want to extract to some other variable only the:

/one/more/dir/file.exe

part. Of course, the login, password, filename and parameters are optional.

Since I am new to sed and awk, I am asking for your help. Please advise me how to do it. Thank you!

Arek
  • If the OP asks for an answer using bash, awk and/or sed, those are the languages that the answers should target. I'm getting sick of this "substituting your language of choice" on SO. Recently I asked a question about Javascript without a framework because I knew the platform I was targeting wouldn't support it. But all I got was a discussion about why I couldn't use jQuery. Also, I once was developing on an embedded device and Perl for instance was not installed, so I needed to do these sorts of things with awk. So answer the questions using the OP's language(s), or don't answer at all. – Dexygen Jul 29 '09 at 23:18
  • It depends on your default. In your case, you default to 'assume all requirements not specified in the question are explicitly forbidden'. In this case, the poster is a novice with regexes, and almost certainly doesn't care whether the answer is in sed/awk, perl, or any other *standard* tool. Apart from specialised embedded devices, there is no argument for 'Perl may not be present on the platform'. SO should be a tool for learning as well as a way of getting specific answers. The fact the OP accepted a Perl answer speaks for itself. Your negative votes are a mistake. – ire_and_curses Jul 30 '09 at 07:20
  • @ire_and_curses you could not be more mistaken that my "negative votes are a mistake" This question's tags contain bash, awk & sed. I was led here through a search on one of those or the other, I forget which. But I should not have to wade through answers using Perl, Ruby or anything else that the question is not tagged with, to find the information pertinent to the search I ran. Indeed I argue it is completely counter to the intention of SO, as it currently exists, to answer questions using languages the OP did not specify. – Dexygen Aug 03 '09 at 22:09
  • Have a look at [http://stackoverflow.com/questions/27745/getting-parts-of-a-url-regex](http://stackoverflow.com/questions/27745/getting-parts-of-a-url-regex) – Paul Lydon Jul 29 '09 at 11:50

13 Answers

92

There are built-in functions in bash to handle this, e.g., the string pattern-matching operators:

  1. '#' remove minimal matching prefixes
  2. '##' remove maximal matching prefixes
  3. '%' remove minimal matching suffixes
  4. '%%' remove maximal matching suffixes

For example (all of these were tested on Bash 3.2.57(1)-release (x86_64-apple-darwin20)):

FILE=/home/user/src/prog.c
echo ${FILE#/*/}  # ==> user/src/prog.c
echo ${FILE##/*/} # ==> prog.c
echo ${FILE##*/}  # ==> prog.c (alternate version for some systems)
echo ${FILE%/*}   # ==> /home/user/src
echo ${FILE%%/*}  # ==> nil (i.e., the empty string)
echo ${FILE%.c}   # ==> /home/user/src/prog

All this is from the excellent book "A Practical Guide to Linux Commands, Editors, and Shell Programming" by Mark G. Sobell (http://www.sobell.com/).
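Applied to the URL from the question, a minimal sketch (assuming the URL is in a variable named url; the same operators strip the scheme, the login/host part and the query string in turn):

url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
tmp="${url#*//}"      # ==> login:password@example.com/one/more/dir/file.exe?a=sth&b=sth
tmp="${tmp#*/}"       # ==> one/more/dir/file.exe?a=sth&b=sth
path="/${tmp%%\?*}"   # ==> /one/more/dir/file.exe
echo "$path"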

JESii
  • Thought I'd let you know that this post greatly helped me. Thanks! – worsnupd Jan 26 '13 at 20:29
  • Okay, but in the URL you have // at first, so we need everything after the third slash - how does that work? – Alex Jun 06 '13 at 10:34
  • @Alex... the trick is in the definition of the ## operator: i.e., remove MAXIMAL prefixes; that means it removes everything up to the last '/'. HTH – JESii Jun 07 '13 at 13:02
  • Thank you, @JESii, for this. I asked the intertubes 50 different ways and finally came across this question and your answer. – tobinjim Aug 07 '13 at 09:09
  • For URL, I would use - `printf -- "%s" "${URL##*/}"` - which will remove anything leading up to the final "/" and is scheme independent. – cchamberlain Jun 19 '15 at 17:11
  • Note if you have query string params you need to either use 2 separate lines of parameter substitution or you can pipe through sed - `printf -- "%s" "${url##*/}" | sed 's/?.*//'` which replaces the optional ? and anything after it with nothing. – cchamberlain Jun 19 '15 at 17:31
  • I wasn't looking for this question but your comment, nonetheless, helped me. Though now I'm wondering something (possibly off-topic, but I don't want to be told by an admin to refer to this post). Basically, I'm using your response to list out my node modules for my node installation installed via nvm, and I was wondering if there's a way to take the output (which works) and pipe it into `npm install -g` for when I install a new node version. I tried just piping it to `pbcopy` and pasting it after `npm install -g`, but I get a bunch of failures. – sunny-mittal Jul 30 '15 at 06:46
  • @sunny-mittal - I would try piping that into xargs which you can then use to run the command once for each argument. I always have to look it up (See http://ss64.com/bash/xargs.html for example), but it might look something like this from a simple ls: `ls | xargs npm install`. I found xargs a little hard to wrap my head around at first, but it's really powerful for "batching" commands on multiple files. – JESii Aug 01 '15 at 03:37
  • How would one extract the last part of an url if it ends with a `/`? So, the url would be: `https://www.my.url/with/several/folders/` and the result should just be `folders`. – The Oddler Jul 26 '20 at 17:33
  • @TheOddler - I don't know of any way to do that with the above "Pattern Matching" syntax. However, it's easy with `basename https://www.my.url/with/several/folders/` which (on my bash version 5.0.17(1)) returns the desired `folders`. HTH – JESii Jul 30 '20 at 22:48
  • @JESii Cool, that's much simpler! What I ended up doing before your post was a bit more manual work: first remove the `/` at the end by doing `${LINK%/}` and then the beginning part with `${LINK##https://*/}`. Now using `basename` it's cleaner and much more readable. Thanks! – The Oddler Jul 31 '20 at 06:54
  • @Anthony O. Interesting change I just noticed; on my version of Bash (3.2.57(1)-release (x86_64-apple-darwin20)) both syntax variations work just fine. What bash version are you using? – JESii Dec 17 '22 at 18:41
33

In bash:

URL='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
URL_NOPRO=${URL:7}
URL_REL=${URL_NOPRO#*/}
echo "/${URL_REL%%\?*}"

This works only if the URL starts with http:// or a protocol of the same length. Otherwise, it's probably easier to use a regex with sed, grep or cut ...
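For the record, with the example URL this prints:

/one/more/dir/file.exe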

saeedgnu
  • A simple bash solution that doesn't require Ruby or Perl. Thanks! – Liam Sep 22 '12 at 20:49
  • I'll never understand why / when people post *brilliant* examples, **WITHOUT** the obvious inclusion of the `example output`. for example, here.. a simple line, is all that's needed... `↳/one/more/dir/file.exe` – Alex Gray Oct 02 '12 at 20:55
  • To force the lazy user to try it himself?! :D – saeedgnu Oct 07 '12 at 14:57
  • This only gets me the file name. I don't understand how it's considered the solution. How do you get /one/more/dir/file.exe? – Glitches Jan 28 '15 at 07:28
  • @Glitches It prints out `/one/more/dir/file.exe`, you can put it into a variable if you want: `MYVAR="/one/more/dir/${AFTER_SLASH%%\?*}"` – saeedgnu Jan 28 '15 at 13:16
  • @ilius: I understand that but AFTER_SLASH is only the file name. It doesn't have the whole path. You need the whole path somehow before hand in order to build that string. How do you just get "/one/more/dir/file.exe" out of the URL? – Glitches Jan 28 '15 at 21:17
  • @Glitches You are right, I edited my answer (works only if URL starts with `http://` or a protocol with the same length) – saeedgnu Jan 29 '15 at 05:25
  • You can remove the protocol from the url regardless of length with `URL_NOPRO=${URL#*//}`. That will work with `http://`, `https://`, `ftp://`, though not with `file:///` (can’t handle 3 slashes). – Andrew Patton Mar 04 '15 at 19:15
10

This uses bash and cut as another way of doing this. It's ugly, but it works (at least for the example). Sometimes I like to use what I call cut sieves to whittle down the information that I am actually looking for.

Note: Performance-wise, this may be a problem.

Given those caveats:

First, let's echo the line:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

Which gives us:

http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth

Then let's cut the line at the @ as a convenient way to strip out the http://login:password:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2

That gives us this:

example.com/one/more/dir/file.exe?a=sth&b=sth

To get rid of the hostname, let's do another cut and use the / as the delimiter while asking cut to give us the second field and everything after (essentially, to the end of the line). It looks like this:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2-

Which, in turn, results in:

one/more/dir/file.exe?a=sth&b=sth

And finally, we want to strip off all the parameters from the end. Again, we'll use cut and this time the ? as the delimiter and tell it to give us just the first field. That brings us to the end and looks like this:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2- | \
cut -d? -f1

And the output is:

one/more/dir/file.exe

This is just another way to do it; the approach interactively whittles away the data you don't need until you are left with what you do need.

If I wanted to stuff this into a variable in a script, I'd do something like this:

#!/bin/bash

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
file_path=$(echo "${url}" | cut -d@ -f2 | cut -d/ -f2- | cut -d'?' -f1)
echo "${file_path}"
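One caveat: if the login:password@ part is absent, the first cut -d@ -f2 simply passes the whole URL through, and the later field numbers no longer line up. A sketch of a variant that skips the @ step and counts slashes instead (assuming the usual scheme://host/... shape):

url="http://example.com/one/more/dir/file.exe?a=sth&b=sth"
file_path="/$(echo "${url}" | cut -d/ -f4- | cut -d'?' -f1)"
echo "${file_path}"   # ==> /one/more/dir/file.exe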

Hope it helps.

Jim
7
url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"

GNU grep

$ grep -Po '\w\K/\w+[^?]+' <<<$url
/one/more/dir/file.exe

BSD grep

$ grep -o '\w/\w\+[^?]\+' <<<$url | tail -c+2
/one/more/dir/file.exe

ripgrep

$ rg -o '\w(/\w+[^?]+)' -r '$1' <<<$url
/one/more/dir/file.exe

To get other parts of the URL, check: Getting parts of a URL (Regex).

kenorb
  • That regex is the definition of "line noise", but it's short and works with all protocol types. – RonJohn May 31 '20 at 00:23
5

Using only bash builtins:

path="/${url#*://*/}" && [[ "/${url}" == "${path}" ]] && path="/"

What this does is:

  1. remove the prefix *://*/ (so this would be your protocol and hostname+port)
  2. check if we actually succeeded in removing anything - if not, then this implies there was no third slash (assuming this is a well-formed URL)
  3. if there was no third slash, then the path is just /

note: the quotation marks aren't actually needed here, but I find it easier to read with them in
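A quick sanity check against the question's URL (note that this keeps the query string; it can be dropped afterwards with the same %%\?* trick used in other answers here):

url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
path="/${url#*://*/}" && [[ "/${url}" == "${path}" ]] && path="/"
echo "${path%%\?*}"   # ==> /one/more/dir/file.exe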

caldfir
3

How about this:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
sed 's|.*://[^/]*/\([^?]*\)?.*|/\1|g'
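For what it's worth, this expression needs a literal ? in the URL to match at all; without one, sed prints the input unchanged. A variant that works with or without a query string (only checked against the example URL):

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
sed 's|.*://[^/]*\(/[^?]*\).*|\1|'
# ==> /one/more/dir/file.exe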
sed
2

gawk

echo "http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth" | awk -F"/" '
{
 # blank out "http:", the empty field between the two slashes, and the host part
 $1=$2=$3=""
 # strip the query string from the last field
 gsub(/\?.*/,"",$NF)
 # rebuilding $0 with OFS="/" leaves "///" in front; keep only one leading "/"
 print substr($0,3)
}' OFS="/"

output

# ./test.sh
/one/more/dir/file.exe
ghostdog74
2

If you have gawk:

$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
  gawk '$0=gensub(/http:\/\/[^/]+(\/[^?]+)\?.*/,"\\1",1)'

or

$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
  gawk -F'(http://[^/]+|?)' '$0=$2'

GNU awk can use a regular expression as the field separator (FS).
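If your gawk complains about the bare ? in that field separator (a ? with nothing to repeat is not valid ERE), a bracket expression sidesteps the problem. A sketch, only checked against the example URL:

$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
  gawk -F'(http://[^/]+|[?])' '$0=$2'
/one/more/dir/file.exe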

Hirofumi Saito
2

The Perl snippet is intriguing, and since Perl is present in most Linux distros, quite useful, but... it doesn't do the job completely. Specifically, there is a problem translating the percent-encoded UTF-8 in a URL/URI into the actual (Unicode) characters of the path. Let me give an example of the problem. The original URI may be:

file:///home/username/Music/Jean-Michel%20Jarre/M%C3%A9tamorphoses/01%20-%20Je%20me%20souviens.mp3

The corresponding path would be:

/home/username/Music/Jean-Michel Jarre/Métamorphoses/01 - Je me souviens.mp3

%20 became space, %C3%A9 became 'é'. Is there a Linux command, bash feature, or Perl script that can handle this transformation, or do I have to write a humongous series of sed substring substitutions? What about the reverse transformation, from path to URL/URI?

(Follow-up)

Looking at http://search.cpan.org/~gaas/URI-1.54/URI.pm, I first saw the as_iri method, but that was apparently missing from my Linux (or is not applicable, somehow). Turns out the solution is to replace the "->path" part with "->file". You can then break that further down using basename and dirname, etc. The solution is thus:

path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->file' )

Oddly, using "->dir" instead of "->file" does NOT extract the directory part: rather, it formats the URI so it can be used as an argument to mkdir and the like.
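As for the reverse transformation (path back to a file URI) asked about above, the same URI distribution ships a URI::file class whose new() maps a local filename to a file: URI, which should percent-encode spaces and non-ASCII bytes along the way. A sketch in the same style, not verified against every odd path:

url=$( echo "$path" | perl -MURI::file -le 'chomp($p = <>); print URI::file->new($p)' )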

(Further follow-up)

Any reason why the line cannot be shortened to this?

path=$( echo "$url" | perl -MURI -le 'print URI->new(<>)->file' )
Till
Urhixidur
1

Best bet is to find a language that has a URL parsing library:

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
path=$( echo "$url" | ruby -ruri -e 'puts URI.parse(gets.chomp).path' )

or

path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->path' )
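For the question's URL, both of these should leave the same thing in $path:

/one/more/dir/file.exe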
glenn jackman
1

I wrote a function that will extract any part of the URL. I've only tested it in bash. Usage:

url_parse <url> [url-part]

example:

$ url_parse "http://example.com:8080/home/index.html" path
home/index.html

code:

url_parse() {
  local -r url=$1 url_part=$2
  #define url tokens and url regular expression
  local -r protocol='^[^:]+' user='[^:@]+' password='[^@]+' host='[^:/?#]+' \
    port='[0-9]+' path='\/([^?#]*)' query='\?([^#]+)' fragment='#(.*)'
  local -r auth="($user)(:($password))?@"
  local -r connection="($auth)?($host)(:($port))?"
  local -r url_regex="($protocol):\/\/($connection)?($path)?($query)?($fragment)?$"
  #parse url and create an array
  IFS=',' read -r -a url_arr <<< $(echo $url | awk -v OFS=, \
    "{match(\$0,/$url_regex/,a);print a[1],a[4],a[6],a[7],a[9],a[11],a[13],a[15]}")

  [[ ${url_arr[0]} ]] || { echo "Invalid URL: $url" >&2 ; return 1 ; }

  case $url_part in
    protocol) echo ${url_arr[0]} ;;
    auth)     echo ${url_arr[1]}:${url_arr[2]} ;; # ex: john.doe:1234
    user)     echo ${url_arr[1]} ;;
    password) echo ${url_arr[2]} ;;
    host-port)echo ${url_arr[3]}:${url_arr[4]} ;; #ex: example.com:8080
    host)     echo ${url_arr[3]} ;;
    port)     echo ${url_arr[4]} ;;
    path)     echo ${url_arr[5]} ;;
    query)    echo ${url_arr[6]} ;;
    fragment) echo ${url_arr[7]} ;;
    info)     echo -e "protocol:${url_arr[0]}\nuser:${url_arr[1]}\npassword:${url_arr[2]}\nhost:${url_arr[3]}\nport:${url_arr[4]}\npath:${url_arr[5]}\nquery:${url_arr[6]}\nfragment:${url_arr[7]}";;
    "")       ;; # used to validate url
    *)        echo "Invalid URL part: $url_part" >&2 ; return 1 ;;
  esac
}
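Against the URL from the question, the call would look like this (a sketch, not run here; the awk relies on gawk's three-argument match(), and note that the path group is captured without its leading slash):

$ url_parse 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' path
one/more/dir/file.exe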
Mike
1

I agree that "cut" is a wonderful tool on the command line. However, a purer bash solution is to use bash's powerful variable expansion. For example:

pass_first_last='password,firstname,lastname'
pass=${pass_first_last%%,*}
first_last=${pass_first_last#*,}
first=${first_last%,*}
last=${first_last#*,}

or, alternatively,

last=${pass_first_last##*,}
Roger
-1

This perl one-liner works for me on the command line, so could be added to your script.

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | perl -n -e 'm{http://[^/]+(/[^?]+)};print $1'

Note that this assumes there will always be a '?' character at the end of the string you want to extract.

ire_and_curses
  • Unfortunately the ? character at the end is not always present in URLs, so I can't assume that. Ghostdog74's answer seems to be better. – Arek Jul 29 '09 at 15:57
  • I'm afraid Ghostdog74's answer also relies on the '?'. Try removing the '?' character from the url in the echo statement in that answer and you'll see that the result is incorrect. – ire_and_curses Jul 29 '09 at 18:45
  • Hmmm, I tested both answers now and both seem to produce the correct result for me :) `echo 'http://example.com/one/more/dir/file.exe' | perl -n -e 'm{http://[^/]+(/[^?]+)};print $1'` produces: /one/more/dir/file.exe. So for me it's correct. Now I have to pass that result to a bash variable and finish my script. – Arek Jul 29 '09 at 20:13
  • gsub will do nothing if there is no ?. – ghostdog74 Jul 30 '09 at 00:00