
I have a string like first url, second url, third url and would like to extract only the url after the word second in the OS X Terminal (only the first occurrence). How can I do it?

In my favorite editor I used the regex /second (url)/ and used $1 to extract it, I just don't know how to do it in the Terminal.

Keep in mind that url is an actual url, I'll be using one of these expressions to match it: Regex to match URL

fregante

4 Answers

echo 'first url, second url, third url' | sed 's/.*second//'

Edit: I misunderstood. Better:

echo 'first url, second url, third url' | sed 's/.*second \([^ ,]*\).*/\1/'

or:

echo 'first url, second url, third url' | perl -nle 'print $1 if /second ([^ ,]*)/'
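An awk alternative (a sketch, assuming the URL runs up to the next comma) needs no capture group at all; it splits the line on the word second and trims the tail:

```shell
echo 'first url, second url, third url' \
  | awk -F'second ' '{ sub(/,.*/, "", $2); print $2 }'
# url
```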
Sjoerd

Piping to another process (like the sed and perl commands suggested above) can be expensive, especially when you need to run the operation many times, since each invocation spawns a new process. Bash supports regular expressions natively:

[[ "string" =~ regex ]]

Just as you extract matches in your favourite editor with $1, $2, etc., Bash fills the BASH_REMATCH array with the matches: the whole match at index 0 and the capture groups from index 1 onwards.

In your particular example:

str="first url1, second url2, third url3"
if [[ $str =~ (second )([^,]*) ]]; then
  echo "match: '${BASH_REMATCH[2]}'"
else
  echo "no match found"
fi

Output:

match: 'url2'

Specifically, =~ supports extended regular expressions as defined by POSIX, but with platform-specific extensions (which vary in extent and can be incompatible).
On Linux platforms (GNU userland), see man grep; on macOS/BSD platforms, see man re_format.
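As a quick illustration (a sketch reusing the sample string above), ERE alternation and quantifiers work directly in =~; storing the pattern in a variable sidesteps quoting pitfalls:

```shell
#!/bin/bash
str="first url1, second url2, third url3"
re='(second|2nd) (url[0-9]+)'   # ERE: alternation and quantifiers are available
if [[ $str =~ $re ]]; then
  echo "${BASH_REMATCH[0]}"     # whole match: second url2
  echo "${BASH_REMATCH[2]}"     # second capture group: url2
fi
```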

Dmitry Shevkoplyas
  • Why would you state that "piping to another process might be very expensive"? – akauppi Jul 20 '18 at 08:10
  • @akauppi, there's a cost to starting a new process (the OS must allocate resources, run checks, set up the environment, do some disk I/O, switch context, and so on, plus some cleanup after the external process finishes). For a simple benchmark, take a 14-thousand-line ASCII CSV file: using bash's built-in regex it takes 1 second, while iterating over the lines and calling sed for each one takes 42 seconds. I need to process 1.4M lines, which would make it 100 seconds the "bash way" vs. 1 hour 10 minutes piping to sed. Feel the difference! (c) :) – Dmitry Shevkoplyas Jul 20 '18 at 14:52
  • You convinced me to revisit a code I did this week. :) – akauppi Jul 20 '18 at 17:03
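The difference described in the comments is easy to reproduce. A rough sketch (the file name and line count are made up; timings will vary by machine):

```shell
#!/bin/bash
# Generate a small sample file, one line per iteration of the format string.
printf 'first url, second url%d, third url\n' $(seq 1 1000) > /tmp/sample.txt

# One sed process per line: pays process-startup cost 1000 times.
time while IFS= read -r line; do
  sed 's/.*second \([^ ,]*\).*/\1/' <<<"$line" >/dev/null
done < /tmp/sample.txt

# In-shell matching: no extra processes spawned.
time while IFS= read -r line; do
  [[ $line =~ second\ ([^,]*) ]] && : "${BASH_REMATCH[1]}"
done < /tmp/sample.txt
```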

With the first command in the other answer you are still left with everything after the desired URL, so I propose the following solution.

echo 'first url, second url, third url' | sed 's/.*second \(url\).*/\1/'

Under sed you group an expression by escaping the parentheses around it (POSIX basic regular expressions).
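Since url in the question stands for a real URL, the same grouping works with a comma-delimited pattern (a sketch; the sample string is illustrative):

```shell
echo 'first http://a.example/1, second http://b.example/2, third http://c.example/3' \
  | sed 's/.*second \([^,]*\).*/\1/'
# http://b.example/2
```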

mhitza

What you probably forgot while trying this is the -E flag for sed.

From sed --help:

  -E, -r, --regexp-extended
                 use extended regular expressions in the script
                 (for portability use POSIX -E).

You don't have to change your regex significantly, but you do need to add .* around it so the greedy match removes the rest of the string.

This works fine for me:

echo "first url, second url, third url" | sed -E 's/.*second (url).*/\1/'

Output:

url

Here the output url is actually the second occurrence in the string. If you already know that each URL sits between a comma and a space, and that URLs themselves contain no commas, then the pattern [^,]* works fine.
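With that assumption, the comma-delimited version looks like this (same input as above):

```shell
echo "first url, second url, third url" | sed -E 's/.*second ([^,]*).*/\1/'
# url
```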

Optionally:

echo "first http://test.url/1, second ://test.url/with spaces/2, third ftp://test.url/3" \
     | sed -E 's/.*second ([a-zA-Z]*:\/\/[^,]*).*/\1/'

Which correctly outputs:

://test.url/with spaces/2
Yeti