
I have a string like first url, second url, third url and would like to extract only the url after the word second in the OS X Terminal (only the first occurrence). How can I do it?

In my favorite editor I used the regex /second (url)/ and used $1 to extract it, I just don't know how to do it in the Terminal.

Keep in mind that url is an actual url, I'll be using one of these expressions to match it: Regex to match URL

fregante

4 Answers

echo 'first url, second url, third url' | sed 's/.*second//'

Edit: I misunderstood. Better:

echo 'first url, second url, third url' | sed 's/.*second \([^ ,]*\).*/\1/'

or:

echo 'first url, second url, third url' | perl -nle 'print $1 if /second ([^ ,]*)/'
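An awk alternative (a sketch, assuming the URL runs up to the next comma) needs no capture group at all; it splits the line on the word second and trims the tail:

```shell
echo 'first url, second url, third url' \
  | awk -F'second ' '{ sub(/,.*/, "", $2); print $2 }'
# url
```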
Sjoerd

Piping to another process (like the sed and perl commands suggested above) can be expensive, especially when you need to run the operation many times, since each invocation spawns a new process. Bash supports regular expressions natively:

[[ "string" =~ regex ]]

Just as you extract matches in your favourite editor with $1, $2, etc., Bash fills the BASH_REMATCH array with the matches: the whole match at index 0 and the capture groups from index 1 onwards.

In your particular example:

str="first url1, second url2, third url3"
if [[ $str =~ (second )([^,]*) ]]; then
  echo "match: '${BASH_REMATCH[2]}'"
else
  echo "no match found"
fi

Output:

match: 'url2'

Specifically, =~ supports extended regular expressions as defined by POSIX, but with platform-specific extensions (which vary in extent and can be incompatible).
On Linux platforms (GNU userland), see man grep; on macOS/BSD platforms, see man re_format.
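As a quick illustration (a sketch reusing the sample string above), ERE alternation and quantifiers work directly in =~; storing the pattern in a variable sidesteps quoting pitfalls:

```shell
#!/bin/bash
str="first url1, second url2, third url3"
re='(second|2nd) (url[0-9]+)'   # ERE: alternation and quantifiers are available
if [[ $str =~ $re ]]; then
  echo "${BASH_REMATCH[0]}"     # whole match: second url2
  echo "${BASH_REMATCH[2]}"     # second capture group: url2
fi
```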

Dmitry Shevkoplyas
  • Why would you state that "piping to another process might be very expensive"? – akauppi Jul 20 '18 at 08:10
  • @akauppi, there's a cost to starting a new process (the OS must allocate resources, run checks, set up the environment, do some disk I/O, switch context, and so on, plus some cleanup after the external process finishes). For a simple benchmark, take a 14-thousand-line ASCII CSV file: using bash's built-in regex it takes 1 second, while iterating over the lines and calling sed for each one takes 42 seconds. I need to process 1.4M lines, which would make it 100 seconds the "bash way" vs. 1 hour 10 minutes piping to sed. Feel the difference! (c) :) – Dmitry Shevkoplyas Jul 20 '18 at 14:52
  • You convinced me to revisit a code I did this week. :) – akauppi Jul 20 '18 at 17:03
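The difference described in the comments is easy to reproduce. A rough sketch (the file name and line count are made up; timings will vary by machine):

```shell
#!/bin/bash
# Generate a small sample file, one line per iteration of the format string.
printf 'first url, second url%d, third url\n' $(seq 1 1000) > /tmp/sample.txt

# One sed process per line: pays process-startup cost 1000 times.
time while IFS= read -r line; do
  sed 's/.*second \([^ ,]*\).*/\1/' <<<"$line" >/dev/null
done < /tmp/sample.txt

# In-shell matching: no extra processes spawned.
time while IFS= read -r line; do
  [[ $line =~ second\ ([^,]*) ]] && : "${BASH_REMATCH[1]}"
done < /tmp/sample.txt
```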

With the first command in the other answer you are still left with everything after the desired URL, so I propose the following solution.

echo 'first url, second url, third url' | sed 's/.*second \(url\).*/\1/'

Under sed you group an expression by escaping the parentheses around it (POSIX basic regular expressions).
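Since url in the question stands for a real URL, the same grouping works with a comma-delimited pattern (a sketch; the sample string is illustrative):

```shell
echo 'first http://a.example/1, second http://b.example/2, third http://c.example/3' \
  | sed 's/.*second \([^,]*\).*/\1/'
# http://b.example/2
```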

mhitza

What you probably forgot while trying this is the -E flag for sed.

From sed --help:

  -E, -r, --regexp-extended
                 use extended regular expressions in the script
                 (for portability use POSIX -E).

You don't have to change your regex significantly, but you do need to add .* around it so the greedy match removes the rest of the string.

This works fine for me:

echo "first url, second url, third url" | sed -E 's/.*second (url).*/\1/'

Output:

url

Here the output url is actually the second occurrence in the string. If you already know that each URL sits between a comma and a space, and that URLs themselves contain no commas, then the pattern [^,]* works fine.
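With that assumption, the comma-delimited version looks like this (same input as above):

```shell
echo "first url, second url, third url" | sed -E 's/.*second ([^,]*).*/\1/'
# url
```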

Optionally:

echo "first http://test.url/1, second ://test.url/with spaces/2, third ftp://test.url/3" \
     | sed -E 's/.*second ([a-zA-Z]*:\/\/[^,]*).*/\1/'

Which correctly outputs:

://test.url/with spaces/2
Yeti