2

I have the following regex.

/http:\/\/([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9\-]+:[a-zA-Z0-9\-]+\/[a-zA-Z]+\.[a-zA-Z]+/g

It identifies the matching URLs (https://regex101.com/r/sG9zR7/1). I need to use it on the command line so that it prints out the results, so I modified it to the following:

sed -n 's/.*\(http:\/\/\([a-zA-Z0-9\-]+\.\)+[a-zA-Z0-9\-]+:[a-zA-Z0-9\-]+\/[a-zA-Z]+\.[a-zA-Z]+\).*/\1/p' filename 

(I tried to bold the characters I added but could not.) The additions were as follows:

sed -n 's/.*\( (at the beginning)

\ (before each inner parenthesis)

\).*/\1/p' filename (at the end)

However, I get no results when I execute it.

hjpotter92
user68650
  • See http://stackoverflow.com/questions/29613304/is-it-possible-to-escape-regex-metacharacters-reliably-with-sed/29626460#29626460 and post some testable sample input and expected output. Also, you do not need to escape `-` at the start or end of a bracket expression and you should be using POSIX character classes instead of hard-coded character ranges (which are locale-dependent) so your regexp should be `/http:\/\/([[:alnum:]-]+\.)+[[:alnum:]-]+:[[:alnum:]-]+\/[[:alpha:]]+\.[[:alpha:]]+/g` and note that `+` requires EREs so sed will need the `-r` flag or escape every `+`: `\+`. – Ed Morton Sep 19 '15 at 18:41

3 Answers

1

Make it a habit to use a delimiter other than / when dealing with URLs. It makes the pattern easier to read.

sed -r -n 's~.*\(http://\([a-z0-9\-]+\.\)+[a-z0-9\-]+:[a-z0-9\-]+/[a-z]+\.[a-z]+\).*~\1~ip' file

Note that I use the i modifier to ignore case.

As hwnd comments, you should also pass the -r flag to sed, since your pattern needs + to be treated as a quantifier rather than a literal character.
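The effect of -r on escaping can be sketched as follows. This is only an illustration, assuming GNU sed: the sample line and the function names extract_ere and extract_bre are hypothetical, since the question did not include test input. With -r, grouping parentheses and + are written bare (an escaped \( would match a literal parenthesis); without -r, groups are written \( \) and \+ is a GNU extension.

```shell
# Hypothetical sample input; the original post did not include any.
sample='<a href="http://media.example.com:8080/index.html">listen</a>'

# ERE (-r): ( ) and + are special when left unescaped.
extract_ere() {
  sed -rn 's~.*(http://([a-z0-9-]+\.)+[a-z0-9-]+:[a-z0-9-]+/[a-z]+\.[a-z]+).*~\1~ip'
}

# BRE (no -r): groups are \( \), and \+ is a GNU sed extension.
extract_bre() {
  sed -n 's~.*\(http://\([a-z0-9-]\+\.\)\+[a-z0-9-]\+:[a-z0-9-]\+/[a-z]\+\.[a-z]\+\).*~\1~ip'
}

# Both calls should print the same URL for this sample line.
printf '%s\n' "$sample" | extract_ere
printf '%s\n' "$sample" | extract_bre
```

Because ~ is the delimiter, the slashes in the URL need no escaping in either form.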

hjpotter92
  • I like the "i"; I was unaware of that. The regex seems to work, with the exception of the misses, which I understand. As is, I get 192 matches. I have to see what the ?: does; that was all that was added to the expression. The new expression returns 0 hits. – user68650 Sep 19 '15 at 18:31
  • @user68650 Ignore the `?:` in the pattern. It got carried over from my test cases, and I have removed it above. – hjpotter92 Sep 19 '15 at 18:33
  • I had to start over with the guidance provided. The sample above did not work when I applied -r; it has something to do with when and where the escape sequence is applied, so I had to test each one. This is the working command: sed -rn 's~.*(http://([a-z0-9\-]+.)*[a-z0-9\-]+:[0-9]+\/[a-z0-9]+.[a-z]+).*~\1~ip' Filename – user68650 Sep 19 '15 at 21:01
0

sed -rn 's~.*(http://([a-z0-9\-]+.)*[a-z0-9\-]+:[0-9]+\/[a-z0-9]+.[a-z]+).*~\1~ip' Filename is the working command. With the help of the sample supplied (thank you, hjpotter92) I was able to figure out that the escape character did not need to be applied to certain characters. I will have to find out when and how it is applied when using the -r option.
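One thing worth checking in the command above is the bare dot: even with -r, an unescaped . matches any character, so a literal dot still needs \. under both regex flavours. The sketch below compares the command as posted with a tightened variant. It assumes GNU sed; the two input lines and the function names loose and strict are hypothetical.

```shell
# Hypothetical input; the second line has "_" where a literal dot is expected.
input='see http://localhost:8080/index.html
see http://localhost:8080/index_html'

# Pattern as posted: the bare "." matches any character.
loose() {
  sed -rn 's~.*(http://([a-z0-9\-]+.)*[a-z0-9\-]+:[0-9]+\/[a-z0-9]+.[a-z]+).*~\1~ip'
}

# Tightened sketch: "\." for literal dots, "-" placed last inside the
# brackets without a backslash, and "/" unescaped because "~" is the delimiter.
strict() {
  sed -rn 's~.*(http://([a-z0-9-]+\.)*[a-z0-9-]+:[0-9]+/[a-z0-9]+\.[a-z]+).*~\1~ip'
}

printf '%s\n' "$input" | loose    # prints both URLs
printf '%s\n' "$input" | strict   # prints only the index.html URL
```

Whether the stricter form is wanted depends on the data; if the looser pattern never over-matches on the real file, the posted command is sufficient.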

user68650
0

You can achieve the same with an XPath query via xidel:

xidel file.html -e '//a/@href[fn:matches(.,"http://[^/]*:")]/fn:substring-after(.,"=")'
Casimir et Hippolyte