I am trying to make the following regular expression work in a sed command in bash.

^[^<]?(https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&\/\/=]*))[^>]?$

I know the regular expression is correct: I tested it on an online regular expression tester and it works as I expect, so no help is needed with that.

A demo of the above regex can be found here.

My requirement: I want to enclose every URL in <>. If a URL is already enclosed, it should be written to the output unchanged, as can be seen in the regex demo linked above.

Sample Input (in the file named websites.txt):

// List of all legal urls
https://www.google.com/
https://www.fakesite.co.in
https://www.fakesite.co.uk
<https://www.fakesite.co.uk>
<https://www.google.com/>

Expected Output:(in the file named output.txt)

<https://www.google.com/> // Please notice every url is enclosed in the <>.
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk> // Please notice if the url is already enclosed in <> then it is appended as it is.
<https://www.google.com/>

What I tried in sed:

  1. Since I'm not well versed in bash commands, I was previously unable to capture the group properly in sed. After reading this answer, I figured out that the parentheses need to be escaped in order to capture a group.

  2. I read somewhere that look-arounds are not supported in sed (GNU), so I removed the look-arounds too, but that didn't work either. Without look-arounds, I used this regex and it served my purpose.

  3. This is my latest attempt with the sed command:

    sed 's@^[^<]?(https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&/=]*))[^>]?$@<\1>@gm;t;d' websites.txt > output.txt
    

My exact problem:

How can I make the above command work properly? If you run the command from point 3 above, you'll see that it does not replace the contents properly: it just dumps the contents of websites.txt into output.txt. In the regex demo linked above, however, it works properly, i.e. it encloses all the unenclosed URLs in <>. Any suggestions would be helpful. I would prefer a sed solution, but if possible, could the command also be converted to awk? I'd be much obliged for help with that too. Thanks.

  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/216358/discussion-on-question-by-mandy8055-unable-to-make-the-mentioned-regular-express). – Samuel Liew Jun 21 '20 at 08:25

3 Answers

After working on it for a long time, I got my sed command to work. Below is the command that worked.

sed -E 's@^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))[^>]?$@<\1>@gm;t' websites.txt > output.txt

You can find a sample implementation of the command here.

Since the regex already fulfils the requirement of the person I'm writing this for, I only needed help with the command syntax (although any improvements are heartily welcome); I want the command to work with the same regular expression pattern.

Things I was previously unaware of and have learnt now:

  1. I didn't know anything about the -E flag. Now I know that -E uses POSIX "extended" syntax ("ERE"). Thanks to @GordonDavisson and @Sundeep. Further reading.

  2. I wasn't sure whether sed supports look-arounds. Now I know that it doesn't. Thanks to @dmitri-chubarov. Further reading.

  3. I didn't know that sed doesn't support non-capturing groups either. Thanks to @Sundeep for solving this part. Further reading.

  4. I didn't know about GNU sed as a specific command-line tool. Thanks to @oguzismail for this. Further reading.
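As a rough illustration of the first point, the sketch below runs the same pattern with and without -E. The example URL and the replacement text are made up for the demonstration and are not part of the original command.

```shell
# Default syntax (BRE): ( ) and ? are ordinary characters, so this pattern
# looks for the literal text "(https?)" and leaves the line unchanged.
bre_out=$(printf '%s\n' 'https://example.com' | sed 's/(https?)/X/')

# With -E (ERE): ( ) capture a group and ? makes the preceding "s" optional.
ere_out=$(printf '%s\n' 'https://example.com' | sed -E 's/(https?)/<\1>/')

printf '%s\n' "$bre_out" "$ere_out"
# prints:
#   https://example.com
#   <https>://example.com
```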

  • I would extend my hearty regards to @GordonDavisson, Sundeep, dmitri-chubarov, oguzismail, RavinderSingh13 for helping me out with this solution. –  Jun 21 '20 at 10:22
With respect to the command in your answer:

sed -E 's@^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))[^>]?$@<\1>@gm;t'

Here are a few notes:

Your posted sample input has one URL per line, so AFAIK the gm;t at the end of your sed command is doing nothing useful; either your input is inadequate or your script is wrong.
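One way to check this is to run a substitution with and without the trailing gm;t on line-per-URL input and compare the results. The sketch below assumes GNU sed (the m flag is a GNU extension) and uses a simplified stand-in pattern rather than the regexp from the answer.

```shell
# Two sample lines: one bare URL and one already enclosed in <>.
input='https://www.google.com/
<https://www.google.com/>'

# Simplified stand-in pattern (an assumption, not the answer's regexp).
with_flags=$(printf '%s\n' "$input" | sed -E 's@^(https?://.*[^>])$@<\1>@gm;t')
without_flags=$(printf '%s\n' "$input" | sed -E 's@^(https?://.*[^>])$@<\1>@')

# sed processes one line at a time here, so the flags change nothing.
[ "$with_flags" = "$without_flags" ] && echo 'identical output'
```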

The hard-coded ranges a-z, A-Z, and 0-9 include different characters in different locales. If you meant to include all (and only) lower case letters, upper case letters, and digits then you should replace a-zA-Z0-9 with the POSIX character class [:alnum:]. So either change to use a locale-independent character class or specify the locale you need on your command line, depending on your requirements for which characters to match in your regexp.
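Both options can be combined, as in the minimal sketch below. The helper name wrap_alnum and the sample strings are hypothetical and only serve to show the syntax.

```shell
# [[:alnum:]] is the character class [:alnum:] used inside a bracket expression.
# LC_ALL=C pins the locale so the class means ASCII letters and digits only.
wrap_alnum() {
  LC_ALL=C sed -E 's/^[[:alnum:]]+$/<&>/'
}

printf '%s\n' 'abc123' 'abc-123' | wrap_alnum
# prints:
#   <abc123>
#   abc-123
```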

Like most characters, the character + is literal inside a bracket expression so it shouldn't be escaped - change \+ to just +.
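A side effect of the escape is that the backslash itself becomes part of the set, because a backslash is also literal inside a bracket expression. The sample strings below are made up to show the difference.

```shell
# [\+] matches either a backslash or a plus sign.
escaped=$(printf '%s\n' 'a\b' 'a+b' | sed -E 's/[\+]/X/')

# [+] matches only the plus sign.
plain=$(printf '%s\n' 'a\b' 'a+b' | sed -E 's/[+]/X/')

printf '%s\n' "$escaped" "$plain"
# prints:
#   aXb
#   aXb
#   a\b
#   aXb
```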

The bracket expression [^<]? means "1 or 0 occurrences of any character that is not a <" and similarly for [^>]? so if your "url" contained random characters at the start/end it'd be accepted, e.g.:

echo 'xhttp://foo.bar%' | sed -E 's@^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))[^>]?$@<\1>@gm;t'
<http://foo.bar%>

I think you meant to use <? and >? instead of [^<]? and [^>]?.
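A minimal sketch of that change follows. The helper name wrap_urls is hypothetical, and the URL part is deliberately simplified to "anything without <, > or whitespace", which is an assumption and not the regexp from the answer.

```shell
# Optional < and > at the ends; the URL itself is captured in group 1.
wrap_urls() {
  sed -E 's@^<?(https?://[^<>[:space:]]+)>?$@<\1>@'
}

printf '%s\n' 'https://www.google.com/' '<https://www.fakesite.co.uk>' 'xhttp://foo.bar%' | wrap_urls
# prints:
#   <https://www.google.com/>
#   <https://www.fakesite.co.uk>
#   xhttp://foo.bar%
```

The line with the stray leading x no longer matches, so it is passed through unchanged instead of being wrapped.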

Your regexp would allow a "url" that has no letters:

echo 'http://=.9' | gsed -E 's@^[^<]?(https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&=]*))[^>]?$@<\1>@gm;t'
<http://=.9>

If you edit your question to provide more truly representative sample input and expected output (including cases you do not want to match), then we can help you. However, based on a quick google of what a valid URL is, it looks like there are several valid URLs that'd be disallowed by your regexp and several invalid ones that'd be allowed, so you might want to ask about that in a question tagged with url or similar. With the tags you currently have we can help you implement your regexp, but there may be better people to help with defining it.

Ed Morton
  • *So either change to use a locale-independent character class* Can you elaborate on this? Do you mean like `[abcdefghijklmno...]`? – oguz ismail Jun 21 '20 at 14:59
  • @oguzismail No, in POSIX terminology (which is far clearer than the muddy pre-POSIX terms) that's a character list inside a bracket expression, not a character class. A character class would be `[:alnum:]` and if used inside a bracket expression it'd be `[[:alnum:]]`. I'm just saying if you want to use `a-z` to mean something specific then set LC_ALL or similar to define what it means otherwise your code will behave differently in different locales. – Ed Morton Jun 21 '20 at 15:01
  • I got you @EdMorton.Yes there are other cases also which I found. I mentioned this to my client and also asked to use `perl` instead of `sed` for creating more failproof regex. I am waiting for his approval. I can fix the regex myself. But then for more edge cases(**which you mentioned**) I might require look-arounds. I exactly got your points. Thanks a lot for this answer. One new thing I learnt; is to use `[[:alnum:]]` for locale specific values. I'll surely keep this in mind for future purposes. Thanks a lot. If he accepts using perl; I'll surely share the updated regex with you. –  Jun 22 '20 at 04:24
  • You're welcome. I don't know why using perl would make things any easier but if your client accepts using perl and you create a PCRE then I won't be of any further help as I stick to standard UNIX tools (e.g. sed and awk) with BREs and EREs so I wouldn't know what a perl script or a PCRE meant well enough to be able to offer any advice. Others could, of course. Good luck with whatever you end up with. – Ed Morton Jun 22 '20 at 04:36
  • FWIW I just googled "regular expression to match a url" and found this, in case it helps: https://stackoverflow.com/q/161738/1745001 – Ed Morton Jun 22 '20 at 04:42
  • Sir @EdMorton `perl` supports lookarounds, as well as non-capturing groups in regex that's why I thought; it will be a better alternative for writing strong REs. I might be wrong though. Please correct me if that is the case. And your second link is very amazing. I'll read all answers there and figure out the best one. And sir your answers are very descriptive; thanks for enlightening me. –  Jun 22 '20 at 05:13
If the input file is just a comment followed by a list of URLs, try:

sed '1d;s/^[^<]/<&/;s/[^>]$/&>/' websites.txt

Output:

<https://www.google.com/>
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk>
<https://www.google.com/>
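Since the question also asks about awk, the same two substitutions could be sketched roughly as follows. The function name wrap_with_awk is hypothetical, and like the sed one-liner it assumes the first line is a comment to be skipped.

```shell
# Reads the named files, or standard input when no file is given.
wrap_with_awk() {
  awk '
    NR == 1  { next }            # skip the leading comment line
    /^[^<]/  { $0 = "<" $0 }     # prepend < when the line starts with another character
    /[^>]$/  { $0 = $0 ">" }     # append > when the line ends with another character
             { print }
  ' "$@"
}
```

It could then be called as wrap_with_awk websites.txt > output.txt.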
agc
  • @Mandy8055, Re *"...to tamper..."*: Stack Overflow is about learning code with open source examples. Learning code turns students into programmers, or authors of code. Programmers of open source have **author**ity to change code as they please -- whether the resulting code itself is buggy or brilliant, that's the opposite of tampering. – agc Jun 21 '20 at 13:32