Remove links from text

Question

How can I remove links from text? I think I should use the sed command, but I don't know the exact syntax.

You should show an example of what you have and what you want. Do you mean HTML links? What do you want to do with the rest of the HTML in the file? You should use a Perl or Python lib or another tool that is specialized for manipulating HTML. Regular expressions are [insufficient](http://stackoverflow.com/q/1732348/26428#1732454). — Dennis Williamson, Nov 24 '10 at 17:22
possible duplicate of [Find Links and Remove them from HTML](http://stackoverflow.com/questions/1784507/find-links-and-remove-them-from-html) — Dennis Williamson, Nov 24 '10 at 17:24
My text looks like this: lallalalala http://blabla.com babababab http://hehehe.org. — llokely, Nov 25 '10 at 10:58
possible duplicate of [sed to remove URLs from a file](http://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file) — johnsyweb, Nov 27 '10 at 05:58

Brian Clements · Accepted Answer · 2010-11-27T04:39:31.880


This will remove any word followed by .com or .org:

    sed 's/\s\?\w\+\.\(com\|org\)//g' foo.txt

input:

    lallalalala blabla.com babababab hehehe.org.

output:

    lallalalala babababab.

EDIT: here is a version that uses the POSIX character class `[[:space:]]` instead of `\s`. I also added more characters so that it matches sub-domains and protocols (http://). The `\?`, `\+`, and `\|` operators are still GNU extensions.

    sed 's/[[:space:]]\?[A-Za-z0-9_\/\:\.-]\+\.\(com\|org\)//g' foo.txt

Also note that this does not cover all possible URL characters, or URLs that reference a resource after the domain suffix: for example.com/query?foo=bar, only example.com is removed and /query?foo=bar is left behind.
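One way around the limitation noted above, if every link in the text starts with http:// or https:// (as in the sample the asker posted in the comments), is to match on the scheme and then consume everything up to the next whitespace. The following is only a sketch under that assumption; `strip_urls` is a hypothetical helper name, and `-E` (extended regular expressions) is assumed to be available, as it is in GNU and BSD sed.

```shell
# Hypothetical helper: delete http:// and https:// URLs,
# including any path or query string that follows the host.
strip_urls() {
  # -E selects extended regular expressions, so ? and + need no backslash.
  # '#' is the s-command delimiter, so the slashes in "://" need no escaping.
  # One optional whitespace character before the URL is removed as well.
  sed -E 's#[[:space:]]?https?://[^[:space:]]+##g' "$@"
}

printf '%s\n' 'lallalalala http://blabla.com babababab http://example.com/query?foo=bar end' | strip_urls
# prints: lallalalala babababab end
```

Because the match runs to the next whitespace, punctuation glued to a URL (such as the final period in `http://hehehe.org.`) is removed along with it, and bare domains without a scheme are left untouched.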

edited Nov 27 '10 at 04:39

answered Nov 27 '10 at 03:30


Note that this also removes one whitespace character before the URL, if there is one. If this isn't desired, remove the `\s\?` part. – Brian Clements Nov 27 '10 at 03:33
It also assumes GNU sed - not necessarily invalid, but should be documented as using a non-standard extension. – Jonathan Leffler Nov 27 '10 at 03:57
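
Following up on the portability point, the first substitution from the answer can be sketched in extended regular expression syntax, which avoids the GNU-only `\s`, `\w`, `\?`, `\+`, and `\|`. This is an illustrative rewrite, not the answerer's own command: `strip_domains` is a hypothetical helper name, and it assumes a sed that accepts `-E` (GNU and BSD sed both do).

```shell
# Hypothetical helper: the answer's suffix-based substitution in ERE syntax.
strip_domains() {
  # [[:space:]] replaces \s and [[:alnum:]_] replaces \w;
  # under -E the ?, + and | operators are written without backslashes.
  sed -E 's/[[:space:]]?[[:alnum:]_]+\.(com|org)//g' "$@"
}

printf '%s\n' 'lallalalala blabla.com babababab hehehe.org.' | strip_domains
# prints: lallalalala babababab.
```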
