How can I remove links from text? I think I should use the sed command, but I don't know the exact syntax.
-
You should show an example of what you have and what you want. Do you mean HTML links? What do you want to do with the rest of the HTML in the file? You should use a Perl or Python lib or another tool that is specialized for manipulating HTML. Regular expressions are [insufficient](http://stackoverflow.com/q/1732348/26428#1732454). – Dennis Williamson Nov 24 '10 at 17:22
-
possible duplicate of [Find Links and Remove them from HTML](http://stackoverflow.com/questions/1784507/find-links-and-remove-them-from-html) – Dennis Williamson Nov 24 '10 at 17:24
-
My text looks like this: lallalalala http://blabla.com babababab http://hehehe.org. – llokely Nov 25 '10 at 10:58
-
possible duplicate of [sed to remove URLs from a file](http://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file) – johnsyweb Nov 27 '10 at 05:58
1 Answer
This will remove everything ending in `.com` or `.org`:
sed 's/\s\?\w\+\.\(com\|org\)//g' foo.txt
input:
lallalalala blabla.com babababab hehehe.org.
output:
lallalalala babababab.
EDIT: here it is using POSIX character classes. I also added some more characters to match cases where there may be subdomains or protocols (`http://`):
sed 's/[[:space:]]\?[A-Za-z0-9_\/\:\.-]\+\.\(com\|org\)//g' foo.txt
Also note that this does not cover all possible URL characters, or URLs that reference a resource after the domain suffix (`example.com/query?foo=bar`).
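For text like the example in the question's comments, where every link starts with `http://`, a rough sketch is to match on the scheme instead of the domain suffix and drop everything up to the next whitespace. This assumes a sed that supports `-E` for extended regular expressions (GNU and BSD sed both do); `strip_urls` is a hypothetical helper name used only for illustration. Extended syntax also avoids the backslashed `\?`, `\+`, and `\|` operators, which are GNU extensions in basic regular expressions.

```shell
# Hypothetical helper: strip http:// and https:// URLs read from stdin.
# Assumes sed supports -E (extended regular expressions).
strip_urls() {
  # [[:space:]]?      optionally consume one whitespace character before the URL
  # https?://         the scheme, http or https
  # [^[:space:]]+     everything up to the next whitespace
  sed -E 's#[[:space:]]?https?://[^[:space:]]+##g'
}

# Example input taken from the asker's comment:
printf '%s\n' 'lallalalala http://blabla.com babababab http://hehehe.org.' | strip_urls
```

Because the match runs to the next whitespace, paths and query strings are covered, but punctuation attached to a URL (such as the final period in the example) is removed as well.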

Brian Clements
-
Note that this also removes one whitespace character before the URL if it exists. If this isn't desired, remove the `\s\?` part. – Brian Clements Nov 27 '10 at 03:33
-
It also assumes GNU sed - not necessarily invalid, but should be documented as using a non-standard extension. – Jonathan Leffler Nov 27 '10 at 03:57