
I am trying to write a sed expression that removes URLs from a file.

Example:

http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)   

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor @kdpartak :)   

But I don't get it:

sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile  

FIXED!!!!!

This handles almost all cases, even malformed URLs:

sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more
daydreamer
  • When working with URLs, file paths, etc., I prefer using "|" as the sed separator so I don't have to escape /. Example: sed 's|/path/to/some/file/|/newpath/to/new/file/|g' – Nov 26 '10 at 09:55
  • @JP19, like it, would try this out – daydreamer Nov 26 '10 at 22:38

2 Answers


The following removes http:// or https:// and everything up until the next space:

sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile  
 updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)   

Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N  Thx to HMB Contributor @kdpartak :)

Edit:

I should have used:

sed -e 's!http[s]\?://\S*!!g' posFile

"[s]\?" is a far more readable way of writing "an optional s" compared to "\(s\)\{0,1\}"

"\S*" a more readable version of "any non-space characters" than "[^[:space:]]*"

I must have been using the sed that came installed with my Mac at the time I wrote this answer (brew install gnu-sed FTW).


There are better URL regular expressions out there (those that take into account schemes other than HTTP(S), for instance), but this will work for you, given the examples you give. Why complicate things?
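
For readers without GNU sed, here is a minimal sketch of a more portable variant. It assumes a sed that accepts -E (extended regular expressions), which both GNU sed and the BSD sed shipped with macOS do. The function name strip_urls is hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# strip_urls (hypothetical helper name): delete http:// and https:// URLs
# from the named files, or from standard input when no file is given.
# Assumes a sed that supports -E (extended regular expressions).
strip_urls() {
    # "!" is only the delimiter of the s command, so "/" needs no escaping.
    # "https?" matches http or https; "[^[:space:]]*" consumes the rest
    # of the URL up to the next whitespace character.
    sed -E 's!https?://[^[:space:]]*!!g' "$@"
}

# Usage: filter a shortened version of one sample line from the question.
printf '%s\n' "B&N https://hollywoodmomblog.com/?p=2442 Thx" | strip_urls
```

As in the commands above, only the URL itself is removed, so the spaces on either side of it stay in the output.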

johnsyweb
  • Johnsyweb could you please explain your sed expression? Particularly the {0,1} notation. – minerals Apr 11 '13 at 13:00
  • Thanks for the Mac comment. I was testing what was totally valid regex on my mac for 10 mins before I read your answer and tried it on a centos box, where it worked first time. – seeafish Oct 21 '16 at 11:42
  • For anyone wondering about the `'s! ... !!g'` bit in the edited answer, it just appears to be a way of escaping the enclosed text. From my tests, `sed -e 's!http[s]\?://\S*!!g'` appears to be the same as `sed -e 's/http[s]\?:\/\/\S*//g'` – Victoria Stuart Dec 14 '17 at 21:26

The accepted answer provides the approach that I used to remove URLs, etc. from my files. However it left "blank" lines. Here is a solution.

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file

perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file

The GNU sed flags and expressions used are:

-i    Edit in-place
-e    [-e script] --expression=script : basically, add the commands in script
      (expression) to the set of commands to be run while processing the input
 ^    Match start of line
 $    Match end of line


 \?   Match zero or one of the preceding regular expression
{2,}  Match 2 or more of the preceding regular expression
\S*   Zero or more non-whitespace characters; alternative to: [^[:space:]]*

However,

sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'

leaves empty lines behind: each URL is removed, but the newline that terminated its line remains. Standard sed-based approaches to strip leading and trailing tabs and spaces, e.g.

sed -i 's/^[ \t]*//; s/[ \t]*$//'

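do not work here, because they only trim whitespace within a line. sed reads input one line at a time and removes the trailing newline before applying the script, so an s command never sees the newline unless lines are first joined (for example, with N and a "branch label").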

One solution is to use the following perl expression:

perl -i -pe 's/^'`echo "\012"`'${2,}//g'

which uses a shell substitution, '`echo "\012"`', to replace an octal value, \012 (i.e., a newline, \n), that occurs 2 or more times, {2,} (otherwise we would unwrap all lines), with something else; here: //, i.e., nothing.

[The second reference below provides a wonderful table of these values!]

The perl flags used are:

-p  Places a printing loop around your command,
    so that it acts on each line of standard input

-i  Edit in-place

-e  Allows you to provide the program as an argument,
    rather than in a file
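
As an alternative to the perl step, the blank lines can be dropped in the same sed invocation by deleting any line that is empty or whitespace-only once the URLs are gone. This is a minimal sketch, assuming a sed that accepts -E; the function name strip_urls_and_blanks is hypothetical. Note that it also deletes lines that were already blank in the input.

```shell
#!/bin/sh
# strip_urls_and_blanks (hypothetical helper name): remove http(s)://,
# www. and ftp: URLs, then delete every line that is empty or contains
# only whitespace. Reads the named files, or standard input if none.
# Assumes a sed that supports -E (extended regular expressions).
strip_urls_and_blanks() {
    # The first three expressions delete the URLs; the last one uses the
    # d command to delete the whole line, newline included.
    sed -E \
        -e 's!https?://[^[:space:]]*!!g' \
        -e 's!www\.[^[:space:]]*!!g' \
        -e 's!ftp:[^[:space:]]*!!g' \
        -e '/^[[:space:]]*$/d' \
        "$@"
}

# Usage: a shortened version of the example input used in this answer.
printf '%s\n' 'Some text ...' \
    'https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file' \
    'ftp://ftp.ncbi.nlm.nih.gov/' \
    'Some more text.' | strip_urls_and_blanks
```

Because d removes the line rather than substituting inside it, no line joining or branch label is needed.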

References:


Example:

$ cat url_test_input.txt

Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.

$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a

$ cat a

Some text ...










Some more text.

$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a

Some text ...
Some more text.

$ 
Victoria Stuart