1

How can I remove links from raw HTML text? I have:

Foo bar <a href="http://www.foo.com">blah</a> bar foo 

and want to get:

Foo bar blah bar foo

afterwards.

Chase Florell
FooBar
  • Are you working with a particular language? – spinon Jul 04 '10 at 23:08
  • Is it from a text file, with a handful of links, or is it fully generic HTML? If the latter and you just want something quick and cheap, look into `w3m -dump` or `lynx -dump`. If you want a repeatable or configurable tool, Brian's answer is right: find an HTML parser for the environment you want to use. – sarnold Jul 04 '10 at 23:16
  • @spinon - he's using "SED" [Stream Editor] - UNIX... @Marko... putting REGEX at the beginning of his question won't solve his problem – Chase Florell Jul 04 '10 at 23:16
  • yeah I figured it was something I wasn't familiar with. Because surely that wasn't a made up tag. I obviously know nothing about that. :) – spinon Jul 04 '10 at 23:18
  • Only Chuck Norris can parse HTML with a regex. – nas Jul 04 '10 at 23:26
  • @Marko: You shouldn't necessarily prefix your question title with something — that's what tags are for. – icktoofay Jul 04 '10 at 23:50

4 Answers

2

You're looking to parse HTML with regexps, which works only in the simplest cases, since HTML isn't a regular language. A much more reliable solution is to use an HTML parser. Many exist, for many different languages.
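As one possible illustration of the parser approach, here is a minimal sketch using Python's standard-library `html.parser`. It assumes Python is an acceptable tool for the job (the question itself only mentions sed). The names `LinkStripper` and `strip_links` are made up for this example. The idea is to re-emit everything the parser sees except the `<a>` start and end tags.

```python
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Re-emit the input, dropping only <a ...> and </a> tags (hypothetical helper)."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are emitted once, as written.
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_links(html):
    """Return html with link tags removed and link text kept."""
    parser = LinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_links('Foo bar <a href="http://www.foo.com">blah</a> bar foo'))
# Foo bar blah bar foo
```

Unlike a regex, this also copes with markup nested inside the link, e.g. `<a href="x"><b>bold</b> link</a>` becomes `<b>bold</b> link`. Comments and doctype declarations are not re-emitted by this sketch.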

Community
Brian Agnew
2
sed -re 's|<a [^>]*>([^<]*)</a>|\1|g'

But Brian's answer is right: This should only be used in very simple cases.
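To make "very simple cases" concrete, here is a small sketch that runs the same pattern through Python's `re` module (Python is used only for illustration; `unlink` and the sample strings are invented for this example). It shows the pattern working on the question's input and silently doing nothing once the link contains another tag.

```python
import re

# Same pattern as the sed command above, written for Python's re module.
LINK = re.compile(r"<a [^>]*>([^<]*)</a>")


def unlink(text):
    """Replace each matched link with its captured text."""
    return LINK.sub(r"\1", text)


simple = 'Foo bar <a href="http://www.foo.com">blah</a> bar foo'
nested = 'See <a href="http://www.foo.com"><b>blah</b></a> here'

print(unlink(simple))  # Foo bar blah bar foo
print(unlink(nested))  # unchanged: [^<]* cannot cross the inner <b> tag
```

The second call leaves the link in place because the capture group stops at the first `<`, so the pattern never reaches `</a>`.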

danlei
0

Try:

sed -e 's/<a[^>]*>.*<\/a>//g' test.txt
patrick
  • This would produce "Foo bar bar foo" instead of "Foo bar blah bar foo" for the example in question. See danlei's solution for the correct version. – Bolo Jul 04 '10 at 23:28
0

$ echo 'Foo bar <a href="http://www.foo.com">blah</a> bar foo' | awk 'BEGIN{RS="</a>"}/<a href/{gsub(/<a href=\042.*\042>/,"")}1'

Foo bar blah bar foo

ghostdog74