3

The input is a chunk of HTML copied from a WebKit window.

It's displayed correctly in WebKit using UTF-8.

What I want to do is remove all the <img> tags. I use this one-liner:

perl -i -pe "s/<img.+?>//g"

The input is the rich text I copied to my clipboard, which another program pipes into this one-liner. It's probably something like:

echo "rich html text" | perl -i -pe "s/<img.+?>//g"

Well, it does remove the <img> tags, but all the Unicode characters get corrupted after substitution.

I am on Windows 7, locale en-US. The cmd code page has already been set to UTF-8. It doesn't work even if I pass the -C option.

Is there a way to keep the code as a one-liner while making it work for Unicode input?

Sawyer

2 Answers

0

You could try inserting this in your Perl one-liner:

use open ":encoding(utf8)";

You can probably add it via -M:

perl -Mopen=:encoding(utf8) -i -pe "s/<img.+?>//g"

(Thanks to @TLP for reminding me of the syntax).

See also the open pragma

harmic
  • The switch syntax for that line would be `-Mopen=:encoding(utf8)` – TLP Feb 14 '14 at 09:09
  • I forgot to mention, the input is not a file; it's the text I copied to my clipboard, piped into this one-liner by another program. I tried this `perl -i -pe "use open ':encoding(utf8)'; s///g"`, but it doesn't even do the replacement, and I don't know what's wrong. :( – Sawyer Feb 14 '14 at 09:43
0

perl -CIO -i -pe "s/<img.+?>//g" input should work: the -CIO option treats both STDIN and STDOUT as UTF-8 (-CS also covers STDERR).

See perldoc perlrun for more details.

mirod