Perl in-place editing messes up text encoding

Question

The input is a chunk of HTML copied from a WebKit window.

It is displayed correctly in WebKit using UTF-8.

What I want to do is remove all the <img> tags. I use this one-liner:

perl -i -pe "s/<img.+?>//g"

The input is the rich text I copied to my clipboard, which another program redirects into this one-liner. It is probably something like:

echo "rich html text" | perl -i -pe "s/<img.+?>//g"

Well, it does remove the <img> tags, but all the Unicode characters get corrupted after substitution.


I am on Windows 7 with the en-US locale. The cmd code page has already been set to UTF-8. It doesn't work even if I pass the -C option.

Is there a way to keep the code as a one-liner while making it work for Unicode input?

Instead of piping into perl, send the output to stdout or a file first, to make sure it is not your first program that corrupts the content. – Mat M, Apr 23 '14 at 10:20

harmic · Answer 1 · 2014-02-14T10:38:41.663

0

You could try inserting this in your Perl one-liner:

use open ":encoding(utf8)";

You can probably add it via -M:

perl -Mopen=:encoding(utf8) -i -pe "s/<img.+?>//g"

(Thanks to @TLP for reminding me of the syntax).

edited Feb 14 '14 at 10:38

answered Feb 14 '14 at 09:07

harmic

The switch syntax for that line would be `-Mopen=:encoding(utf8)` – TLP Feb 14 '14 at 09:09
I forgot to mention that the input is not a file; it's the text I copied to my clipboard and piped into this one-liner by another program. I tried `perl -i -pe "use open ':encoding(utf8)'; s///g"`, but it doesn't even do the replacement. I don't know what is wrong. :( – Sawyer Feb 14 '14 at 09:43

score 0 · Answer 2 · answered Feb 14 '14 at 09:20

0

perl -COE -i -pe "s/<img.+?>//g" input should work. Note that, per perlrun, -COE turns on UTF-8 for STDOUT and STDERR; to cover both STDIN and STDOUT, use -CIO (or -CS for all three standard streams).

See perldoc perlrun for more details.

answered Feb 14 '14 at 09:20

mirod

It doesn't work. I tried all the -C options; it makes no difference. – Sawyer Feb 14 '14 at 09:31
Even -CD? I realized that you weren't working on STDIN/STDOUT, but rather on a file. – mirod Feb 14 '14 at 10:16
