UNIX/Linux shell script: Removing variant form emoji from a text

Question

Suppose you are using a Linux/UNIX shell whose locale uses UTF-8:

$ echo $LANG
en_US.UTF-8

You have a text file, emoji.txt, which is encoded in UTF-8:

$ file -i ./emoji.txt
./emoji.txt: text/plain; charset=utf-8

This text file contains the same emoji twice: once on its own and once followed by U+FE0F (VARIATION SELECTOR-16), referred to below as the variant form sequence:

$ cat     ./emoji.txt
Standard ☁
Variant form ☁️

$ uni2ascii -a B -q ./emoji.txt
Standard \x2601
Variant form \x2601\xFE0F

You want to remove both emoji, including the variant form character (\xFE0F), so that the output is:

Standard 
Variant form

How would you do this?

Update: This question is not about removing the last word of every line. Imagine a file emoji2.txt containing a long text with many emoji characters, some of which are followed by the variant form sequence.

This [answer](https://stackoverflow.com/a/67495684/1836776) may also help — Marc Durdin, May 11 '21 at 23:55

M. Nejat Aydin · Accepted Answer · 2020-08-13T16:56:03.787

1

With GNU sed and bash:

  sed -E s/$'\u2601\uFE0F?'//g emoji.txt
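The command above depends on the shell expanding `\u` escapes inside `$'...'`. As a rough sketch for shells without that feature, the same substitution can be written against the raw UTF-8 bytes (E2 98 81 for U+2601, EF B8 8F for U+FE0F) and run in the C locale. The function name `strip_cloud` is made up for this illustration and is not part of the original answer.

```shell
# Hypothetical helper: remove U+2601 (cloud) and an optional trailing
# U+FE0F (variation selector-16) by matching their UTF-8 bytes.
strip_cloud() {
    # printf turns the octal escapes into raw bytes:
    #   \342\230\201 = E2 98 81 = U+2601
    #   \357\270\217 = EF B8 8F = U+FE0F
    # \{0,1\} is the basic-regex spelling of "optional".
    script=$(printf 's/\342\230\201\\(\357\270\217\\)\\{0,1\\}//g')
    # LC_ALL=C makes sed treat its input as plain bytes.
    LC_ALL=C sed "$script"
}
```

It would be called as `strip_cloud < emoji.txt`. Like the original command, it only handles this one emoji.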

edited Aug 13 '20 at 16:56

answered Aug 10 '20 at 21:46

M. Nejat Aydin


In Z shell (zsh), replace `?` with `\?`. – Culip Aug 13 '20 at 16:41

@Culip You're right. It is better to put the `?` inside the `$' '` in `bash` as well. Otherwise it will be interpreted as a filename matching pattern. I've corrected it. – M. Nejat Aydin Aug 13 '20 at 16:55

score 1 · Answer 2 · answered Aug 10 '20 at 22:01

You can use awk, like this:

$ cat emo.ascii 
Standard \x2601
Variant form \x2601\xFE0F
$ ascii2uni -a B emo.ascii                                  
Standard ☁
Variant form ☁️
3 tokens converted # note: this is stderr
$ ascii2uni -a B emo.ascii | awk -F' ' '{NF--}1' | cat -A 
3 tokens converted # note: this is stderr
Standard$
Variant form$

NF-- decrements the field count, so awk rebuilds the line without its last field (this works in GNU awk; other awk implementations may not rebuild the record). The trailing 1 evaluates to true, which makes awk print the modified line.

(Used cat -A here only to show that there aren't any invisible characters left)
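Because the update to the question rules out simply dropping the last field, here is a hedged sketch of an awk variant that deletes the character wherever it occurs in the line. It matches the UTF-8 bytes of U+2601 and U+FE0F through octal escapes and runs awk in the C locale. The function name `strip_cloud_awk` is an assumption made for this example.

```shell
# Hypothetical helper: delete U+2601 and an optional following U+FE0F
# anywhere in each line, instead of removing the last field.
strip_cloud_awk() {
    # \342\230\201 is U+2601 in UTF-8; \357\270\217 is U+FE0F.
    # LC_ALL=C makes awk match byte by byte.
    LC_ALL=C awk '{ gsub(/\342\230\201(\357\270\217)?/, ""); print }'
}
```

Usage would be `strip_cloud_awk < emoji.txt`. Text after the emoji on the same line is kept.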

Culip · Answer 3 · 2020-08-13T16:38:49.020

0

Convert the UTF-8 text to an ASCII representation in which each non-ASCII character is written as an escape, remove the escapes for the unwanted characters, and convert the result back to UTF-8:

$ uni2ascii -q ./emoji.txt | sed "s/ 0x2601\(0xFE0F\)\?//g" | ascii2uni -q
Standard 
Variant form 
$

edited Aug 13 '20 at 16:38

answered Aug 10 '20 at 21:29

Culip


score 0 · Answer 4 · answered Aug 10 '20 at 21:56

Have awk print all but the last field:

$ awk '/^Standard/ || /^Variant form/ { $(NF)="" }1' emoji.txt
Standard
Variant form

NOTE: This particular solution leaves the field separator (a blank) at the end of each output line. To strip the trailing blank, pipe the output to sed, tr, etc., or have awk loop through fields 1 to NF-1 and output them via printf.
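The loop mentioned in the note could look roughly like the following. This is only a sketch; the function name `drop_last_field` is invented for the example.

```shell
# Hypothetical helper: print every field except the last one,
# without leaving a trailing separator.
drop_last_field() {
    awk '{
        # Emit fields 1 .. NF-1, separated by OFS (a single space by default).
        for (i = 1; i < NF; i++)
            printf "%s%s", $i, (i < NF - 1 ? OFS : "")
        # Finish the line; lines with zero or one field become empty lines.
        print ""
    }'
}
```

Usage would be `drop_last_field < emoji.txt`. Note that runs of whitespace between the remaining fields are collapsed to a single space.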

markp-fuso


Sorry markp-fuso, emoji.txt was just an example. Again my question is how to remove emoji with or without the variant form escape sequence. – Culip Aug 10 '20 at 21:58

Gre-san · Answer 5 · 2020-08-10T23:17:27.870

Use the nkf command. nkf -s tries to convert the character encoding to Shift_JIS, which does not support emoji, so the emoji and the variant form sequence are dropped. Then nkf -w converts the text back to UTF-8.

$ cat emoji.txt | nkf -s | nkf -w
Standard
Variant form

$ cat emoji.txt | nkf -s | nkf -w | od -tx1c
0000000  53  74  61  6e  64  61  72  64  20  0a  56  61  72  69  61  6e
          S   t   a   n   d   a   r   d      \n   V   a   r   i   a   n
0000020  74  20  66  6f  72  6d  20  0a
          t       f   o   r   m      \n
0000030

I thought Ruby might work, because \p{Emoji} matches emoji. However, it leaves the variant form sequence behind:

$ ruby -nle 'puts $_.gsub(/\p{Emoji}/,"")' emoji.txt
Standard
Variant form ️

$ ruby -nle 'puts $_.gsub(/\p{Emoji}/,"")' emoji.txt | od -tx1c
0000000  53  74  61  6e  64  61  72  64  20  0a  56  61  72  69  61  6e
          S   t   a   n   d   a   r   d      \n   V   a   r   i   a   n
0000020  74  20  66  6f  72  6d  20  ef  b8  8f  0a
          t       f   o   r   m           217  \n
0000033

No, you shouldn't convert Unicode to Shift JIS or any character set for a specific language. The source text can include various characters such as Arabic. Coincidentally, I am a Japanese-speaker too and I would use iconv rather than nkf. — Culip, Aug 11 '20 at 00:32
I noticed that `\p{Emoji_Component}` matches the variant form sequence for emoji ([link](http://www.unicode.org/Public/emoji/12.0/emoji-data.txt)). Most regex engines do not support it, but Rust's does. Install [sd](https://github.com/chmln/sd), and `sd '([^#*0-9\P{Emoji_Component}]|[^#*0-9\P{Emoji}])' '' < emoji.txt` may be what you need. – Gre-san, Aug 11 '20 at 10:53
