2

Suppose you are using a Linux/UNIX shell whose default character encoding is UTF-8:

$ echo $LANG
en_US.UTF-8

You have a text file, emoji.txt, which is encoded in UTF-8:

$ file -i ./emoji.txt
./emoji.txt: text/plain; charset=utf-8

This text file contains an emoji, once on its own and once followed by a variant form character (the variation selector U+FE0F):

$ cat ./emoji.txt
Standard ☁
Variant form ☁️
$ uni2ascii -a B -q ./emoji.txt
Standard \x2601
Variant form \x2601\xFE0F

You want to remove both emoji, including the variant form character (\xFE0F), so the output should be:

Standard 
Variant form 

How would you do this?

Update: This question is not about how to remove the last word of every line. Imagine a file emoji2.txt that contains a long text with many emoji characters, some of which are followed by the variant form character (\xFE0F).

Culip
5 Answers

1

With GNU sed and bash:

  sed -E s/$'\u2601\uFE0F?'//g emoji.txt
M. Nejat Aydin
  • In Z shell (zsh), replace `?` with `\?`. – Culip Aug 13 '20 at 16:41
  • 1
    @Culip You're right. It is better to put the `?` inside the `$' '` in `bash` as well. Otherwise it will be interpreted as a filename matching pattern. I've corrected it. – M. Nejat Aydin Aug 13 '20 at 16:55
1

You can use awk, like this:

$ cat emo.ascii 
Standard \x2601
Variant form \x2601\xFE0F
$ ascii2uni -a B emo.ascii                                  
Standard ☁
Variant form ☁️
3 tokens converted # note: this is stderr
$ ascii2uni -a B emo.ascii | awk -F' ' '{NF--}1' | cat -A 
3 tokens converted # note: this is stderr
Standard$
Variant form$

NF-- will decrease the field count in awk, which effectively removes the last field. 1 evaluates to true, which makes awk print the modified line.

(Used cat -A here only to show that there aren't any invisible characters left)
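
As a small illustration of the NF-- behaviour described above, the sketch below shows that the last field is dropped whatever it contains, so the approach is not specific to emoji. The function name drop_last_field is hypothetical, and rebuilding the line after changing NF is assumed to behave as it does in gawk and mawk; other awk implementations may differ.

```shell
# Hypothetical helper: drop the last whitespace-separated field of each line.
# The NF > 0 guard avoids decrementing NF on empty lines.
# The trailing 1 is an always-true pattern, so every line is printed.
drop_last_field() {
  awk 'NF > 0 { NF-- } 1'
}

printf 'Standard X\nkeep this word\n' | drop_last_field
# prints:
# Standard
# keep this
```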

hek2mgl
0

Convert the UTF-8 text file to an ASCII representation in which non-ASCII characters appear as escaped code points, remove those escapes with sed, and convert the result back to UTF-8:

$ uni2ascii -q ./emoji.txt | sed "s/ 0x2601\(0xFE0F\)\?//g" | ascii2uni -q
Standard 
Variant form 
$
Culip
0

Have awk print all but the last field:

$ awk '/^Standard/ || /^Variant form/ { $(NF)="" }1' emoji.txt
Standard
Variant form

NOTE: This particular solution leaves the field separator (a blank) at the end of each output line. To strip the trailing blank, pipe the output to sed, tr, etc., or have awk loop through fields 1 to (NF-1) and print them with printf.
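
The printf loop mentioned in the note could look roughly like the sketch below. The function name strip_last_field is hypothetical, and the sketch applies to every line rather than only to lines starting with "Standard" or "Variant form".

```shell
# Hypothetical helper: print fields 1 to NF-1 of each line, joined by OFS,
# so that no trailing separator is left behind.
strip_last_field() {
  awk '{
    for (i = 1; i < NF; i++)
      printf "%s%s", $i, (i < NF - 1 ? OFS : "")
    print ""   # end the line
  }'
}

printf 'Standard X\nVariant form Y\n' | strip_last_field
# prints:
# Standard
# Variant form
```

Like the answer above, this removes the last field regardless of what it is, so it only helps when the emoji is always the final word on the line.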

markp-fuso
  • Sorry markp-fuso, emoji.txt was just an example. Again my question is how to remove emoji with or without the variant form escape sequence. – Culip Aug 10 '20 at 21:58
0

Use the nkf command. nkf -s tries to convert the text to Shift_JIS, which does not support emoji, so the emoji and the trailing variant form character (U+FE0F) are dropped. Then convert the result back to UTF-8 with nkf -w.

$ cat emoji.txt | nkf -s | nkf -w
Standard
Variant form

$ cat emoji.txt | nkf -s | nkf -w | od -tx1c
0000000  53  74  61  6e  64  61  72  64  20  0a  56  61  72  69  61  6e
          S   t   a   n   d   a   r   d      \n   V   a   r   i   a   n
0000020  74  20  66  6f  72  6d  20  0a
          t       f   o   r   m      \n
0000030

I thought Ruby might work, because \p{Emoji} matches emoji. However, it leaves the variant form character behind (the bytes ef b8 8f, i.e. U+FE0F, in the od output below).

$ ruby -nle 'puts $_.gsub(/\p{Emoji}/,"")' emoji.txt
Standard
Variant form ️

$ ruby -nle 'puts $_.gsub(/\p{Emoji}/,"")' emoji.txt | od -tx1c
0000000  53  74  61  6e  64  61  72  64  20  0a  56  61  72  69  61  6e
          S   t   a   n   d   a   r   d      \n   V   a   r   i   a   n
0000020  74  20  66  6f  72  6d  20  ef  b8  8f  0a
          t       f   o   r   m           217  \n
0000033

Gre-san
  • 2
    No, you shouldn't convert Unicode to Shift JIS or to any other language-specific character set. The source text can include various characters, such as Arabic. Coincidentally, I am a Japanese speaker too, and I would use iconv rather than nkf. – Culip Aug 11 '20 at 00:32
  • I noticed that `\p{Emoji_Component}` matches the variant form character that follows an emoji ([link](http://www.unicode.org/Public/emoji/12.0/emoji-data.txt)). Most regex engines do not support it, but Rust's does. Install [sd](https://github.com/chmln/sd); then `sd '([^#*0-9\P{Emoji_Component}]|[^#*0-9\P{Emoji}])' '' < emoji.txt` may be what you need. – Gre-san Aug 11 '20 at 10:53