1

Is there a way to remove anything that's not a token, punctuation, or a special character from text using awk or sed? What I really want to get rid of are the emoticons and similar symbols.

Sample input:

Si tú no estáss yo no voy a lloraar por tiii
Me respondes porfavor?? ❤ piensas venir a Ecuador
cosas veredes!!!! Ay Papá. 
   what y'all know about this?
❤️‼️  ❤️‼️ tag  they make the final decision 
Vähän on twiitattavaa muuta kuin että aijjai ja oijjoi sekä nannaa. 
Binta On est arrivé au chicken elle voulait pleuré carrément tellement elle était heureuse 
ja mir fällt nix mehr ein
Někdo v pátek semnou na flédu na Moju reč??? 

Sample output:

Si tú no estáss yo no voy a lloraar por tiii
Me respondes porfavor?? piensas venir a Ecuador
cosas veredes!!!! Ay Papá.
what y'all know about this?
‼️ ‼️ tag  they make the final decision
Vähän on twiitattavaa muuta kuin että aijjai ja oijjoi sekä nannaa. 
Binta On est arrivé au chicken elle voulait pleuré carrément tellement elle était heureuse
ja mir fällt nix mehr ein
Někdo v pátek semnou na flédu na Moju reč???
totoro
user3639557

3 Answers

1

You might be able to use tr:

% tr -dc '[:print:]' < emoji.txt
Si t no estss yo no voy a lloraar por tiiiMe respondes porfavor??  piensas venir a Ecuadorcosas veredes!!!! Ay Pap.    what y'all know about this?   tag  they make the final decision Vhn on twiitattavaa muuta kuin ett aijjai ja oijjoi sek nannaa. Binta On est arriv au chicken elle voulait pleur carrment tellement elle tait heureuse ja mir fllt nix mehr einNkdo v ptek semnou na fldu na Moju re??? 

As you can see, this also removes newline characters (and, as the output shows, non-ASCII letters such as á and ä). The newlines can be kept by adding \n to the set:

% tr -dc '[:print:]\n' < emoji.txt
Si t no estss yo no voy a lloraar por tiii
Me respondes porfavor??  piensas venir a Ecuador
cosas veredes!!!! Ay Pap. 
   what y'all know about this?
   tag  they make the final decision 
Vhn on twiitattavaa muuta kuin ett aijjai ja oijjoi sek nannaa. 
Binta On est arriv au chicken elle voulait pleur carrment tellement elle tait heureuse 
ja mir fllt nix mehr ein
Nkdo v ptek semnou na fldu na Moju re???
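
To see why the accented letters vanish along with the emoji, here is a minimal Python sketch of what the command above appears to do. It assumes a byte-oriented tr (as GNU tr is) in which [:print:] covers only the ASCII bytes 0x20-0x7E; the function name strip_like_tr is made up for this illustration.

```python
def strip_like_tr(data: bytes) -> bytes:
    """Keep printable ASCII bytes (0x20-0x7E) and newlines; drop every other byte."""
    # Assumption: mirrors a byte-wise `tr -dc '[:print:]\n'`.
    return bytes(b for b in data if 0x20 <= b <= 0x7E or b == 0x0A)


# "á" and the heart are multi-byte sequences in UTF-8, and none of their
# bytes fall in the printable ASCII range, so both disappear entirely.
sample = "cosas veredes!!!! Ay Papá. \u2764\n".encode("utf-8")
print(strip_like_tr(sample).decode("ascii"), end="")
```

So this approach is only suitable when losing every non-ASCII character is acceptable.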
Andreas Louv
1

My best solution uses Python. The source file must be saved as UTF-8.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

text = u"""Si tú no estáss yo no voy a lloraar por tiii
Me respondes porfavor?? ❤ piensas venir a Ecuador
cosas veredes!!!! Ay Papá. 
   what y'all know about this?
❤️‼️  ❤️‼️ tag  they make the final decision 
Vähän on twiitattavaa muuta kuin että aijjai ja oijjoi sekä nannaa. 
Binta On est arrivé au chicken elle voulait pleuré carrément tellement elle était heureuse 
ja mir fällt nix mehr ein
Někdo v pátek semnou na flédu na Moju reč???
"""

emoji_pattern = re.compile(
    "["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002760-\U0000276F"  # dingbats, including the heavy black heart (U+2764)
    "]+", flags=re.UNICODE
)

print(emoji_pattern.sub(r'', text))

Output:

Si tú no estáss yo no voy a lloraar por tiii
Me respondes porfavor??  piensas venir a Ecuador
cosas veredes!!!! Ay Papá. 
   what y'all know about this?
‼️  ️‼️ tag  they make the final decision 
Vähän on twiitattavaa muuta kuin että aijjai ja oijjoi sekä nannaa. 
Binta On est arrivé au chicken elle voulait pleuré carrément tellement elle était heureuse 
ja mir fällt nix mehr ein
Někdo v pátek semnou na flédu na Moju reč???
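
One detail in the output above: when a heart is followed by the variation selector U+FE0F, the pattern removes the heart but leaves the invisible selector behind. A possible variant, written for Python 3 and not part of the original answer, lets each matched symbol consume an optional trailing U+FE0F. The name emoji_with_selector is hypothetical.

```python
import re

# Same ranges as above; each matched symbol may be followed by the
# variation selector U+FE0F, which is removed together with it.
emoji_with_selector = re.compile(
    "(?:["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002760-\U0000276F"  # dingbats, including U+2764
    "]\uFE0F?)+"
)

line = "\u2764\uFE0F\u203C\uFE0F tag  they make the final decision"
print(emoji_with_selector.sub("", line))
```

The double exclamation mark (U+203C) is outside every listed range, so it and its own selector are kept, as in the requested sample output.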
totoro
0

This command will remove every character that is not alphabetic, numeric, punctuation or white space:

sed 's/[^[:alnum:][:punct:][:space:]]//g' input

Limitation: some of the odd-looking characters you see might be valid Unicode alphabetic characters for which your computer has no installed font. This command won't remove them.

How it works

[:alnum:], [:punct:], and [:space:] are character classes that match, respectively, any alphanumeric, punctuation, or whitespace character. The regex [^[:alnum:][:punct:][:space:]] matches any character that does not belong to one of those three classes. The sed substitution command s/[^[:alnum:][:punct:][:space:]]//g performs a global search-and-replace: it finds every character not in one of those classes and replaces it with nothing, that is, removes it.
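
For comparison, the same keep-only-these-classes idea can be sketched in Python. This is only an approximation: POSIX classes depend on the locale, whereas the sketch below uses Python's str methods and Unicode general categories. The function name keep_text_characters is invented for this example.

```python
import string
import unicodedata


def keep_text_characters(text):
    """Drop every character that is not alphanumeric, punctuation, or whitespace."""
    kept = []
    for ch in text:
        if (
            ch.isalnum()                                   # letters and digits, incl. á, č
            or ch.isspace()                                # spaces, tabs, newlines
            or ch in string.punctuation                    # ASCII punctuation such as $ or ~
            or unicodedata.category(ch).startswith("P")    # Unicode punctuation such as ‼
        ):
            kept.append(ch)
    return "".join(kept)


print(keep_text_characters("Me respondes porfavor?? \u2764 piensas venir a Ecuador"))
```

Note that this sketch also drops combining marks and variation selectors, since they belong to none of the kept classes.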

John1024
  • It doesn't work. Have you tried it on the sample input? – user3639557 May 21 '16 at 09:35
  • @user3639557 Yes, it worked fine for me. `sed` is a well-developed and reliable tool. When you tried it, what happened? Was there an error message? Did it keep or remove the wrong characters? Or, what? Also, what OS are you using? – John1024 May 21 '16 at 18:06
  • And what operating system? – John1024 May 22 '16 at 02:31
  • Ubuntu - I use sed for replacing other patterns. This one however doesn't work... – user3639557 May 22 '16 at 02:51
  • OK. Try `sed 's/[^[:space:]]//g' input` and you should see no output but whitespace. If that works, then try `sed 's/[^[:punct:][:space:]]//g' input` and you should see just the whitespace and the punctuation. If that works, then try `sed 's/[^[:alnum:][:punct:][:space:]]//g' input`. That should show just whitespace, punctuation and alphanumeric characters. – John1024 May 22 '16 at 07:37