I have a text mining project, and many rows of the key text column are non-English or unreadable, e.g. the rows below. Can anyone suggest how to automate identifying these rows so I can delete them?

The data are stored in a MySQL database and in CSV files, so any suggestion is welcome.

<†Û†ÛÛ†Û”†Û_†ä¢†ÛÀ†Û_ë_©”冾†£†™†Â†_†—ë__†Ü† †é†å´_•´_•_Èä†äé†Û_†Ûã†Û_†äê†_ë_ã†Âë_†_ë_Ć£†_†Ü†_†ã†™†—ä´_•´_ê_“_´_ê´_ê_ãdž¾†£†™† †ä_† ëÄå†_àà†ä–†_†Ü†_†_†ä–†_†ã†™†—ä†é†Â†_†Û†ÛÛ†Û”†Û_†ä¢†ÛÀ†Û_†_†ã†™†‘†—_ÈÂ_—_†â†Ûæ†Û_†Û_†Ûâ†Â†Ü†ÜëÙ_†â__ëæÄ†ã†_†ää†ä_†_†_†ã†™†—ä_“Œë_–†À룆ä_†Û™†äê_ †_ëèÛ_Ćã†_†™†Â†_†È†‘†—ä†Ûæ†ä_†ä_†Û_†——†äé†Û_†——NY†äê_Àë_†å†äà†ä_†ä¢†È†ä ë___ç•_܆é†äé†Û_†Û—†Û_†_†ã†™†‘†—†ä_†äå†ä_†äè†Û_†Â___†ää__Ć—Ć_†__çã†äê__膣†_††ä‘†_†ã†™†ã†—†é†ä–†å†Ü†£†_†ää†_ä_†Àë__†è†‘†ä†ä_†äÙ†ÛdžÛ_†äÙ†Û_†__™”†ä_† ††—†Û_†ÛÛ†äà†ÛÇ†Û†Ûæ†Û_†ää†äÀ†Ûé†Û_†Û™†Û_†_†—†Û†Û_†ä_†ä±†Û_†ä_†À_ÈÂ_—_†ä_†ä_†Û™†Ûâ†ä_†Û_†éë•_†___—Ä_†åå—†£†_†Ü†™_Àë_†‘†äŒ†Û܆䌆ÛÜ_£™†_†_†Ü†_†ã†™†—äëã_Ž…†Â†_ë__†«†ä–†ää____ã†_ëÜé†ã† †£†™†_†È†—ä
<El lugar est’ bueno  la comida tambi’©n  los precios demasiado caros para este tipo de resto. Quiero rescatar la atenci’_n que fue muy buena.
kevin
  • When getting such data you should read them as HEX or BINARY. Messing with these problems in character-encoding will break your brain. –  Mar 16 '14 at 00:05
  • I am not sure, but these could be Chinese characters in UTF-16 – Casimir et Hippolyte Mar 16 '14 at 00:36
  • Some of the answers on this page have some suggestions you can try. Although the general consensus seems to be that it's a bad idea. The one that seemed best to me was to dump the table, do a string find/replace and then import it back in. [Stack Overflow Article](http://stackoverflow.com/questions/986826/how-to-do-a-regular-expression-replace-in-mysql) – Quixrick Mar 16 '14 at 00:40
  • I don't know who downvotes this question (and probably votes to close it, OMG), but it's stupid. Is there a badge for offensive speech? – Casimir et Hippolyte Mar 16 '14 at 00:49
  • @Quixrick thank you, but it looks like that does not solve it. I saw that this might be helpful in MySQL: `ORDER BY text COLLATE utf8_bin`, but I still see a lot of Spanish words. – kevin Mar 16 '14 at 02:28
  • @CasimiretHippolyte thank you. I am not sure; I am trying to identify those texts so I can remove them. – kevin Mar 16 '14 at 02:29
  • @Allendar, thank you. I think so. Looking for some remedies now. I am sure some of you have encountered this before? – kevin Mar 16 '14 at 03:34
  • It will be best for us to get a HEX dump of your CSV. There are tons of HEX-editors to find online for free. If you open the CSV that way and can place a screenshot and/or copy-paste the value in the question it will greatly help. The text you pasted here is hard to distill as it already has gone through encoding changes. This could even be UTF-32. –  Mar 16 '14 at 10:15
  • Distilling your posted UTF-8 formatted characters to UTF-16 I get some untranslatable (at least for me) Hangul (Korean): `U+3CE2 U+80A0 U+C39B` results in `㳢肠쎛` –  Mar 16 '14 at 10:31
  • The second character in Chinese means `intestinal`. Any way it might be medical data? The readable part of your data does seem to talk about food too :P –  Mar 16 '14 at 10:37
  • @Allendar, not only these unreadable words, but also these: I soliti 8 sabato 26 ottobre abbiamo provato da Gaetano e devo dire che subito siamo stati piacevolmente colpiti dalla gentilezza e cortesia di Gaetano ma ad onor del vero tutto il personale è stato gentilissimo; ottima pizza cotta sapientemente e dall'impasto delicato e digeribilisimo si sentivano i – kevin Mar 17 '14 at 04:26
  • @Allendar, haha, not medical data, reviews about restaurant. Intestinal can be cooked to be some kind of food too.... – kevin Mar 17 '14 at 04:27
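
The hex dump requested in the comments above can also be produced without a separate hex editor. The following is only a minimal sketch in Python; the helper name `hex_dump` and the file name `reviews.csv` are hypothetical and not taken from the question.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of: offset, hex values, printable ASCII."""
    pad = width * 3 - 1  # width of the hex column for a full row
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


# Hypothetical usage on the raw CSV (read in binary mode, no decoding):
# with open("reviews.csv", "rb") as f:
#     print(hex_dump(f.read(64)))

# Small in-memory demonstration: "ó" is two bytes (c3 b3) in UTF-8.
print(hex_dump("atención".encode("utf-8")))
```

Reading the file in binary mode matters here: it shows the bytes as stored, before any decoding step has a chance to alter them.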

1 Answer

Unicode Character Class

[\u007f-\uffff]

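Replacing every match of this class with an empty string will remove pretty much every non-ASCII character. It does not remove text in other languages that is written in plain Latin letters, as the Spanish line in the result shows.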


Result

This is the result I get from your text:

<______ ___________________ _ _________________________ ________NY___ _________________________________ _____________________________________ _
<El lugar est bueno  la comida tambin  los precios demasiado caros para este tipo de resto. Quiero rescatar la atenci_n que fue muy buena.
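
To identify rows for deletion rather than only strip characters, the same class can be used to measure how much of each row falls outside ASCII. This is a rough sketch, not a tested solution for the asker's data: the function names and the `0.3` threshold are assumptions chosen for illustration, and the class does not cover characters above U+FFFF.

```python
import re

# Same character class as above: anything from U+007F to U+FFFF.
NON_ASCII = re.compile(r"[\u007f-\uffff]")


def strip_non_ascii(text: str) -> str:
    """Remove every character matched by the class."""
    return NON_ASCII.sub("", text)


def non_ascii_ratio(text: str) -> float:
    """Fraction of characters in text that match the class."""
    if not text:
        return 0.0
    return len(NON_ASCII.findall(text)) / len(text)


def is_probably_garbled(text: str, threshold: float = 0.3) -> bool:
    """Flag a row when its non-ASCII share exceeds the (assumed) threshold."""
    return non_ascii_ratio(text) > threshold


rows = [
    "The food was great.",
    "†Û†ÛÛ†Û”†Û_†ä¢",
    "El lugar está bueno",
]
kept = [row for row in rows if not is_probably_garbled(row)]
print(kept)
```

A ratio test like this only catches mis-encoded rows. Spanish or Italian reviews written mostly in ASCII letters pass it, so removing those would need a separate language-detection step.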
Vasili Syrakis