3

I have a field with encoding utf8-general-ci in which many values contain non-ascii characters. I want to

  1. Search for all fields with any non-ascii characters

  2. Replace all non-ascii characters with their corresponding ascii version.

For example: côte-d'ivoire should be replaced with cote-d-i'voire, são-tomé should be replaced with sao-tome, etc.

How do I achieve this? If I just change the field type to ascii, non-ascii characters get replaced by '?'. I am not even able to search for all such fields using

RLIKE '%[^a-z]%'

For example

SELECT columname 
FROM tablename 
WHERE NOT columname REGEXP '[a-z]';

returns an empty set.

Thanks

workwise
  • 1,003
  • 16
  • 33
  • What do you mean by _non-ascii characters'? Are the values properly coded in utf8 or do you encounter encoding issues as it is? – Bjoern Aug 22 '13 at 04:25
  • Keep in mind that "ô" is not "ASCII". US-ASCII is the 7-bit range encompassing only the English alphabet. Your example of "côte-d'ivoire" => "côte-d-ivoire", because 1. "ô" is not ASCII, and 2. there is nothing wrong with the single-quote character. – SineSwiper Dec 05 '14 at 22:26

3 Answers3

3

An sql fiddle example is at

http://www.sqlfiddle.com/#!2/c1d90/1/0

the query to select is

select * from test where maintext rlike  '[^\x00-\x7F]'

Hope this helps

skv
  • 1,793
  • 3
  • 19
  • 27
  • 1
    Thanks, this helps a lot and though automatic conversion is bit more involved I guess (custom function?), this will help me for the time being. – workwise Aug 22 '13 at 04:46
  • http://stackoverflow.com/questions/2743070/removing-non-ascii-characters-from-a-string-using-python-django may help in actual removal – skv Aug 22 '13 at 04:48
  • 1
    I don't believe that the OP wishes to remove the actual non_ascii characters but replace them with the ascii equiv. Thus http://stackoverflow.com/questions/3833791/python-regex-to-convert-non-ascii-characters-in-a-string-to-closest-ascii-equiva would probably be more relative. – Llanilek Nov 04 '13 at 13:48
  • 1
    That being said, we cannot just presume that the OP is using python or django. If they are using python then https://pypi.python.org/pypi/Unidecode/0.04.1 would work perfectly. Seeing as most of their questions are PHP based I doubt they would be using python. – Llanilek Nov 04 '13 at 14:21
1

I'm presuming from your previous questions that you're using PHP.

https://github.com/silverstripe-labs/silverstripe-unidecode

You could then use skv's answer to return the object's you wish to use and then use unidecode to attempt to convert the object to it's ascii equivalents.

Community
  • 1
  • 1
Llanilek
  • 3,386
  • 5
  • 39
  • 65
1

In Perl, you can use Text::Unidecode.

In MySQL, there isn't an easy function to convert from utf8 (or utf8mb4) into ascii without it spitting out some ugly '?' characters as replacements. It's best to replace these prior to inserting them in the DB, or run something in Perl (or whatever) to extract the data and re-update it one row at a time.

There are many different ports of Text::Unidecode in different languages: Python, PHP, Java, Ruby, JavaScript, Haskell, C#, Clojure, Go.

SineSwiper
  • 2,046
  • 2
  • 18
  • 22