I have an app that syncs data from a remote DB that users populate. It seems people copy and paste crap from a ton of different OSes and programs, which can cause various hidden non-ASCII values to be imported into the system.

For example I end up with this:

Artist:â â Ioco

This ends up getting sent back into the system during sync, and my JSON conversion makes the problem worse: invalid characters in various places cause my app to crash.

How do I search for and clean out any of these invalid characters?

the Tin Man
Slee
  • In a nutshell: create a new mutable string, iterate over all characters, check whether each one is an ASCII character and, if so, append it to the string. –  Jun 15 '11 at 17:15
  • 17
    In 2011 there's really no excuse not to handle unicode properly (http://www.joelonsoftware.com/articles/Unicode.html). Remember that real people can and do have names like José or Müller or Jönsson or even Məmmədov or ბერიძე or 陈. – damian Jun 15 '11 at 17:27
  • 4
    These "crap" characters are letters from languages other than English. You should try to figure out the right encoding to preserve the letters. – vikingosegundo Jun 15 '11 at 17:33
  • turns out it wasn't those characters but some hexadecimal values I could not see – Slee Jun 17 '11 at 13:23
  • 2
    Those 'hexadecimal values' that you can't see will be components of multi-byte (Unicode) characters that your software isn't handling properly. – damian Jun 18 '11 at 14:08
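
To make damian's point concrete, here is a minimal sketch of how sequences like the "â" in the question typically arise: UTF-8 bytes get read back with a single-byte encoding. The helper names `MisdecodeUTF8AsLatin1` and `DecodeUTF8` are hypothetical, the code assumes ARC, and whether the app in the question mis-decodes in exactly this way is an assumption.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the question): encode text as UTF-8, then
// decode the same bytes as ISO Latin-1. Each byte of a multi-byte character
// becomes its own (wrong) character, which is the classic source of mojibake.
static NSString *MisdecodeUTF8AsLatin1(NSString *text) {
    NSData *utf8Bytes = [text dataUsingEncoding:NSUTF8StringEncoding];
    return [[NSString alloc] initWithData:utf8Bytes
                                 encoding:NSISOLatin1StringEncoding];
}

// Hypothetical helper: decoding with the encoding the bytes were written in
// preserves the original text.
static NSString *DecodeUTF8(NSData *utf8Bytes) {
    return [[NSString alloc] initWithData:utf8Bytes
                                 encoding:NSUTF8StringEncoding];
}
```

For example, `MisdecodeUTF8AsLatin1(@"Olé")` gives `OlÃ©`, because é is stored as the two bytes 0xC3 0xA9 in UTF-8, while passing those same bytes to `DecodeUTF8` returns `Olé` intact.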

2 Answers

23

While I strongly believe that supporting Unicode is the right way to go, here's an example of how you can limit a string to only contain certain characters (in this case printable ASCII):

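NSString *test = @"Olé, señor!";

// Build a string holding every printable ASCII character (32 through 126).
NSMutableString *asciiCharacters = [NSMutableString string];
for (NSInteger i = 32; i < 127; i++)  {
    // %c takes an int-sized argument, so cast the NSInteger explicitly.
    [asciiCharacters appendFormat:@"%c", (char)i];
}

// Invert that set to get everything that is not printable ASCII.
NSCharacterSet *nonAsciiCharacterSet = [[NSCharacterSet characterSetWithCharactersInString:asciiCharacters] invertedSet];

// Split on the unwanted characters, then join the remaining pieces.
test = [[test componentsSeparatedByCharactersInSet:nonAsciiCharacterSet] componentsJoinedByString:@""];

NSLog(@"%@", test); // Prints @"Ol, seor!"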
Cristik
Morten Fast
  • 2
    No, because `stringByTrimmingCharactersInSet` only trims the ends of the string, and therefore won't remove all the characters. – Morten Fast Aug 15 '12 at 09:04
  • 1
    I agree that Unicode is the way to go. However in some cases this might still be valid. I have to generate QR Codes and I think that umlauts and the like are not ideal characters there. – Besi Apr 16 '14 at 13:30
  • Thanks, mate! This was brilliant. – Felipe Feb 16 '17 at 21:32
  • 1
    @Cristik: sorry for making the changes, you were right, they're best added as a separate answer. – NSGod Sep 04 '22 at 14:40
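
Since the asker later reported that the culprits were invisible values rather than accented letters, a gentler variant of the same split-and-join technique may be enough: strip only control and format characters and keep the rest of Unicode. This is a sketch under assumptions, not the asker's eventual fix; `StripInvisibleCharacters` is a hypothetical name, and keeping tab, newline and carriage return is a choice made here for illustration. Apple documents `controlCharacterSet` as covering Unicode categories Cc and Cf.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: remove control/format characters but keep letters
// such as é and ñ, unlike the ASCII-only filter.
static NSString *StripInvisibleCharacters(NSString *input) {
    NSMutableCharacterSet *unwanted =
        [[NSCharacterSet controlCharacterSet] mutableCopy];
    // Assumption: ordinary whitespace controls are legitimate and should stay.
    [unwanted removeCharactersInString:@"\t\n\r"];

    // Same technique as the answer: split on unwanted characters, then rejoin.
    return [[input componentsSeparatedByCharactersInSet:unwanted]
               componentsJoinedByString:@""];
}
```

With this helper, a string such as "Olé" followed by a stray 0x01 byte and ", señor!" comes back as "Olé, señor!" with its accents untouched.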
0

A simpler version of Morten Fast's answer:

NSString *test = @"Olé, señor!";

NSCharacterSet *nonAsciiCharacterSet = [[NSCharacterSet 
           characterSetWithRange:NSMakeRange(32, 127 - 32)] invertedSet];

test = [[test componentsSeparatedByCharactersInSet:nonAsciiCharacterSet] 
                          componentsJoinedByString:@""];

NSLog(@"%@", test); // Prints @"Ol, seor!"

Notably, this uses NSCharacterSet's +characterSetWithRange: method to specify the desired ASCII range directly, rather than first building a string of those characters.

The results are identical, as comparing one to the other with isEqual: returns YES.
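
One way to check that claim for yourself is to run both approaches on the same input and compare the filtered strings. The sketch below assumes ARC, and the function names are hypothetical; it compares the output strings only and makes no claim about how the two character sets themselves compare.

```objc
#import <Foundation/Foundation.h>

// Printable ASCII built character by character, as in Morten Fast's answer.
static NSCharacterSet *PrintableASCIIFromString(void) {
    NSMutableString *ascii = [NSMutableString string];
    for (int i = 32; i < 127; i++) {
        [ascii appendFormat:@"%c", (char)i];
    }
    return [NSCharacterSet characterSetWithCharactersInString:ascii];
}

// Printable ASCII built from a range, as in this answer.
static NSCharacterSet *PrintableASCIIFromRange(void) {
    return [NSCharacterSet characterSetWithRange:NSMakeRange(32, 127 - 32)];
}

// Hypothetical helper: keep only the characters in `allowed`.
static NSString *KeepOnlyCharacters(NSString *input, NSCharacterSet *allowed) {
    return [[input componentsSeparatedByCharactersInSet:[allowed invertedSet]]
               componentsJoinedByString:@""];
}
```

Calling `KeepOnlyCharacters` on the test string once with each set and passing the two results to `isEqual:` is then a direct check of the statement above.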

NSGod