
I am trying to parse a set of words in which each entry contains Greek letters first, then English letters. This would be easy if there were a delimiter between the two parts. This is what I've built so far:

    - (void)loadWordFileToArray:(NSBundle *)bundle  {
        NSLog(@"loadWordFileToArray");

        if (bundle != nil) {
             NSString *path = [bundle pathForResource:@"alfa" ofType:@"txt"];
            //pull the content from the file into memory
            NSData* data = [NSData dataWithContentsOfFile:path];
            //convert the bytes from the file into a string
            NSString* string = [[NSString alloc] initWithBytes:[data bytes]
                                                         length:[data length]
                                                       encoding:NSUTF8StringEncoding];


            //split the string around newline characters to create an array
            NSString* delimiter = @"\n";
            incomingWords = [string componentsSeparatedByString:delimiter];
            NSLog(@"incomingWords count: %lu", (unsigned long)incomingWords.count);
        }
    }

-(void)parseWordArray{
    NSLog(@"parseWordArray");

    NSString *seperator = @" = ";
    int i = 0;
    for (i=0; i < incomingWords.count; i++) {
        NSString *incomingString = [incomingWords objectAtIndex:i];

        NSScanner *scanner = [NSScanner localizedScannerWithString: incomingString];

        NSString *firstString;
        NSString *secondString;
        NSInteger scanPosition;

        [scanner scanUpToString:seperator intoString:&firstString];
        scanPosition = [scanner scanLocation];
        secondString = [[scanner string] substringFromIndex:scanPosition+[seperator length]];

       // NSLog(@"greek: %@", firstString);
       // NSLog(@"english: %@", secondString);

        [outgoingWords insertObject:[NSMutableArray arrayWithObjects:@"greek", firstString, @"english",secondString,@"category", @"", nil] atIndex:0];

        [englishWords insertObject:[NSMutableArray arrayWithObjects:secondString,nil] atIndex:0];
    }
}

But I cannot count on there being delimiters.

I have looked at this question. I want something similar: grab the characters in the string until an English letter is found, then put that first group into one new string and all the characters after it into a second new string.

I only have to run this a few times, so optimization is not my highest priority. Any help would be appreciated.

EDIT:

I've changed my code as shown below to make use of NSLinguisticTagger. This works, but is this the best way? Note that, for some reason, the English characters are tagged as "und".

The incoming string is: άγαλμα, το statue. Only the last 6 characters are in English.

    int j = 0;
    for (j=0; j<incomingString.length; j++) {
        NSString *language = [tagger tagAtIndex:j scheme:NSLinguisticTagSchemeLanguage tokenRange:NULL sentenceRange:NULL];
        if ([language isEqual:@"und"]) {
            NSLog(@"j is: %i", j);
            int k = 0;
            for (k=0; k<j; k++) {
                NSRange range = NSMakeRange(0, k);

                NSString *tempString = [incomingString substringWithRange:range];
                NSLog(@"tempString: %@", tempString);
            }
            return;
        }
        NSLog(@"Language: %@", language);
    }
ICL1901

2 Answers


What you could do is use NSLinguisticTagger to find the language of each word (or letter); where the language changes, you know where to split the string. You can use NSLinguisticTagger like this:

NSArray *tagschemes = @[NSLinguisticTagSchemeLanguage];
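// The options are NSLinguisticTaggerOptions flags; NSLinguisticTagPunctuation is a tag string, not an option.
NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc] initWithTagSchemes:tagschemes options:NSLinguisticTaggerOmitPunctuation | NSLinguisticTaggerOmitWhitespace];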
[tagger setString:@"This is my string in English."];
NSString *language = [tagger tagAtIndex:0 scheme:NSLinguisticTagSchemeLanguage tokenRange:NULL sentenceRange:NULL];
//Loop through each index of the string's characters and check the language as above.
//If it has changed then you can assume the language has changed.

Alternatively, you can use NSSpellChecker's requestCheckingOfString to get the dominant language in a range of characters:

NSSpellChecker *spellChecker = [NSSpellChecker sharedSpellChecker];
[spellChecker setAutomaticallyIdentifiesLanguages:YES];
NSString *spellCheckText = @"Guten Herr Mustermann. Dies ist ein deutscher Text. Bitte löschen Sie diesen nicht.";

[spellChecker requestCheckingOfString:spellCheckText
  range:(NSRange){0, [spellCheckText length]}
  types:NSTextCheckingTypeOrthography
  options:nil
  inSpellDocumentWithTag:0
  completionHandler:^(NSInteger sequenceNumber, NSArray *results, NSOrthography *orthography, NSInteger wordCount) {
    NSLog(@"dominant language = %@", orthography.dominantLanguage);
}];

This answer has information on how to detect the language of an NSString.

KerrM
  • Thank you very much! I will play this afternoon to try these techniques. I'm much obliged. +1 – ICL1901 Aug 13 '14 at 10:23
  • OK, I think I've got it. I've edited my code. There is still one question: why English is tagged as "und". I will look at the link you gave me to see if I can resolve that – ICL1901 Aug 13 '14 at 11:16
  • Perhaps it is because it doesn't have enough information to determine whether a character is English or any other Latin-alphabet language. I would suggest determining the language of a word instead of a single character. – KerrM Aug 13 '14 at 11:19
  • From the docs for `NSOrthography`: "the tag und is used if a specific language cannot be determined." https://developer.apple.com/library/mac/documentation/cocoa/Reference/NSOrthography_Class/Reference/Reference.html#//apple_ref/occ/cl/NSOrthography – KerrM Aug 13 '14 at 11:28
  • Yes, I got that. I just don't know why the English word couldn't be identified. Bug? – ICL1901 Aug 13 '14 at 13:07
  • My first thought would be that one word does not provide enough context for the process to determine a language. – KerrM Aug 13 '14 at 13:14
  • Thanks for the help. I have the file I need! – ICL1901 Aug 13 '14 at 14:50

Allow me to introduce two good friends of mine: NSCharacterSet and NSRegularExpression. Along with them comes normalization (in the Unicode sense).

First, you should normalize strings before analyzing them against a character set. You will need to look at the choices, but normalizing to the composed forms is the way I would go. This means an accented character is stored as one code point instead of two or more, which reduces the number of things to compare.
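As a minimal sketch of that normalization step, assuming canonical composition (NFC) is the form you settle on, NSString's precomposedStringWithCanonicalMapping does the work. The sample string is only an illustration: an alpha followed by a separate combining accent.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // Alpha (U+03B1) followed by a combining acute accent (U+0301):
        // two code points that render as one accented letter.
        NSString *decomposed = @"\u03B1\u0301";

        // NFC: canonical decomposition followed by canonical composition.
        NSString *composed = [decomposed precomposedStringWithCanonicalMapping];

        NSLog(@"length before: %lu, after: %lu",
              (unsigned long)decomposed.length,
              (unsigned long)composed.length);
    }
    return 0;
}
```

After composition the pair collapses into the single code point U+03AC, so later set-membership tests see one character per accented letter.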

Next, you can easily build your own NSCharacterSet objects from strings (loaded from files even) to use to test set membership.
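One way this could look for the question's input, as a sketch rather than a definitive implementation: build a set of the unaccented English letters from a plain string, find the first member, and split there. The helper name SplitAtFirstEnglishLetter and the choice to trim surrounding whitespace are assumptions made for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits `line` at the first English (ASCII) letter.
// Returns @[greek, english]; `english` is empty when no such letter is found.
static NSArray *SplitAtFirstEnglishLetter(NSString *line) {
    // Compose accents first so each accented Greek letter is one character.
    NSString *normalized = [line precomposedStringWithCanonicalMapping];

    // Built from a plain string here; the string could also come from a file.
    NSCharacterSet *english = [NSCharacterSet characterSetWithCharactersInString:
        @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"];

    NSRange first = [normalized rangeOfCharacterFromSet:english];
    if (first.location == NSNotFound) {
        return @[normalized, @""];
    }

    // Assumption: whitespace around each half is not wanted.
    NSCharacterSet *blank = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSString *greek = [[normalized substringToIndex:first.location]
        stringByTrimmingCharactersInSet:blank];
    NSString *rest = [[normalized substringFromIndex:first.location]
        stringByTrimmingCharactersInSet:blank];
    return @[greek, rest];
}

int main(void) {
    @autoreleasepool {
        NSArray *parts = SplitAtFirstEnglishLetter(@"άγαλμα, το statue");
        // Logs: greek: άγαλμα, το | english: statue
        NSLog(@"greek: %@ | english: %@", parts[0], parts[1]);
    }
    return 0;
}
```

Because the set contains only ASCII letters, look-alike Greek letters such as omicron are not mistaken for English ones.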

Lastly, regular expressions can achieve the same thing using Unicode property names as classes or categories of characters. Regular expressions can be terser and more expressive.
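A hedged sketch of the regular-expression route, assuming the ICU engine behind NSRegularExpression accepts the script property \p{Script=Latin} (worth verifying on your target OS). The helper name SplitWithRegex and the pattern are illustrative, not the only way to write it.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns @[greek, english], or nil when the line
// contains no Latin-script letter.
static NSArray *SplitWithRegex(NSString *line) {
    // Group 1: the shortest run of non-Latin characters from the start.
    // Group 2: everything from the first Latin-script letter to the end.
    NSString *pattern = @"^(\\P{Script=Latin}*?)\\s*(\\p{Script=Latin}.*)$";

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"pattern rejected: %@", error);
        return nil;
    }

    NSTextCheckingResult *match =
        [regex firstMatchInString:line
                          options:0
                            range:NSMakeRange(0, line.length)];
    if (match == nil) {
        return nil;
    }

    NSString *greek = [line substringWithRange:[match rangeAtIndex:1]];
    NSString *english = [line substringWithRange:[match rangeAtIndex:2]];
    return @[greek, english];
}

int main(void) {
    @autoreleasepool {
        NSArray *parts = SplitWithRegex(@"άγαλμα, το statue");
        if (parts != nil) {
            NSLog(@"greek: '%@' english: '%@'", parts[0], parts[1]);
        }
    }
    return 0;
}
```

Unlike a hand-built character set, the script property also covers accented Latin letters, which may or may not be what you want for English-only text.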

uchuugaka
  • Thank you uchuugaka! Regular Expressions are something I've wanted to learn. Could you recommend any good reading material? I know where to study NSCharacterSet.. Thanks again. +1 – ICL1901 Aug 13 '14 at 10:26
  • There is no better book than Mastering Regular Expressions. The first 3 chapters will get you a lot. It's a lot easier to learn from Ruby or Python or Perl, then transfer to Objective-C. – uchuugaka Aug 13 '14 at 10:29
  • If you learn one thing this year, make it Regular Expressions. It is a skill that transfers to any programming environment and really helps forever. – uchuugaka Aug 13 '14 at 10:31