NSCharacterSet uses ints but I need unsigned short?

Question

I am using MWFeedParser to add a feed to my app. The framework parses dates, and it produces a few warnings, mainly because the code is older.

Four warnings are left, and they are all the same. Technically I can change the code so that the warnings go away, but then the app no longer works properly.

The code in question is:

    // Character sets
    NSCharacterSet *stopCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@"< \t\n\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];

The part that triggers the warning is:

\t\n\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];

The warning is:

Format specifies type 'unsigned short' but the argument has type 'int'

So I changed it to:

\t\n\r%i%i%i%i", 0x0085, 0x000C, 0x2028, 0x2029]];

which indeed removed the warnings and gave me perfect code:-) (no warnings or errors)

When I then ran the app, it did not parse the date and was not able to open the link. I am not sure if this is a C thing, but right now it is definitely outside my field of knowledge. Can anyone help me fix the warning and still have it working in the app?

Thank you in advance:-)

EDIT

- (NSString *)stringByConvertingHTMLToPlainText {

// Pool
NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

// Character sets
NSCharacterSet *stopCharacters = [NSCharacterSet characterSetWithCharactersInString:@"< \t\n\r\x0085\x000C\u2028\u2029"];    
NSCharacterSet *newLineAndWhitespaceCharacters = [NSCharacterSet characterSetWithCharactersInString:@"< \t\n\r\205\014\u2028\u2029"];


NSCharacterSet *tagNameCharacters = [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"];

// Scan and find all tags
NSMutableString *result = [[NSMutableString alloc] initWithCapacity:self.length];
NSScanner *scanner = [[NSScanner alloc] initWithString:self];
[scanner setCharactersToBeSkipped:nil];
[scanner setCaseSensitive:YES];
NSString *str = nil, *tagName = nil;
BOOL dontReplaceTagWithSpace = NO;
do {

    // Scan up to the start of a tag or whitespace
    if ([scanner scanUpToCharactersFromSet:stopCharacters intoString:&str]) {
        [result appendString:str];
        str = nil; // reset
    }

    // Check if we've stopped at a tag/comment or whitespace
    if ([scanner scanString:@"<" intoString:NULL]) {

        // Stopped at a comment or tag
        if ([scanner scanString:@"!--" intoString:NULL]) {

            // Comment
            [scanner scanUpToString:@"-->" intoString:NULL]; 
            [scanner scanString:@"-->" intoString:NULL];

        } else {

            // Tag - remove and replace with space unless it's
            // a closing inline tag then dont replace with a space
            if ([scanner scanString:@"/" intoString:NULL]) {

                // Closing tag - replace with space unless it's inline
                tagName = nil; dontReplaceTagWithSpace = NO;
                if ([scanner scanCharactersFromSet:tagNameCharacters intoString:&tagName]) {
                    tagName = [tagName lowercaseString];
                    dontReplaceTagWithSpace = ([tagName isEqualToString:@"a"] ||
                                               [tagName isEqualToString:@"b"] ||
                                               [tagName isEqualToString:@"i"] ||
                                               [tagName isEqualToString:@"q"] ||
                                               [tagName isEqualToString:@"span"] ||
                                               [tagName isEqualToString:@"em"] ||
                                               [tagName isEqualToString:@"strong"] ||
                                               [tagName isEqualToString:@"cite"] ||
                                               [tagName isEqualToString:@"abbr"] ||
                                               [tagName isEqualToString:@"acronym"] ||
                                               [tagName isEqualToString:@"label"]);
                }

                // Replace tag with string unless it was an inline
                if (!dontReplaceTagWithSpace && result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "];

            }

            // Scan past tag
            [scanner scanUpToString:@">" intoString:NULL];
            [scanner scanString:@">" intoString:NULL];

        }

    } else {

        // Stopped at whitespace - replace all whitespace and newlines with a space
        if ([scanner scanCharactersFromSet:newLineAndWhitespaceCharacters intoString:NULL]) {
            if (result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "]; // Dont append space to beginning or end of result
        }

    }

} while (![scanner isAtEnd]);

// Cleanup
[scanner release];

// Decode HTML entities and return
NSString *retString = [[result stringByDecodingHTMLEntities] retain];
[result release];

// Drain
[pool drain];

// Return
return [retString autorelease];

}

First, thanks for providing the context in which you're attempting to use your character set. I see that your 0x0085 is called "Next Line (NEL)" in the Unicode spec. But I'm at a loss as to why the compiler complains when I attempt to specify that code as a unichar. I've withdrawn my answer since I couldn't be more complete. But, best wishes to you. — Extra Savoir-Faire, Nov 25 '12 at 03:59
Thanks for your comment - I should have copy and pasted your answer to have a better play around with it:-) I hope to get there:-) — jwknz, Nov 25 '12 at 04:09
Form feed (0x0C) can be \f. The 0x85 can be done with %c. `NSCharacterSet *stopCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@"< \t\n\r%c\f\u2028\u2029", 0x85]];` — rmaddy, Nov 25 '12 at 04:30
My conclusion is that the warning is actually a compiler bug. The compiler should not generate that warning. — Dietrich Epp, Nov 25 '12 at 06:19

Dietrich Epp · Accepted Answer · 2012-11-25T06:52:30.013

This is a total mess

The reason this is a total mess is because you are running into a compiler bug and an arbitrary limitation in the C spec.

Scroll to the bottom for the fix.

Compiler warning

Format specifies type 'unsigned short' but the argument has type 'int'

My conclusion is that this is a compiler bug in Clang. It is definitely safe to ignore this warning, because (unsigned short) arguments are always promoted to (int) before they are passed to vararg functions anyway. This is all stuff that is in the C standard (and it applies to Objective C, too).

printf("%hd", 1); // Clang generates warning. GCC does not.
                  // Clang is wrong, GCC is right.

printf("%hd", 1 << 16); // Clang generates warning.  GCC does not.
                        // Clang is right, GCC is wrong.

The problem here is that neither compiler looks deep enough.

Remember, it is actually impossible to pass a short to printf(), because it must get promoted to int. GCC never gives a warning for constants; Clang ignores the fact that you are passing a constant and always gives a warning because the type is wrong. Both behaviors are wrong.
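The promotion rule can be checked directly. The following is a minimal C sketch, not part of the original answer; `read_first_vararg` and `pass_unsigned_short` are hypothetical helper names, and it assumes a platform where `int` is wider than `unsigned short` (true for the Apple targets discussed here). It passes an `unsigned short` through a variadic call and reads it back as `int`, the only type the callee may legally request from `va_arg` in that situation.

```c
#include <stdarg.h>

/* Hypothetical helper: returns the first variadic argument, read as int.
 * Asking for va_arg(ap, unsigned short) instead would be undefined behavior,
 * because the default argument promotions mean no short is ever passed. */
int read_first_vararg(int count, ...)
{
    va_list ap;
    va_start(ap, count);
    int value = va_arg(ap, int);
    va_end(ap);
    return value;
}

/* Passes an explicit unsigned short; the callee still receives an int. */
int pass_unsigned_short(unsigned short ch)
{
    return read_first_vararg(1, ch);
}
```

Calling `pass_unsigned_short(0x0085)` and `read_first_vararg(1, 0x0085)` hand the callee the same `int` value, which is why the cast changes what the compiler's format checker sees but not what is passed at run time.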

I suspect nobody has noticed because -- why would you be passing a constant expression to printf() anyway?

In the short term, you can use the following hack:

#pragma GCC diagnostic ignored "-Wformat"
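If the pragma is used, it can be limited to the offending statement so that `-Wformat` stays active for the rest of the file. A sketch, assuming a GCC or Clang toolchain (both accept the `#pragma GCC diagnostic` family); `format_code_unit` is an illustrative name, and plain C `snprintf` with `%hu` stands in for `stringWithFormat:` with `%C`:

```c
#include <stdio.h>

/* Writes the decimal value of 0x0085 into buf and returns the length.
 * The int constant is deliberately paired with a short conversion so that
 * the format checker has something to complain about. */
int format_code_unit(char *buf, size_t size)
{
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wformat"
    int written = snprintf(buf, size, "%hu", 0x0085);
#pragma GCC diagnostic pop
    return written;
}
```

The `push`/`pop` pair restores the previous diagnostic state, so format mistakes elsewhere in the file are still reported.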

Universal character names

You can use \uXXXX notation. Except you can't, because the compiler won't let you use U+0085 this way. Why? See § 6.4.3 of C99:

A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.

This rules out \u0085.
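The quoted constraint is mechanical enough to write down as a predicate. A sketch in C; `c99_ucn_allowed` is a made-up name, and it encodes only the one sentence quoted above, not every check a real compiler performs:

```c
#include <stdbool.h>

/* Returns true if C99 6.4.3 (as quoted above) permits \uXXXX for this code point. */
bool c99_ucn_allowed(unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;   /* surrogate range is excluded */
    if (cp < 0x00A0)
        return cp == 0x0024 || cp == 0x0040 || cp == 0x0060;   /* $, @, ` */
    return true;
}
```

Under this rule U+2028 and U+2029 pass, while U+0085 and U+000C do not, which matches what the compiler accepted and rejected in the question.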

There is a proposal to fix this part of the spec.

The fix

You really want a constant string, don't you? Use this:

[NSCharacterSet characterSetWithCharactersInString:
  @"\t\n\r\xc2\x85\x0c\u2028\u2029"]

This relies on the fact that the source encoding is UTF-8. Don't worry, that's not going to change any time soon.

The \xc2\x85 in the string is the UTF-8 encoding of U+0085. The appearance of 85 in both is a coincidence.
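To see where `\xc2\x85` comes from, here is a small sketch of UTF-8 encoding for code points up to U+FFFF. `utf8_encode_bmp` is an illustrative name, not a library function.

```c
#include <stddef.h>

/* Encodes cp as UTF-8 into out and returns the number of bytes written.
 * Handles only U+0000..U+FFFF excluding surrogates; returns 0 otherwise. */
size_t utf8_encode_bmp(unsigned long cp, unsigned char out[3])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF)) {
        /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

For U+0085 this yields the two bytes `0xC2 0x85`. The repeated 85 follows from the bit layout: for any code point in U+0080 through U+00BF, the low six bits with the `10` continuation prefix reproduce the code point's own value in the second byte.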

This answer definitely worked and so it has been accepted. I will do more research into this. but thanks for the in depth explanation. — jwknz, Nov 25 '12 at 08:52

score 2 · Answer 2 · answered Nov 25 '12 at 04:38

The problem is that 0x0085 and the other constants are literal ints, so they don't match the %C format specifier, which expects a unichar (an unsigned short).

There's no direct way to specify a literal short in C and I'm not aware of any Objective-C extension. But you can use a brute-force approach:

NSCharacterSet *stopCharacters =
         [NSCharacterSet characterSetWithCharactersInString:
                  [NSString stringWithFormat:@"< \t\n\r%C%C%C%C", 
                               (unichar)0x0085, (unichar)0x000C,
                               (unichar)0x2028, (unichar)0x2029]];
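This also suggests why the `%i` change in the question compiled cleanly but broke parsing: `%i` formats each argument as decimal digits rather than as a single character. A plain C sketch, using `snprintf` as a stand-in for `stringWithFormat:` (`format_as_integers` is an illustrative name):

```c
#include <stdio.h>

/* Formats the four code points the way the question's %i version did. */
int format_as_integers(char *buf, size_t size)
{
    return snprintf(buf, size, "%i%i%i%i", 0x0085, 0x000C, 0x2028, 0x2029);
}
```

The result is the 13-character string `1331282328233` (133, 12, 8232 and 8233 run together). A character set built from it contains the digits 1, 2, 3 and 8 instead of the four line-break characters, so the scanner would treat those digits as stop characters, which would plausibly mangle dates and URLs.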

Tommy

`0x0C` is form feed which can be represented by `\f`. – rmaddy Nov 25 '12 at 04:43
Interesting. In spite of the cast to `(unichar)`, the constants will be promoted *back* to `int` when they are passed to `stringWithFormat:`. – Dietrich Epp Nov 25 '12 at 04:44
@DietrichEpp yep, they'll have to go through the normal C promotion rules as part of the vararg side of things but the compiler warning is sort of looking around that. Also weirdly the compiler will accept `\u2028`, `\u2029` but not anything in the ASCII range; `\x0c` (or `\f`) is acceptable as ASCII but `\x85` is then flagged up as invalid UTF-8. So I'm immediately stuck on an in-line `printf` solution for that other than breaking 0x0085 into UTF-8 and providing as `\x`s. – Tommy Nov 25 '12 at 04:49
I'm a complete newbie when it comes to encodings–and so grateful to have found these answers! Any potential pitfalls with the cast-to-unichar approach? – Mark Feb 05 '13 at 21:20

score 0 · Answer 3 · answered Nov 25 '12 at 04:30

You don't need stringWithFormat, you can embed unicode chars directly into a string using the \u escape. For example \u0085.

borrrden

But the \u0085 won't compile. – rmaddy Nov 25 '12 at 04:31