
I'm parsing some Markdown and need to handle emphasis, where text wrapped in underscores is to be emphasized (such as this is some _emphasized_ text).

However, links also have underscores in them, such as http://example.com/text_with_underscores/, and currently my regular expression picks up _with_ as an attempt at emphasized text.

Obviously I don't want that. Since emphasis in the middle of a word is valid (such as longword*with*emphasis), my go-to solution is to parse links first and somehow "mark" those replacements so they are not touched again. Is this possible?

Alan Moore
Doug Smith

3 Answers


One solution is to implement it like this:

NSString *yourStr = @"this is some _emphasized_ text";
NSMutableString *mutStr = [NSMutableString string];
BOOL insideEmphasis = NO; // toggled at every underscore
for (NSUInteger i = 0; i < yourStr.length; i++)
{
    unichar c = [yourStr characterAtIndex:i];
    if (c == '_' && !insideEmphasis)
    {
        // opening underscore
        [mutStr appendString:@"<em>"];
        insideEmphasis = YES;
    }
    else if (c == '_')
    {
        // closing underscore
        [mutStr appendString:@"</em>"];
        insideEmphasis = NO;
    }
    else
    {
        // any other character is copied through unchanged
        [mutStr appendFormat:@"%C", c];
    }
}
NSLog(@"%@", mutStr);

Output:

this is some <em>emphasized</em> text
Hussain Shabbir
  • Wait, that output doesn't make sense. As I said in the original post, it should not do it for URLs, but it should for non-URLs. I would want a URL to be parsed as one without adding random emphasis, and emphasized text to be emphasized. – Doug Smith Dec 15 '14 at 07:29
0
NSString *yourString = @"media_w940996738_ _help_  476.mp3";
NSError *error = nil;
// matches an underscore, one or more word characters, then another underscore
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"([_])\\w+([_])" options:NSRegularExpressionCaseInsensitive error:&error];

// reassigned inside the block, so it must be declared __block
__block NSString *yourNewString = [NSString stringWithString:yourString];
[regex enumerateMatchesInString:yourString options:0 range:NSMakeRange(0, [yourString length]) usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop){

    // the whole match, including the surrounding underscores
    NSString *subString = [yourString substringWithRange:[match rangeAtIndex:0]];

    // shrink the range by one character at each end to drop the underscores
    NSRange range = [match rangeAtIndex:0];
    range.location += 1;
    range.length -= 2;

    // wrap the inner text in <em> tags and substitute it into the result
    NSString *string = [NSString stringWithFormat:@"<em>%@</em>", [yourString substringWithRange:range]];
    yourNewString = [yourNewString stringByReplacingOccurrencesOfString:subString withString:string];
}];
johny kumar
0

First, a more usual way to do processing like this would be to tokenise the input; this makes handling each kind of token easier and is probably more efficient for large inputs. That said, here is how to solve your problem using regular expressions.

Consider:

  1. matchesInString:options:range: returns all the non-overlapping matches for a regular expression.

  2. Regular expressions are built from smaller regular expressions and can contain alternatives. So if you have REemphasis which matches strings to emphasise and REurl which matches URLs, then (REemphasis)|(REurl) matches both.

  3. NSTextCheckingResult, instances of which are returned by matchesInString:options:range:, reports the range of each group in the match. If a group does not take part in the match because of alternatives in the pattern, that group's NSRange.location is set to NSNotFound. So for the above pattern, (REemphasis)|(REurl), if group 1 is NSNotFound the match is for the REurl alternative; otherwise it is for the REemphasis alternative.

  4. The method replacementStringForResult:inString:offset:template: returns the replacement string for a match based on the template (aka the replacement pattern).
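
As a small illustration of points 2 and 3, here is a minimal sketch that only labels each match rather than replacing anything. The helper name ClassifyMatches and the deliberately crude URL pattern are assumptions made for this example; they are not part of any API.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (name is an assumption): labels every match as
// "emphasis" or "url" by checking whether group 1 took part in the match.
static NSArray<NSString *> *ClassifyMatches(NSString *input)
{
    // Group 1 is the emphasis alternative, group 2 a crude URL alternative.
    NSString *pattern = @"(_[^_]+_)|(https?://\\S+)";
    NSError *error = nil;
    NSRegularExpression *re = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                        options:0
                                                                          error:&error];
    if (re == nil) {
        return @[]; // the pattern failed to compile; details are in error
    }

    NSMutableArray<NSString *> *labels = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [re matchesInString:input options:0 range:NSMakeRange(0, input.length)];
    for (NSTextCheckingResult *m in matches) {
        // A group that did not participate reports location == NSNotFound.
        BOOL isEmphasis = [m rangeAtIndex:1].location != NSNotFound;
        NSString *text = [input substringWithRange:m.range];
        [labels addObject:[NSString stringWithFormat:@"%@: %@",
                           isEmphasis ? @"emphasis" : @"url", text]];
    }
    return labels;
}
```

Calling ClassifyMatches(@"http://example.com/a_b_c _emph_") should report the URL as a single "url" match, underscores included, followed by one "emphasis" match for _emph_. The URL alternative consumes its underscores before the emphasis alternative ever sees them.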

The above is enough to write an algorithm to do what you want. Here is some sample code:

- (NSString *) convert:(NSString *)input
{
   NSString *emphPat = @"(_([^_]+)_)"; // note this pattern does NOT allow for markdown's \_ escapes - that needs to be addressed
   NSString *emphRepl = @"<em>$2</em>";

   // a pattern for urls - use whatever suits
   // this one is taken from http://stackoverflow.com/questions/6137865/iphone-reg-exp-for-url-validity
   NSString *urlPat = @"([hH][tT][tT][pP][sS]?:\\/\\/[^ ,'\">\\]\\)]*[^\\. ,'\">\\]\\)])";

   // construct a pattern which matches emphPat OR urlPat
   // emphPat is first so its two groups are numbered 1 & 2 in the resulting match
   NSString *comboPat = [NSString stringWithFormat:@"%@|%@", emphPat, urlPat];

   // build the re
   NSError *error = nil;
   NSRegularExpression *re = [NSRegularExpression regularExpressionWithPattern:comboPat options:0 error:&error];
   // check for error - omitted

   // get all the matches - includes both urls and text to be emphasised
   NSArray *matches = [re matchesInString:input options:0 range:NSMakeRange(0, input.length)];

   NSInteger offset = 0;                        // will track the change in size
   NSMutableString *output = input.mutableCopy; // mutable copy of input to modify to produce output

   for (NSTextCheckingResult *aMatch in matches)
   {
      NSRange first = [aMatch rangeAtIndex:1];

      if (first.location != NSNotFound)
      {
         // the first group has been matched => that is the emphPat (which contains the first two groups)

         // determine the replacement string
         NSString *replacement = [re replacementStringForResult:aMatch inString:output offset:offset template:emphRepl];

         NSRange whole = aMatch.range;                // original range of the match
         whole.location += offset;                    // add in the offset to allow for previous replacements
         offset += replacement.length - whole.length; // modify the offset to allow for the length change caused by this replacement

         // perform the replacement
         [output replaceCharactersInRange:whole withString:replacement];
      }
   }

   return output;
}

Note that the above does not allow for Markdown's \_ escape sequence, which you need to address. You should probably also reconsider the RE used for URLs; this one was just plucked from SO and hasn't been tested properly.

The above will convert

http://example.com/text_with_underscores _emph_

to

http://example.com/text_with_underscores <em>emph</em>
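
On the \_ escape point, one possible direction is an emphasis pattern that refuses to start at an escaped underscore and lets escaped characters through inside the emphasised text. The following is only a sketch under that assumption. The helper name EmphasiseRespectingEscapes is made up for illustration, it handles emphasis on its own (no URL alternative), and it leaves the backslashes in place for a later unescaping pass.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (name is an assumption): wraps _text_ in <em> tags
// while treating \_ as a literal underscore rather than a delimiter.
static NSString *EmphasiseRespectingEscapes(NSString *input)
{
    // The regex, once the Objective-C string escapes are removed, reads:
    //   (?<!\\)_            opening underscore not preceded by a backslash
    //   ((?:\\.|[^_\\])+)   content: an escaped character, or anything
    //                       other than an underscore or a backslash
    //   _                   closing underscore
    NSString *pattern = @"(?<!\\\\)_((?:\\\\.|[^_\\\\])+)_";
    NSError *error = nil;
    NSRegularExpression *re = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                        options:0
                                                                          error:&error];
    if (re == nil) {
        return input; // the pattern failed to compile; leave the text untouched
    }
    // $1 is the text between the delimiters, escapes still included.
    return [re stringByReplacingMatchesInString:input
                                        options:0
                                          range:NSMakeRange(0, input.length)
                                   withTemplate:@"<em>$1</em>"];
}
```

One known limitation of this sketch: the lookbehind checks a single character, so an escaped backslash directly before an opening underscore (\\_) is wrongly treated as an escaped underscore. To use the pattern in the combined expression above it would also need the extra outer group, so that the group numbering stays the same.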

HTH

CRD