
I have the following input:

<table class="fiche_table_caracter"><tbody>
<tr>
    <td class="caracteristique"><strong>Design</strong></td>
    <td>Classique (full tactile)</td>
</tr>

<tr>
    <td class="caracteristique"><strong>Système d'exploitation (OS)</strong></td>
    <td>iOS</td>
</tr>
<tr>
    <td class="caracteristique"><strong>Ecran</strong></td>
    <td>4,7'' (1334 x 750 pixels)<br />16 millions de couleurs</td>
</tr>
<tr>
    <td class="caracteristique"><strong>Mémoire interne</strong></td>
    <td>128 Go, 1 Go RAM</td>
</tr>
<tr>
    <td class="caracteristique"><strong>Appareil photo</strong></td>
    <td>8 mégapixels</td>
</tr>
</tbody>
</table>

I need to extract only the content of the <td> tags. This is what I did:

NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"<tr*>(.*?)</tr>" options:NSRegularExpressionCaseInsensitive error:NULL];

            NSArray *myArray = [regex matchesInString:str options:0 range:NSMakeRange(0, [str length])] ;
            UA_log(@"counttt: %d", [myArray count]);
            NSMutableArray *matches = [NSMutableArray arrayWithCapacity:[myArray count]];

            for (NSTextCheckingResult *match in myArray) {
                NSRange matchRange = [match rangeAtIndex:1];
                [matches addObject:[str substringWithRange:matchRange]];
                NSLog(@"Regex output:%@", [matches lastObject]);
                NSString * str2 = [matches lastObject];
                NSRegularExpression *regex2 = [NSRegularExpression regularExpressionWithPattern:@"<td*>(<strong>)?(.*?)(</strong>)?</td>" options:NSRegularExpressionCaseInsensitive error:NULL];

                NSArray *myArray2 = [regex2 matchesInString:str2 options:0 range:NSMakeRange(0, [str2 length])] ;
                UA_log(@"counttt: %d", [myArray2 count]);
                NSMutableArray *matches2 = [NSMutableArray arrayWithCapacity:[myArray2 count]];

                for (NSTextCheckingResult *match2 in myArray2) {
                    NSRange matchRange2 = [match2 rangeAtIndex:1];
                    [matches2 addObject:[str2 substringWithRange:matchRange2]];
                    NSLog(@"Regex2 output:%@", [matches2 lastObject]);
                    NSString * lastObject2 = [matches2 lastObject];

                }

            }

The problem is that I want to make the <strong> tag optional, but it doesn't work. With this code I can extract each "tr", but not the content of the "td" cells.

Please help!

I would like to extract:

1-

Design

Classique (full tactile)

2-

Système d'exploitation (OS)

iOS

3-

Ecran

16 millions de couleurs

4-

Mémoire interne

128 Go, 1 Go RAM
– Mireil
  • Can you edit you question to include what you want to extract? – Code Different Jul 16 '15 at 12:09
  • I edited my question – Mireil Jul 16 '15 at 12:20
  • Try using 1) `(?s)<tr[^<]*>(.*?)</tr>`, and 2) `(?s)(?:<td\\b[^<]*>|\\G(?!^))(?:<[^<]+>)?(?!\\s+)([^<]*)(?:<[^<]+>)?`. See [demo](https://regex101.com/r/cM4yC5/1). – Wiktor Stribiżew Jul 16 '15 at 12:39
  • Try reading the XML string into an NSDictionary with an [xml parser](https://github.com/amarcadet/XMLReader); you can then extract any value from it. This is a better approach than working on the raw string. – Akhilrajtr Jul 16 '15 at 12:42
  • @stribizhev: Hi stribizhev, thank you for your solution and the demo. It works great! Please post your solution as an answer so I can accept it as the correct one. I am new to Stack Overflow, so I cannot upvote your comment. – Mireil Jul 23 '15 at 09:02
  • @Mireil Please do not accept a regex-based answer. **Regular expressions are not suited for parsing XML, even if you think they are.** You should use an **XML parser** to parse XML. [Related](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – The Paramagnetic Croissant Jul 23 '15 at 09:11
  • @The Paramagnetic Croissant: What I liked about the regex solution is that I don't have to specify the XML tags. My XML tree has dynamic tags, so I couldn't rely on XML parsing: as I said, I don't control the tag names, only the tree structure. – Mireil Jul 23 '15 at 09:24

2 Answers

Use XMLReader to parse the string into an NSDictionary, then walk the dictionary:

#import "XMLReader.h"

NSData *data = [str dataUsingEncoding:NSUTF8StringEncoding];
NSError *error = nil;
NSDictionary *dict = [XMLReader dictionaryForXMLData:data error:&error];
NSArray *trArray = [dict valueForKeyPath:@"table.tbody.tr"];
NSArray *tdArray = [trArray valueForKey:@"td"];
NSInteger i = 1;
for (NSArray *tdItems in tdArray) {
    NSString *stringValue = @"";
    for (NSDictionary *td in tdItems) {
        if ([td valueForKey:@"strong"]) {
            NSDictionary *strong = [td valueForKey:@"strong"];
            if ([strong valueForKey:@"text"]) {
                stringValue = [stringValue stringByAppendingString:[NSString stringWithFormat:@"\n %@", [strong valueForKey:@"text"]]];
            }
        } else if ([td valueForKey:@"text"]) {
            stringValue = [stringValue stringByAppendingString:[NSString stringWithFormat:@"\n %@", [td valueForKey:@"text"]]];
        }
    }
    NSLog(@"%ld- %@", (long)i, stringValue); // NSInteger needs %ld and a cast to long
    i++;
}
– Akhilrajtr

THE "RIGHT WAY" WITH HTML PARSER

Whenever you deal with arbitrary HTML, you need an HTML parser to extract information from it, e.g. hpple (TFHpple), the parser used in Ray Wenderlich's tutorial. Here is an example of using it (note that you want the contents of the td nodes whose class attribute is set to caracteristique, so the XPath to use is @"//tr/td[@class='caracteristique']"):

#import "TFHpple.h"

- (void)loadDataFromHtml {
    // stringUrl is assumed to hold the address of the page to parse.
    NSURL *url = [NSURL URLWithString:stringUrl];
    // This call blocks until the download finishes, so keep it off the main thread.
    NSData *data = [NSData dataWithContentsOfURL:url];
    TFHpple *parser = [TFHpple hppleWithHTMLData:data];
    NSString *XpathQueryString = @"//tr/td[@class='caracteristique']"; // Here, we use the XPath
    NSArray *nodes = [parser searchWithXPathQuery:XpathQueryString];
    for (TFHppleElement *element in nodes) {
        NSLog(@"%@", [element content]);
    }
}

See more on this at Parse HTML in objective C, and How to Parse HTML on iOS.

REGEX FIX (SINCE OP REQUIRES IT)

Here are fixes for your regular expressions:

The first one should be:

(?s)<tr[^<]*>(.*?)</tr>

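In your pattern, r* only repeats the letter r, so <tr*> cannot match a tag that carries attributes. With [^<]* we stay inside the opening <tr> tag and match all of its attributes, and (?s) lets . match across the line breaks between <tr> and </tr>.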
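
For illustration, the first pattern could be applied with NSRegularExpression roughly like this. It is only a minimal sketch: the helper name RowBodies is made up for this example, and it assumes the whole table is already available as an NSString.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the question): returns the inner HTML
// of every <tr>...</tr> element found in `html`.
static NSArray<NSString *> *RowBodies(NSString *html) {
    NSError *error = nil;
    // (?s) lets . match newlines; [^<]* skips any attributes of <tr>.
    NSRegularExpression *rowRegex =
        [NSRegularExpression regularExpressionWithPattern:@"(?s)<tr[^<]*>(.*?)</tr>"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (rowRegex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *rows = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [rowRegex matchesInString:html options:0 range:NSMakeRange(0, html.length)];
    for (NSTextCheckingResult *match in matches) {
        // Capture group 1 is everything between <tr...> and </tr>.
        [rows addObject:[html substringWithRange:[match rangeAtIndex:1]]];
    }
    return rows;
}
```

Each returned string is one row body, which can then be fed to the second pattern.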

The second regex:

(?s)(?:<td\\b[^<]*>|\\G(?!^))(?:<[^<]+>)?(?!\\s+)([^<]*)(?:<[^<]+>)?

It matches every piece of text while skipping the tags. See [demo](https://regex101.com/r/cM4yC5/1).

Explanation:

  • (?s) - enables single-line (DOTALL) mode, in which . also matches newline characters
  • (?:<td\\b[^<]*>|\\G(?!^)) - starts each match either at an opening <td...> tag or at the end of the previous match (\\G(?!^), where (?!^) excludes the very start of the string)
  • (?:<[^<]+>)? - optionally matches one tag of the form <...>
  • (?!\\s+)([^<]*) - captures the text up to the next tag, provided it does not start with whitespace
  • (?:<[^<]+>)? - optionally matches one tag of the form <...>
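
Following the explanation above, here is a sketch of how the second pattern might be applied to a single row body. The helper name CellTexts is hypothetical, and the doubled backslashes are only there because the pattern sits inside an Objective-C string literal. The sketch relies on NSRegularExpression supporting \G, which its documentation lists. Since the pattern can also match a lone closing tag with an empty capture group, those results are skipped.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the question): returns the non-empty
// text pieces found inside the <td> cells of one row body.
static NSArray<NSString *> *CellTexts(NSString *rowHTML) {
    NSString *pattern =
        @"(?s)(?:<td\\b[^<]*>|\\G(?!^))(?:<[^<]+>)?(?!\\s+)([^<]*)(?:<[^<]+>)?";
    NSError *error = nil;
    NSRegularExpression *cellRegex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (cellRegex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *texts = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [cellRegex matchesInString:rowHTML options:0 range:NSMakeRange(0, rowHTML.length)];
    for (NSTextCheckingResult *match in matches) {
        NSRange textRange = [match rangeAtIndex:1];
        // Skip matches whose capture group is empty (e.g. a bare </td>).
        if (textRange.location == NSNotFound || textRange.length == 0) {
            continue;
        }
        [texts addObject:[rowHTML substringWithRange:textRange]];
    }
    return texts;
}
```

Called on each row body in turn, this gives the label first and the value pieces after it; a cell containing <br /> contributes one piece per line of text.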
– Wiktor Stribiżew