
How do I modify the following string manipulation to look for "Text to Extract" in the HTML code below? I don't understand the "(?<=')[^']+" part. I know it is a regex pattern, and I looked it up on a website, but I don't get the logic of it. Maybe if someone shows me how it applies to my question, I could understand it better.

if let match = dataString?.range(of: "(?<=')[^']+", options: .regularExpression) {
    print(dataString?.substring(with: match) as Any)
}

HTML code:

 <span class="phrase">Text to Extract</span></span></span></p> 
Dev0urCode
  • Do not parse HTML with regexp: https://stackoverflow.com/a/1732454/8332700 – Verv Aug 24 '17 at 18:42
  • To put it in simple terms, it's a pattern that matches one or more characters that aren't `'`, preceded by a `'`. But as @Verv said, do not use regex to parse HTML. Instead try a solution here: https://stackoverflow.com/questions/31080818/what-is-the-best-practice-to-parse-html-in-swift – CAustin Aug 24 '17 at 18:44
  • Sorry, I forgot to mention that the HTML has been downloaded and encoded into a UTF-8 string. Does that work? – Dev0urCode Aug 24 '17 at 18:44
  • You can use `NSAttributedString`: https://stackoverflow.com/questions/23757655/how-to-remove-html-tags-from-nsstring-in-iphone – Larme Aug 24 '17 at 20:00

1 Answer


First, https://regex101.com/ is a free online resource where you can test regex, and it will explain what each part of it is doing.

The regex `(?<=')[^']+` can be broken down as follows:

  • `(?<=<token>)` is a positive look-behind for a token. In this case, the single-quote character (`'`).
  • `[^<chars>]` matches anything that is not one of the listed characters. In this case, the single-quote character (`'`).
  • `+` matches the previous token 1 or more times. In this case, `[^']`.

So the above regex matches each run of non-quote characters that comes right after a '. Note that it does not require a closing quote and has no concept of opening and closing, so in a'b'c'd'e it would match b, c, d, and also the trailing e.
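As a quick check of that behaviour, here is a minimal sketch that runs the pattern over a'b'c'd'e. The helper name `allMatches` is made up for illustration. It uses `NSRegularExpression` to list every match, whereas `range(of:options:)` from the question only returns the first one. Note that the trailing e is matched as well, because the pattern only requires a quote before the match, not after it.

```swift
import Foundation

// Hypothetical helper: returns every substring of `text` matched by `pattern`.
func allMatches(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: fullRange).compactMap { result in
        // Convert the NSRange of each match back into a Swift string range.
        Range(result.range, in: text).map { String(text[$0]) }
    }
}

let sample = "a'b'c'd'e"
let found = allMatches(of: "(?<=')[^']+", in: sample)
print(found) // ["b", "c", "d", "e"]

// range(of:options:) stops at the first match.
if let first = sample.range(of: "(?<=')[^']+", options: .regularExpression) {
    print(sample[first]) // b
}
```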

To match a literal phrase, you would just use that phrase in your regex (escaping any regex special characters with \).
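Applied to the HTML in the question, one possible sketch looks like this. It assumes the wanted text always follows the literal `<span class="phrase">` and ends at the next `<`. The helper name `firstText(after:in:)` is hypothetical. `NSRegularExpression.escapedPattern(for:)` takes care of escaping any regex special characters in the literal, and the `"` characters inside the Swift string literals are written as `\"` so they do not end the string early.

```swift
import Foundation

// Hypothetical helper: returns the text that follows `prefix`, up to the next "<".
func firstText(after prefix: String, in html: String) -> String? {
    // Escape regex metacharacters in the literal prefix before embedding it.
    let escaped = NSRegularExpression.escapedPattern(for: prefix)
    // Look-behind for the literal prefix, then one or more characters that are not "<".
    let pattern = "(?<=\(escaped))[^<]+"
    guard let match = html.range(of: pattern, options: .regularExpression) else {
        return nil
    }
    return String(html[match])
}

// Quotes inside a Swift string literal must be escaped with a backslash.
let html = "<span class=\"phrase\">Text to Extract</span></span></span></p>"
let text = firstText(after: "<span class=\"phrase\">", in: html)
print(text ?? "no match") // Text to Extract
```

This treats the HTML as a plain blob of text, so it only holds up as long as the markup around the phrase stays exactly the same.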

If you need context-aware (nesting-aware) extraction, any regex will be inherently wrong, and you will need an HTML parser to extract it for you.

Tezra
  • Thanks, really helpful website ! Can I use regex on HTML encoded as String UTF8 or should I look elsewhere ? – Dev0urCode Aug 24 '17 at 18:55
  • @Dev0urCode You can use regex on text that happens to be HTML. Regex has no concept of nesting, HTML can work even if it's malformed (like a missing tag), and regex only matches one continuous pattern. So regex can find the pattern `key-value:"rawr"` and extract rawr, but it can't extract "help me" from `help me`. If you care about respecting the HTML syntax, regex is not nearly powerful enough to handle it. Regex only works if you can treat the string like any random blob of text. – Tezra Aug 24 '17 at 19:01
  • Can you help? I'm writing `if let match = dataString?.range(of: "(?<=<span class="phrase">)[^<]+", options: .regularExpression)` but Xcode reads part of the regex as code (Use of unresolved identifier 'phrase'). How do you format this in Swift? – Dev0urCode Aug 24 '17 at 21:09
  • @Dev0urCode You need to escape `"` in your string. The regex is right, but it must be formatted as a proper string in swift too. – Tezra Aug 24 '17 at 21:21
  • Sorry, but I began coding 3 weeks ago; what do you mean by "escape"? Delete? Create a separate variable somehow? I tried deleting already, but then I don't get the value I want. – Dev0urCode Aug 24 '17 at 21:26
  • @Dev0urCode An [escape character](https://en.wikipedia.org/wiki/Escape_character) changes how the compiler reacts to certain characters. The `"` in your string closes the string, but using `\"` instead will treat it as the literal `"` in the string. So replace `"phrase"` with `\"phrase\"`. – Tezra Aug 24 '17 at 21:43
  • I had no idea I could do that ! That's great, it worked like a charm ! thanks a bunch ! – Dev0urCode Aug 24 '17 at 21:47