
I would like to use some text in my app that is kind of messy. I don't have control over the text, so it is what it is.

I'm looking for a lightweight[1] approach to cleaning up everything shown in these examples:

original: <p>Occasionally we&nbsp;deal&nbsp;with this.</p>                       desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p>      desired: Sometimes they emphasize like this, I could live with it
original: <p>This is junk, but it's what I have<\/p>\r\n                         desired: This is junk, but it's what I have
original: <p>This is test1</p>                                                   desired: This is test1
original: <p>This is u\u00f1icode</p>                                            desired: This is uñicode

So the problems are: HTML entities such as &nbsp;, Unicode escapes such as \u00f1, HTML paragraph tags such as <p> and </p>, escaped line endings such as \r\n, and stray backslashes \ in odd places. The goal is to translate whatever is translatable and remove the rest of the junk.

Although I could manipulate the strings directly and handle each of these cases individually, I wondered whether there is a simple[1] way to clean up these strings without too much overhead[1].

A partial answer has already been provided, but the examples contain more problems than it fixes. That solution translates HTML special characters, but it does not decode Unicode escapes formatted as \u0000, remove HTML tags, and so on.

Additional Things I Tried

This is not the global solution I was looking for, but it shows the direction one could go to solve the problem.

let samples = ["<p>This is test1</p>                                             ":"This is test1",
           "<p>This is u\\u00f1icode</p>                                      ":"This is uñicode",
           "<p>This is u&#x00f1;icode</p>                                       ":"This is uñicode",
           "<p>This is junk, but it's what I have<\\/p>\\r\\n                   ":"This is junk, but it's what I have",
           "<p>Sometimes they \\emphasize\\ like this, I could live with it</p>":"Sometimes they emphasize like this, I could live with it",
           "<p>Occasionally we&nbsp;deal&nbsp;with this.</p>                 ":"Occasionally we deal with this."]

for (key, value) in samples {
    print ("original: \(key)      desired: \(value)" )
}

print("\n\n\n")

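for (key, _) in samples {
    // Strip the padding spaces and unescape "\/" so that "<\/p>" becomes "</p>".
    var _key = key.trimmingCharacters(in: CharacterSet.whitespaces)
    _key = _key.replacingOccurrences(of: "\\/", with: "/")

    // Drop a trailing literal "\r\n" and the enclosing paragraph tags.
    if _key.hasSuffix("\\r\\n") { _key = String(_key.dropLast(4)) }
    if _key.hasPrefix("<p>") { _key = String(_key.dropFirst(3)) }
    if _key.hasSuffix("</p>") { _key = String(_key.dropLast(4)) }

    // Rewrite each "\uXXXX" as the HTML entity "&#xXXXX;" so the entity decoder can handle it.
    while let uniRange = _key.range(of: "\\u") {
        // Stop instead of crashing when fewer than four characters follow "\u".
        guard let charDefEnd = _key.index(uniRange.upperBound, offsetBy: 4, limitedBy: _key.endIndex) else { break }
        let charDefRange = uniRange.upperBound..<charDefEnd
        let uniFullRange = uniRange.lowerBound..<charDefEnd
        let charDef = "&#x" + _key[charDefRange] + ";"

        _key = _key.replacingCharacters(in: uniFullRange, with: charDef)
    }

    // stringByDecodingHTMLEntities is not part of Foundation; it is assumed to be the
    // String extension from the "How do I decode HTML entities in swift?" answer.
    let decoded = _key.stringByDecodingHTMLEntities
    print("decoded: \(decoded)")
}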

OUTPUT

original: <p>Occasionally we&nbsp;deal&nbsp;with this.</p>                       desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p>      desired: Sometimes they emphasize like this, I could live with it
original: <p>This is u&#x00f1;icode</p>                                          desired: This is uñicode
original: <p>This is junk, but it's what I have<\/p>\r\n                         desired: This is junk, but it's what I have
original: <p>This is test1</p>                                                   desired: This is test1
original: <p>This is u\u00f1icode</p>                                            desired: This is uñicode




decoded: Occasionally we deal with this.
decoded: Sometimes they \emphasize\ like this, I could live with it
decoded: This is uñicode
decoded: This is junk, but it's what I have
decoded: This is test1
decoded: This is uñicode

Footnotes: 1. There are probably many larger packages or libraries that could do this as a very small part of their total functionality, and those are of less interest here.

Dale
  • This should do the job: [How do I decode HTML entities in swift?](https://stackoverflow.com/questions/25607247/how-do-i-decode-html-entities-in-swift) – Martin R Apr 17 '18 at 18:31
  • Concepts "lightweight", "simple" and "without too much overhead" seem open-ended / opinion-based. Do you just mean "I'd prefer not to do any actual programming"? As you say yourself, it is what it is. – matt Apr 17 '18 at 18:33
  • What I mean is that if someone has encountered this problem before, there's little need for me to reinvent a solution. – Dale Apr 17 '18 at 18:41
  • So you're just asking that we do the searching for you. And Martin R has done that (and given a great answer). I suggest you upvote his answer and delete this one, which is effectively a duplicate. – matt Apr 17 '18 at 19:07
  • @MartinR, that answer helps with a portion of the problem: the HTML special characters. Thanks. I use JSoup in another programming domain that does all of the cleaning. But I'm new in this domain, so started my search here on SO. – Dale Apr 17 '18 at 19:38

1 Answer

I don't understand the purpose of the weird backslashes, but to remove HTML tags, HTML entities, and escape sequences, you can do the following replacements using regular expressions:

Note that you need a dictionary of HTML entities; otherwise this won't work. The number of escapes is small, and creating the full dictionary won't be complicated.

import Foundation

let strings = [
    "<p>Occasionally we&nbsp;deal&nbsp;with this.</p> ",
    "<p>Sometimes they \\emphasize\\ like this, I could live with it</p>",
    "<p>This is junk, but it's what I have<\\/p>\\r\\n",
    "<p>This is test1</p>",
    "<p>This is u\\u00f1icode</p>",
]

// the pattern needs exactly one capture group
func replaceEntities(in text: String, pattern: String, replace: (String) -> String?) -> String {
    let buffer = (text as NSString).mutableCopy() as! NSMutableString
    let regularExpression = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)

    let matches = regularExpression.matches(in: text, options: [], range: NSRange(location: 0, length: buffer.length))

    // need to replace from the end or the ranges will break after first replacement
    for match in matches.reversed() {
        let captureGroupRange = match.range(at: 1)
        let matchedEntity = buffer.substring(with: captureGroupRange)
        guard let replacement = replace(matchedEntity) else {
            continue
        }
        buffer.replaceCharacters(in: match.range, with: replacement)
    }

    return buffer as String
}

let htmlEntities = [
    "nbsp": "\u{00A0}"
]

func replaceHtmlEntities(_ text: String) -> String {
    return replaceEntities(in: text, pattern: "&([^;]+);") {
        return htmlEntities[$0]
    }
}

let escapeSequences = [
    "n": "\n",
    "r": "\r"
]

func replaceEscapes(_ text: String) -> String {
    return replaceEntities(in: text, pattern: "\\\\([a-z])") {
        return escapeSequences[$0]
    }
}

func removeTags(_ text: String) -> String {
    return text
        .replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)
}

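func replaceUnicodeSequences(_ text: String) -> String {
    // Match exactly four hex digits; the regex is case-insensitive, so A-F is covered too.
    return replaceEntities(in: text, pattern: "\\\\u([0-9a-f]{4})") {
        // Return nil (leave the text untouched) for values that are not valid scalars, e.g. surrogates.
        guard let value = UInt32($0, radix: 16), let scalar = Unicode.Scalar(value) else {
            return nil
        }
        return String(Character(scalar))
    }
}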

let purifiedStrings = strings
    .map(removeTags)
    .map(replaceHtmlEntities)
    .map(replaceEscapes)
    .map(replaceUnicodeSequences)

print(purifiedStrings.joined(separator: "\n"))
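
The pipeline above leaves the stray backslashes around \emphasize\ untouched. If it is acceptable to simply drop every backslash that is still present after the escape and \u steps have run, a minimal sketch could look like the following. The helper name removeStrayBackslashes is hypothetical, and the "drop them all" policy is an assumption about what the data needs, not something the question confirms.

```swift
import Foundation

// Hypothetical helper (an assumption, not part of the original answer):
// removes every remaining backslash. It should run only AFTER the steps that
// interpret "\n", "\r" and "\uXXXX", otherwise those escapes lose their marker.
func removeStrayBackslashes(_ text: String) -> String {
    return text.replacingOccurrences(of: "\\", with: "")
}

print(removeStrayBackslashes("Sometimes they \\emphasize\\ like this"))
```

It would be appended as the last .map step of the chain, since running it earlier would break the escape handling.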

You can also trim leading/trailing whitespace and collapse multiple spaces into a single space, but that's trivial.
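
For completeness, here is one way the whitespace cleanup mentioned above might be sketched. The function name normalizeWhitespace is hypothetical. It assumes that non-breaking spaces (as produced for &nbsp;) and line breaks should be treated like ordinary spaces, which matches the desired output in the question but may not suit every use.

```swift
import Foundation

// Hypothetical helper (not part of the original answer): collapses every run
// of whitespace, including non-breaking spaces and line breaks, into a single
// ordinary space, then trims both ends.
func normalizeWhitespace(_ text: String) -> String {
    return text
        .replacingOccurrences(of: "[\\s\u{00A0}]+", with: " ", options: .regularExpression)
        .trimmingCharacters(in: .whitespacesAndNewlines)
}

print(normalizeWhitespace("Occasionally we\u{00A0}deal\u{00A0}with  this. \r\n"))
```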

You can combine it with the solutions in How do I decode HTML entities in swift?
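
As a rough illustration of what such a combination could cover, numeric character references like &#x00f1; or &#241; can be decoded without any dictionary. The sketch below is only an assumption about one way to do it (the name decodeNumericEntities is hypothetical and is not taken from the linked answer). Named entities such as &nbsp; still need a lookup table.

```swift
import Foundation

// Hypothetical helper: decodes numeric character references such as
// "&#x00f1;" (hexadecimal) or "&#241;" (decimal). Named entities are left as-is.
func decodeNumericEntities(_ text: String) -> String {
    let regex = try! NSRegularExpression(pattern: "&#(x[0-9a-f]+|[0-9]+);", options: .caseInsensitive)
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    var result = text

    // Replace from the end so the ranges of earlier matches stay valid.
    for match in regex.matches(in: text, options: [], range: fullRange).reversed() {
        guard let whole = Range(match.range, in: result),
              let group = Range(match.range(at: 1), in: result) else { continue }

        let body = String(result[group])
        let isHex = body.hasPrefix("x") || body.hasPrefix("X")
        let digits = isHex ? String(body.dropFirst()) : body

        // Skip values that overflow or are not valid Unicode scalars (e.g. surrogates).
        guard let value = UInt32(digits, radix: isHex ? 16 : 10),
              let scalar = Unicode.Scalar(value) else { continue }

        result.replaceSubrange(whole, with: String(Character(scalar)))
    }
    return result
}

print(decodeNumericEntities("This is u&#x00f1;icode"))
```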

Sulthan
  • Sure, this is the kind of thing I would do too. But I have to wonder how this is "lightweight and simple". That's my problem with the OP's question. Instead of asking how to do it, he orders us not to give certain kinds of answer (without being altogether clear on what kind of answer he is forbidding). – matt Apr 20 '18 at 17:31