I would like to use some text in my app that is kind of messy. I don't have control over the text, so it is what it is.
I'm looking for a light weight1 approach to cleaning up all the things shown in the examples here:
original: <p>Occasionally we deal with this.</p> desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p> desired: Sometimes they emphasize like this, I could live with it
original: <p>This is junk, but it's what I have<\/p>\r\n desired: This is junk, but it's what I have
original: <p>This is test1</p> desired: This is test1
original: <p>This is u\u00f1icode</p> desired: This is uñicode
So we see special characters, like
unicode, like \u00f1
, html paragraph, like <p>
and </p>
, new line stuff, like \n\r
, and just weird backslashes \
in places. The desired is translating the translatable and removing the other junk.
Although it's possible for me to manipulate the strings directly, taking care of each of these things individually, I wondered if there was a simple1 way to clean up these strings without too much overhead1.
A partial answer is provided already, but there are more problems to fix in the examples I've provided. That solution translates HTML special characters, but no unicode formatted as \u0000
, not removing HTML tags, etc.
Additional Things I Tried
This is not the global solution I was looking for, but it shows the direction one could go to solve the problem.
let samples = ["<p>This is test1</p> ":"This is test1",
"<p>This is u\\u00f1icode</p> ":"This is u–icode",
"<p>This is uñicode</p> ":"This is u–icode",
"<p>This is junk, but it's what I have<\\/p>\\r\\n ":"This is junk, but it's what I have",
"<p>Sometimes they \\emphasize\\ like this, I could live with it</p>":"Sometimes they emphasize like this, I could live with it",
"<p>Occasionally we deal with this.</p> ":"Occasionally we deal with this."]
for (key, value) in samples {
print ("original: \(key) desired: \(value)" )
}
print("\n\n\n")
for (key, _) in samples {
var _key = key.trimmingCharacters(in: CharacterSet.whitespaces)
_key = _key.replacingOccurrences(of: "\\/", with: "/")
if _key.hasSuffix("\\r\\n") { _key = String(_key.dropLast(4)) }
if _key.hasPrefix("<p>") { _key = String(_key.dropFirst(3)) }
if _key.hasSuffix("</p>") { _key = String(_key.dropLast(4)) }
while let uniRange = _key[_key.startIndex...].range(of: "\\u") {
let charDefRange = uniRange.upperBound..<_key.index(uniRange.upperBound, offsetBy: 4)
let uniFullRange = uniRange.lowerBound..<charDefRange.upperBound
let charDef = "&#x" + _key[charDefRange] + ";"
_key = _key.replacingCharacters(in: uniFullRange, with: charDef)
}
let decoded = _key.stringByDecodingHTMLEntities
print("decoded: \(decoded)")
}
OUTPUT
original: <p>Occasionally we deal with this.</p> desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p> desired: Sometimes they emphasize like this, I could live with it
original: <p>This is uñicode</p> desired: This is uñicode
original: <p>This is junk, but it's what I have<\/p>\r\n desired: This is junk, but it's what I have
original: <p>This is test1</p> desired: This is test1
original: <p>This is u\u00f1icode</p> desired: This is uñicode
decoded: Occasionally we deal with this.
decoded: Sometimes they \emphasize\ like this, I could live with it
decoded: This is uñicode
decoded: This is junk, but it's what I have
decoded: This is test1
decoded: This is uñicode
Footnotes: 1. There are probably many larger packages or libraries that could do this as a very small part of their total functionality, and those are of less interest here.