I want to extract a value from a string that has a unique starting and ending marker. In my case it's em:
"Fully <em>Furni<\/em>shed |Downtown and Canal Views",
Expected result:
Furnished
I guess you want to remove the tags.
If the backslash is only virtual (not actually present in the string), the pattern is pretty simple: basically <em> with an optional slash, /?:
let trimmedString = string.replacingOccurrences(of: "</?em>", with: "", options: .regularExpression)
To also account for the backslash, it's:
let trimmedString = string.replacingOccurrences(of: "<\\\\?/?em>", with: "", options: .regularExpression)
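As a quick check of the second pattern, here is a minimal sketch that runs it against both forms of the closing tag. The two sample strings and the variable names are assumptions made for illustration; only the pattern comes from the answer above.

```swift
import Foundation

// Hypothetical sample inputs: one with a plain closing tag, one with the escaped form.
let plain = "Fully <em>Furni</em>shed |Downtown and Canal Views"
let escaped = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"

// Pattern from the answer: "<", an optional backslash, an optional slash, then "em>".
let tagPattern = "<\\\\?/?em>"

let cleanedPlain = plain.replacingOccurrences(of: tagPattern, with: "", options: .regularExpression)
let cleanedEscaped = escaped.replacingOccurrences(of: tagPattern, with: "", options: .regularExpression)

print(cleanedPlain)   // Fully Furnished |Downtown and Canal Views
print(cleanedEscaped) // Fully Furnished |Downtown and Canal Views
```

Because both the backslash and the slash are optional, the same pattern covers the opening tag and either spelling of the closing tag.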
If you want to extract only Furnished, you have to use capture groups: one for the string between the tags and one for everything after the closing tag up to the next whitespace character.
let string = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"
let pattern = "<em>(.*)<\\\\?/em>(\\S+)"
do {
    let regex = try NSRegularExpression(pattern: pattern)
    if let match = regex.firstMatch(in: string, range: NSRange(string.startIndex..., in: string)) {
        let part1 = string[Range(match.range(at: 1), in: string)!]
        let part2 = string[Range(match.range(at: 2), in: string)!]
        print(String(part1 + part2))
    }
} catch { print(error) }
Given this string:
let str = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"
and the corresponding NSRange:
let range = NSRange(location: 0, length: (str as NSString).length)
Let's construct a regular expression that matches letters that are either between <em> and <\/em>, or preceded by <\/em>:
let regex = try NSRegularExpression(pattern: "(?<=<em>)\\w+(?=<\\\\/em>)|(?<=<\\\\/em>)\\w+")
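What it does is:
\\w+ matches one or more word characters,
(?<=<em>) (positive lookbehind) requires them to be preceded by <em>,
(?=<\\\\/em>) (positive lookahead) requires them to be followed by <\/em>,
| means "or",
\\w+ matches one or more word characters,
(?<=<\\\\/em>) (positive lookbehind) requires them to be preceded by <\/em>.
Let's get the matches: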
let matches = regex.matches(in: str, range: range)
Which we can turn into substrings:
let strings: [String] = matches.compactMap { match in
    // NSRange counts UTF-16 code units, so convert it with Range(_:in:) instead of offsetting by characters
    guard let swiftRange = Range(match.range, in: str) else { return nil }
    return String(str[swiftRange])
}
Now we can join the strings in even indices, with the ones in odd indices:
let evenStride = stride(from: strings.startIndex,
                        to: strings.index(strings.endIndex, offsetBy: -1),
                        by: 2)
let result = evenStride.map { strings[$0] + strings[strings.index($0, offsetBy: 1)] }
print(result) //["Furnished"]
We can test it with another string:
let str2 = "<em>Furni<\\/em>shed <em>balc<\\/em>ony <em>gard<\\/em>en"
the result would be:
["Furnished", "balcony", "garden"]
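The steps above can be gathered into one function, which makes it easier to try further inputs. This is only a sketch: the name emphasizedWords(in:) is hypothetical, and it keeps the answer's assumption that every tagged fragment is immediately followed by word characters, so that the matches pair up.

```swift
import Foundation

// Hypothetical helper (not part of the original answer) wrapping the steps above.
func emphasizedWords(in text: String) throws -> [String] {
    let regex = try NSRegularExpression(
        pattern: "(?<=<em>)\\w+(?=<\\\\/em>)|(?<=<\\\\/em>)\\w+")
    let nsRange = NSRange(text.startIndex..., in: text)
    // Convert each UTF-16 based NSRange to a Range<String.Index> before slicing.
    let pieces: [String] = regex.matches(in: text, range: nsRange).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
    // Pair pieces (0,1), (2,3), ...: the part inside the tag plus the part after it.
    return stride(from: 0, to: pieces.count - 1, by: 2).map { pieces[$0] + pieces[$0 + 1] }
}

let words = try emphasizedWords(in: "<em>Furni<\\/em>shed <em>balc<\\/em>ony <em>gard<\\/em>en")
print(words) // ["Furnished", "balcony", "garden"]
```

If a closing tag is followed by a space or punctuation instead of letters, the pairing shifts and the results will be wrong, so inputs of that kind would need a different strategy.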
Not a regex, but for obtaining all the words inside tags, e.g. [Furni, sma]:
let text = "Fully <em>Furni<\\/em>shed <em>sma<\\/em>shed |Downtown and Canal Views"
let emphasizedParts = text.components(separatedBy: "<em>").filter { $0.contains("<\\/em>") }.compactMap { $0.components(separatedBy: "<\\/em>").first }
For full words, e.g. [Furnished, smashed]:
let emphasizedParts = text.components(separatedBy: " ").filter { $0.contains("<em>") }.map { $0.replacingOccurrences(of: "<\\/em>", with: "").replacingOccurrences(of: "<em>", with: "") }
Regex:
If you want to achieve that by regex, you can use Valexa's answer:
public extension String {
    func capturedGroups(withRegex pattern: String) -> [String] {
        var results = [String]()
        let regex: NSRegularExpression
        do {
            regex = try NSRegularExpression(pattern: pattern, options: [])
        } catch {
            return results
        }
        // NSRange is measured in UTF-16 code units, so build it from the string's indices rather than from count
        let matches = regex.matches(in: self, options: [], range: NSRange(startIndex..., in: self))
        guard let match = matches.first else { return results }
        let lastRangeIndex = match.numberOfRanges - 1
        guard lastRangeIndex >= 1 else { return results }
        for i in 1...lastRangeIndex {
            let capturedGroupIndex = match.range(at: i)
            let matchedString = (self as NSString).substring(with: capturedGroupIndex)
            results.append(matchedString)
        }
        return results
    }
}
like this:
let text = "Fully <em>Furni</em>shed |Downtown and Canal Views"
print(text.capturedGroups(withRegex: "<em>([a-zA-Z]+)</em>"))
result:
["Furni"]
NSAttributedString:
If you want to do some highlighting, you only need to get rid of the tags, or for any other reason you can't use the first solution, you can also do that using NSAttributedString:
extension String {
    var attributedStringAsHTML: NSAttributedString? {
        do {
            return try NSAttributedString(data: Data(utf8),
                                          options: [
                                            .documentType: NSAttributedString.DocumentType.html,
                                            .characterEncoding: String.Encoding.utf8.rawValue],
                                          documentAttributes: nil)
        }
        catch {
            print("error: ", error)
            return nil
        }
    }
}
func getTextSections(_ text: String) -> [String] {
    guard let attributedText = text.attributedStringAsHTML else {
        return []
    }
    var sections: [String] = []
    let range = NSMakeRange(0, attributedText.length)
    // we don't need to enumerate any special attribute here,
    // but for example, if you want to just extract links you can use `NSAttributedString.Key.link` instead
    let attribute: NSAttributedString.Key = .init(rawValue: "")
    attributedText.enumerateAttribute(attribute,
                                      in: range,
                                      options: .longestEffectiveRangeNotRequired) { attribute, range, pointer in
        let text = attributedText.attributedSubstring(from: range).string
        sections.append(text)
    }
    return sections
}
let text = "Fully <em>Furni</em>shed |Downtown and Canal Views"
print(getTextSections(text))
result:
["Fully ", "Furni", "shed |Downtown and Canal Views"]
Here is a basic implementation in PHP (yes, I know you asked about Swift, but it's to demonstrate the regex part):
<?php
$in = "Fully <em>Furni</em>shed |Downtown and Canal Views";
$m = preg_match("/<([^>]+)>([^>]+)<\/\\1>([^ ]+|$)/i", $in, $t);
$s = $t[2] . $t[3];
echo $s;
Output:
ZC-MGMT-04:~ jv$ php -q regex.php
Furnished
Obviously, the most important bit is the regular expression, which matches any tag, finds the respective closing tag, and captures the remainder of the word afterward.
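Since the question is about Swift, here is a rough port of the same regular expression to NSRegularExpression. It is a sketch rather than a tested translation of the PHP script: the variable names are illustrative, and the input uses the plain </em> closing tag, as in the PHP example.

```swift
import Foundation

let input = "Fully <em>Furni</em>shed |Downtown and Canal Views"
// Same idea as the PHP pattern: group 1 is the tag name, \1 requires the matching closing tag,
// group 2 is the tagged text and group 3 is the rest of the word (or the end of the string).
let pattern = "<([^>]+)>([^>]+)</\\1>([^ ]+|$)"
let regex = try NSRegularExpression(pattern: pattern, options: [.caseInsensitive])

var word = ""
if let match = regex.firstMatch(in: input, range: NSRange(input.startIndex..., in: input)),
   let inside = Range(match.range(at: 2), in: input),
   let after = Range(match.range(at: 3), in: input) {
    // Join the text inside the tag with the text that follows it.
    word = String(input[inside]) + String(input[after])
}
print(word) // Furnished
```

The slash needs no escaping here because NSRegularExpression patterns are not wrapped in / delimiters the way PHP patterns are.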
If you just want to extract the text between the <em> and <\/em> tags (note these are not normal HTML tags, which would have been <em> and </em>), we can simply match this pattern and replace it with the value captured in group 1. We don't need to worry about what surrounds the match; we just replace it with whatever was captured between the tags, which could even be an empty string, because the OP hasn't mentioned any constraint on that. The regex for matching this pattern would be:
<em>(.*?)<\\\/em>
Or, to be more robust in handling optional spaces anywhere within the tags (as someone pointed out in the comments on other answers), we can use this regex:
<\s*em\s*>(.*?)<\s*\\\/em\s*>
And replace it with \1 or $1, depending on where you are doing it. Whether the tags contain an empty string or some actual text doesn't really matter, as shown in my demo on regex101.
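In Swift, the replacement described here could look roughly like the sketch below. It assumes the space-tolerant pattern and uses $1, which is the template syntax replacingOccurrences(of:with:options:) accepts with .regularExpression; the variable names are illustrative.

```swift
import Foundation

let input = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"
// The space-tolerant regex from above, with backslashes doubled for the Swift string literal.
// The slash itself needs no escaping because there are no / delimiters here.
let pattern = "<\\s*em\\s*>(.*?)<\\s*\\\\/em\\s*>"
// "$1" in the replacement template refers to capture group 1.
let output = input.replacingOccurrences(of: pattern, with: "$1", options: .regularExpression)
print(output) // Fully Furnished |Downtown and Canal Views
```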
Let me know if this meets your requirements, or if any requirement remains unsatisfied.
I highly recommend the use of regex capture groups.
let capturePattern = "(?<=<em>)(?<data1>\\w+)(?=<\\\\/em>)|(?<=<\\\\/em>)(?<data2>\\w+)"
let captureRegex = try! NSRegularExpression(
    pattern: capturePattern,
    options: []
)
// The backslash must be escaped in a Swift string literal ("\/" alone does not compile)
let textInput = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"
let textInputRange = NSRange(
    textInput.startIndex..<textInput.endIndex,
    in: textInput
)
let matches = captureRegex.matches(
    in: textInput,
    options: [],
    range: textInputRange
)
guard let match = matches.first else {
    // Handle exception
    throw NSError(domain: "", code: 0, userInfo: nil)
}
let data1Range = match.range(withName: "data1")
// Extract the substring matching the named capture group
if let substringRange = Range(data1Range, in: textInput) {
    let capture = String(textInput[substringRange])
    print(capture)
}
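The data2 group is read the same way, but because the two groups sit in different alternatives of the pattern, data2 is captured by the second match rather than the first:
if matches.count > 1 {
    let data2Range = matches[1].range(withName: "data2")
    if let substringRange = Range(data2Range, in: textInput) {
        let capture = String(textInput[substringRange])
        print(capture)
    }
}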
The main advantage of this method is that it does not depend on group indices, so the calling code is less tightly coupled to the exact shape of the regular expression.
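Because the two named groups sit in different alternatives of the pattern, any single match fills only one of them. A minimal sketch that walks all matches and collects whichever group took part in each; the array names are illustrative, not from the answer:

```swift
import Foundation

let pattern = "(?<=<em>)(?<data1>\\w+)(?=<\\\\/em>)|(?<=<\\\\/em>)(?<data2>\\w+)"
let regex = try NSRegularExpression(pattern: pattern)
let input = "Fully <em>Furni<\\/em>shed |Downtown and Canal Views"

var inside: [String] = []   // captures of data1 (text inside the tags)
var after: [String] = []    // captures of data2 (text right after the closing tag)
for match in regex.matches(in: input, range: NSRange(input.startIndex..., in: input)) {
    // The group that did not take part has location NSNotFound,
    // so Range(_:in:) returns nil for it and it is skipped.
    if let r = Range(match.range(withName: "data1"), in: input) { inside.append(String(input[r])) }
    if let r = Range(match.range(withName: "data2"), in: input) { after.append(String(input[r])) }
}
print(inside, after) // ["Furni"] ["shed"]
```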