How can I get the text between two constant strings?
Example:
<rate curr="KRW" unit="100">19,94</rate>
19,94
is between
"<rate curr="KRW" unit="100">"
and
"</rate>"
Other example:
ABCDEF
getting substring between
AB
and EF
= CD
Try with:
/<rate[^>]*>(.*?)<\/rate>/
However it is better NOT TO USE REGEX WITH HTML.
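For reference, here is a minimal Java sketch of that pattern. The class and method names (RateExtractor, extractRates) are made up for illustration, and it assumes the tags are never nested and the value contains no line breaks (by default . does not match newlines).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RateExtractor {
    // Matches any <rate ...> opening tag, then lazily captures up to the first </rate>.
    private static final Pattern RATE = Pattern.compile("<rate[^>]*>(.*?)</rate>");

    // Returns the captured text of every <rate> element found in the input.
    static List<String> extractRates(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = RATE.matcher(input);
        while (m.find()) {
            values.add(m.group(1)); // group 1 is the text between the tags
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "<rate curr=\"KRW\" unit=\"100\">19,94</rate>";
        System.out.println(extractRates(input)); // prints [19,94]
    }
}
```

Because [^>]* accepts any attributes, this version also picks up rates for other currencies. Spell the attributes out literally if only the KRW tag should match.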
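The way I do it is to collect all the matches:
var matched = Regex.Matches(result, @"(?<=<rate curr=""KRW"" unit=""100"">)(.*?)(?=</rate>)");
Then get them one by one using matched[i].Groups[1].Value (inside a verbatim @"..." string, a double quote is escaped by doubling it, not with a backslash).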
If you're analyzing HTML, you're probably better off going with JavaScript and the element's .innerHTML property. Regex is a bit overkill.
The simple regex matching string you're looking for is:
(?<=<rate curr=\"KRW\" unit=\"100\">)(.*?)(?=</rate>)
In Ruby, for example, this would translate to:
string = '<rate curr="KRW" unit="100">19,94</rate>'
string.match("(?<=<rate curr=\"KRW\" unit=\"100\">)(.*?)(?=</rate>)").to_s
# => "19,94"
Thanks to Will Yu.
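The same lookaround pattern can be sketched in Java as well. Since the lookbehind and lookahead only assert that the tags are present and consume nothing, the whole match (group()) is already the inner text. LookaroundDemo and firstBetween are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // The lookbehind requires the opening tag just before the match and the
    // lookahead requires the closing tag just after it; neither is part of
    // the matched text.
    private static final Pattern BETWEEN = Pattern.compile(
            "(?<=<rate curr=\"KRW\" unit=\"100\">)(.*?)(?=</rate>)");

    // Returns the first text found between the two tags, or null if there is none.
    static String firstBetween(String input) {
        Matcher m = BETWEEN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<rate curr=\"KRW\" unit=\"100\">19,94</rate>";
        System.out.println(firstBetween(input)); // prints 19,94
    }
}
```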
If you want a generic solution, i.e. to find a string between any two strings, you can use Pattern.quote() (or wrap each string in \Q and \E) to quote the start and end strings, and use (.*?) for a non-greedy match.
See an example of its use in the snippet below:
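import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.junit.Test; // assuming JUnit 4

public class QuoteTextTest { // class name is arbitrary

    @Test
    public void quoteText() {
        String str1 = "<rate curr=\"KRW\" unit=\"100\">";
        String str2 = "</rate>";
        String input = "<rate curr=\"KRW\" unit=\"100\">19,94</rate>"
                + "<rate curr=\"KRW\" unit=\"100\"></rate>"
                + "<rate curr=\"KRW\" unit=\"100\">19,96</rate>";
        // Quote both delimiters so they are matched literally.
        String regex = Pattern.quote(str1) + "(.*?)" + Pattern.quote(str2);
        System.out.println("regex:" + regex);
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(input);
        while (m.find()) {
            String group = m.group(1); // text between the delimiters
            System.out.println("--" + group);
        }
    }
}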
Output
regex:\Q<rate curr="KRW" unit="100">\E(.*?)\Q</rate>\E
--19,94
--
--19,96
Note: Though it's not recommended to use regex to parse entire HTML documents, I think there is no harm in the conscious use of regex when treating HTML as plain text.
I suggest that you use an HTML parser. The grammar that defines HTML is a context-free grammar, which is fundamentally too complex to be parsed by regular expressions. Even if you manage to write a regular expression that achieves what you want, it will probably fail on some corner cases.
For instance, what if you are expected to parse the following HTML?
<rate curr="KRW" unit="100"><rate curr="KRW" unit="100">19,94</rate></rate>
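A regular expression may not handle this corner case properly. A lazy pattern such as /<rate[^>]*>(.*?)<\/rate>/, for example, stops at the first </rate> and captures the inner opening tag along with the value.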
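To make the difference concrete, here is a rough Java sketch that runs both approaches on the nested input. It uses the JDK's built-in XML parser as a stand-in, which only works because this fragment happens to be well-formed XML; real-world HTML would need a dedicated HTML parser. NestedRateDemo, regexCapture and parsedText are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class NestedRateDemo {
    // What the lazy regex captures for its first match.
    static String regexCapture(String input) {
        Matcher m = Pattern.compile("<rate[^>]*>(.*?)</rate>").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Text content of the root element, as seen by an XML parser.
    // Assumes the input is a single well-formed XML element.
    static String parsedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String nested = "<rate curr=\"KRW\" unit=\"100\">"
                + "<rate curr=\"KRW\" unit=\"100\">19,94</rate></rate>";
        System.out.println(regexCapture(nested)); // prints <rate curr="KRW" unit="100">19,94
        System.out.println(parsedText(nested));   // prints 19,94
    }
}
```

The parser understands the nesting, so the text content of the outer element is just the value, whereas the regex returns markup mixed in with it.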