I'm using this code:
Pattern pat_1 = Pattern.compile("class=\"\"pinyin\"\">(.*?)<script>");
Matcher mat_1 = pat_1.matcher( text );
while( mat_1.find() )
{
    System.out.println( mat_1.group(1) );
}
This is the input data being matched:
<br>
<span class=""b"">拼音:</span><span class=""pinyin"">xī<script>Setduyin('Duyin/xi1')</script></span> <span class=""b"">注音:</span><span class=""pinyin"">ㄒㄧ<script>Setduyin('Duyin/xi1')</script></span><br>
<span class=""b"">简体部首:</span>丨 <span class=""b"">部首笔画:</span>1 <span class=""b"">总笔画:</span>8<br><span class=""b"">繁体部首:</span>卜 <span class=""b"">部首笔画:</span>2 <span class=""b"">总笔画:</span>8<br><span class=""b"">康熙字典笔画</span>( 卥:8; )
The problem with my code is that it also picks up ㄒㄧ because the preceding and following elements are identical. How could I exclude ㄒㄧ and select only xī? Maybe I can use the <br> tag, because that is unique to the first one, but that requires matching across a newline and also skipping over 拼音:. How can I do that? I've been playing around with regex101.com but haven't been able to pin it down yet.
So, to be clear, right now the output of that Java code is
xī
ㄒㄧ
but I want it only to be
xī
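One direction that might work, as a rough sketch rather than a confirmed solution: instead of relying on the <br> tag, anchor the pattern on the 拼音: label that sits directly in front of the first pinyin span, so the 注音: span no longer matches. The class name PinyinExtractor and the method extractPinyin are made up for illustration, and the sketch assumes the input always has exactly the shape shown above (doubled quotes included, and the label immediately followed by the pinyin span).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PinyinExtractor {
    // Assumption: the label 拼音: is immediately followed by the pinyin span,
    // and the doubled quotes ("") appear in the input exactly as shown above.
    private static final Pattern PINYIN = Pattern.compile(
        "拼音:</span><span class=\"\"pinyin\"\">(.*?)<script>");

    // Returns the text of the span that follows 拼音:, or null if there is none.
    public static String extractPinyin(String text) {
        Matcher m = PINYIN.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text =
            "<span class=\"\"b\"\">拼音:</span><span class=\"\"pinyin\"\">xī"
            + "<script>Setduyin('Duyin/xi1')</script></span> "
            + "<span class=\"\"b\"\">注音:</span><span class=\"\"pinyin\"\">ㄒㄧ"
            + "<script>Setduyin('Duyin/xi1')</script></span><br>";
        System.out.println(extractPinyin(text));
    }
}
```

If the pinyin span is always the first match in the text, replacing while with if in the original loop should also print only the first result, though that depends on the order of the spans rather than on the label.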