Exclusively apply Java Pattern/Matcher to extract HTML elements, ignore some characters

Question

I'm using this code:

Pattern pat_1 = Pattern.compile("class=\"\"pinyin\"\">(.*?)<script>");
Matcher mat_1 = pat_1.matcher( text );
while( mat_1.find() )
{
    System.out.println( mat_1.group(1) );
}

This is the input data being matched:

<br>
<span class=""b"">拼音：</span><span class=""pinyin"">xī<script>Setduyin('Duyin/xi1')</script></span>　<span class=""b"">注音：</span><span class=""pinyin"">ㄒㄧ<script>Setduyin('Duyin/xi1')</script></span><br>
<span class=""b"">简体部首：</span>丨　<span class=""b"">部首笔画：</span>1　<span class=""b"">总笔画：</span>8<br><span class=""b"">繁体部首：</span>卜　<span class=""b"">部首笔画：</span>2　<span class=""b"">总笔画：</span>8<br><span class=""b"">康熙字典笔画</span>( 卥:8； )

The problem with my code is that it also picks up ㄒㄧ, because the elements before and after it are identical to those around xī. How can I exclude ㄒㄧ and select only xī? Maybe I could use the <br> tag, because it is unique to the first one, but that would require detecting the new line and also skipping 拼音：. How can I do that? I've been experimenting on regex101.com but I haven't been able to pin it down yet.

To be clear, the current output of that Java code is

xī
ㄒㄧ

but I want it to be only

xī

Avoid parsing HTML with a regex. Read [this answer](http://stackoverflow.com/a/6752487/4125191) if you want to understand why. You also didn't specify what makes "TI" bad. What happens if you have "ABC" or something like that? – RealSkeptic Feb 12 '15 at 07:12
Do you need to extract the strings other than "ㄒㄧ" only in this particular relative arrangement (xi (or yi or...) before T-)? Or could the sequence also be "T-", "xi", "yi", "T-"? – laune Feb 12 '15 at 07:19
What do you mean "the text that comes just after that ``? Everything to the end of the string? – laune Feb 12 '15 at 07:21
@Yamada please don't update your question 5 mins after posting. – Avinash Raj Feb 12 '15 at 07:22
@RealSkeptic the whole reason I'm doing it this way is that XPath failed me. I originally tried very hard to extract these elements with XPath but nothing worked. – Feb 12 '15 at 07:23
@AvinashRaj I see that is bad, I'll refrain from it in the future. – Feb 12 '15 at 07:25
@AvinashRaj Can you show a reference why that is not allowed? Usually we *ask* the users to update their questions to improve them. – RealSkeptic Feb 12 '15 at 07:26
@RealSkeptic An edit within the 5-minute grace period wouldn't matter. I first answered his original question, and the OP said it was working, so I moved on to the next question. After some time, I received a downvote because the OP had updated the question and my answer no longer worked. It's a kind of **** – Avinash Raj Feb 12 '15 at 07:29
@RealSkeptic I think it's because my update was highly tangential to the original topic. – Feb 12 '15 at 07:30
@AvinashRaj Who downvoted you? I upvoted you and accepted your answer. Sorry to have caused you frustration. – Feb 12 '15 at 07:32

Avinash Raj · Accepted Answer · 2015-02-12T07:46:10.867

1

You could try the regex below.

Pattern pat_1 = Pattern.compile("class=\"\"pinyin\"\">(.*?)<script>(?:(?!<script>).)*");


OR

Pattern pat_1 = Pattern.compile("(?m)^.*?class=\"\"pinyin\"\">(.*?)<script>");

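(?m) is the multiline modifier: it makes the anchors ^ and $ match at the start and end of every line instead of only at the start and end of the whole input. Because . does not match a line break by default, ^.*? can only reach the first class=""pinyin"" on each line, so the second one (ㄒㄧ) is never captured.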
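
To compare the two patterns side by side, here is a minimal sketch that runs them against input shaped like the data in the question. The class name PinyinExtract, the helpers record and extract, and the second record (yī / ㄧ) are hypothetical and only serve the illustration; the sketch assumes every record sits on its own line and contains exactly two pinyin spans, as in the sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PinyinExtract {
    // Pattern from the question: captures every pinyin span.
    static final Pattern ORIGINAL = Pattern.compile(
            "class=\"\"pinyin\"\">(.*?)<script>");

    // First pattern from the answer: after the capture, keep consuming
    // characters until the next "<script>" (or the end of the line),
    // which swallows the second pinyin span's opening tag and text.
    static final Pattern TEMPERED = Pattern.compile(
            "class=\"\"pinyin\"\">(.*?)<script>(?:(?!<script>).)*");

    // Second pattern from the answer: anchored to the start of each line,
    // so at most one match can begin per line.
    static final Pattern FIRST_PER_LINE = Pattern.compile(
            "(?m)^.*?class=\"\"pinyin\"\">(.*?)<script>");

    // Hypothetical helper: builds one line in the shape of the question's data.
    static String record(String pinyin, String zhuyin) {
        return "<span class=\"\"b\"\">拼音：</span><span class=\"\"pinyin\"\">" + pinyin
             + "<script>Setduyin('Duyin/xi1')</script></span>　<span class=\"\"b\"\">注音：</span>"
             + "<span class=\"\"pinyin\"\">" + zhuyin
             + "<script>Setduyin('Duyin/xi1')</script></span><br>";
    }

    // Collects group(1) of every match of p in text.
    static List<String> extract(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two records, one per line; the second one is made up for illustration.
        String text = "<br>\n" + record("xī", "ㄒㄧ") + "\n" + record("yī", "ㄧ") + "\n";
        System.out.println(extract(ORIGINAL, text));       // four captures: xī, ㄒㄧ, yī, ㄧ
        System.out.println(extract(TEMPERED, text));       // two captures: xī, yī
        System.out.println(extract(FIRST_PER_LINE, text)); // two captures: xī, yī
    }
}
```

With several records in the input, the while loop is still useful because each line contributes one match. If the input only ever holds a single record, a plain if (m.find()) is enough.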


edited Feb 12 '15 at 07:46

answered Feb 12 '15 at 07:04


Incredible, exactly perfect. So does (?: mean "ignore"? – Feb 12 '15 at 07:06
@laune what do you mean by that? – Avinash Raj Feb 12 '15 at 07:17
but the data won't have that, "T-" will always come after "xi" – Feb 12 '15 at 07:19
@Yamada_Tarō Then why do you use "while"??? A simple if will get you the xi! Or can there be multiple "xi" before the "T-"??? – laune Feb 12 '15 at 07:22
Because there is a lot of data, all with the same structure, and I want every equivalent of 'xi'. Maybe the next one is 'yi', but it will still come before "T/". Do you know what I mean? – Feb 12 '15 at 07:24
Exactly one xi or yi or zi before the one "T-"? Then use if! Not while! – laune Feb 12 '15 at 07:25
@laune you're helping me to see that and improve myself. thank you. – Feb 12 '15 at 07:33
@AvinashRaj I have tried a modified text with "xi"-"xi"-"T-" and "T-"-"xi"-"T-", with your pattern, not knowing that the task is simply to extract *only the first text string* between pinyin and script. I'll remove my answer, cancel your downvote (although no complex regexes are required) and downvote the Q. – laune Feb 12 '15 at 07:35
