Java regex, jsoup

Question

How can I extract the following values using regex or jsoup? 19040172b-1; SQL Server Develop; zheng; 3-5,7-14; D-101

<div id="AE9D7F630640426F8457A661607D2B8E-5-2" style="display: none;" class="kbcontent">
  19040172b-1
  <br>SQL Server Develop
  <br>
  <font title="teacher">zheng</font>
  <br>
  <font title="week">3-5,7-14</font>
  <br>
  <font title="classroom">D-101</font>
  <br>
 </div>

I have tried the following two approaches, but neither worked.

1. Pattern pattern = Pattern.compile(">(.*?)<br>");

2. Elements msg = doc.select(":matchesOwn([>.*?<br>])");

score 1 · Answer 1 · edited May 23 '17 at 11:51

1) First, it's never a good idea to parse HTML with a regex. You can read more about that here.

2) You can just take all the text between the tags with jsoup:

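import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// file is a java.io.File holding the HTML; charsetName is its encoding, e.g. "UTF-8".
Document doc = Jsoup.parse(file, charsetName);
String text = doc.text();
System.out.println(text);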
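
Note that doc.text() returns the text of the whole document as one whitespace-normalized string. If the page contains more than this one block, it may help to narrow down to the element first. A minimal sketch, assuming the HTML is already in a String and that the block can be found by the class kbcontent from the question (the class name KbContentText and the method blockText are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class KbContentText {
    // Returns the text of the first div.kbcontent, or "" if the page has none.
    static String blockText(String html) {
        Document doc = Jsoup.parse(html);
        Element block = doc.selectFirst("div.kbcontent");
        return block == null ? "" : block.text();
    }

    public static void main(String[] args) {
        // Shortened sample: one unrelated paragraph plus part of the block from the question.
        String html = "<p>unrelated</p><div class=\"kbcontent\">19040172b-1<br>SQL Server Develop<br>"
                + "<font title=\"teacher\">zheng</font><br></div>";
        System.out.println(blockText(html));
    }
}
```

The values come back separated only by spaces, so a multi-word value such as "SQL Server Develop" cannot be told apart from its neighbours by splitting on whitespace alone.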

answered Sep 08 '16 at 08:46

dakatamen

score 0 · Accepted Answer · answered Sep 08 '16 at 08:31

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<div id=\"AE9D7F630640426F8457A661607D2B8E-5-2\" style=\"display: none;\" class=\"kbcontent\">  19040172b-1  <br>SQL Server Develop  <br>  <font title=\"teacher\">zheng</font>  <br>  <font title=\"week\">3-5,7-14</font>  <br>  <font title=\"classroom\">D-101</font>  <br> </div> ";
// Swap each <br> for a marker that survives text extraction (literal replace, no regex needed).
html = html.replace("<br>", "#~#");
Document doc = Jsoup.parse(html);
String newHtml = doc.text();
// Split on the marker and drop the whitespace around it.
String[] ary = newHtml.split("\\s*#~#\\s*");

This will do the job, though there may be cleaner ways to handle the br tags.
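
One possibly cleaner way avoids the marker string and reads the pieces from the parsed tree. This is only a sketch under the assumptions of the sample above: the first two values are bare text directly inside the div, and the rest sit in font elements that carry a title attribute. The names KbContentFields and fields are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import java.util.ArrayList;
import java.util.List;

class KbContentFields {
    static List<String> fields(String html) {
        List<String> out = new ArrayList<>();
        Element block = Jsoup.parse(html).selectFirst("div.kbcontent");
        if (block == null) {
            return out; // no such block on the page
        }
        // Text sitting directly inside the div (course code, course name);
        // whitespace-only nodes between the tags are skipped.
        for (TextNode node : block.textNodes()) {
            String t = node.text().trim();
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        // Values wrapped in <font title="...">: teacher, week, classroom.
        for (Element font : block.select("font[title]")) {
            out.add(font.text());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div class=\"kbcontent\"> 19040172b-1 <br>SQL Server Develop <br>"
                + " <font title=\"teacher\">zheng</font> <br>"
                + " <font title=\"week\">3-5,7-14</font> <br>"
                + " <font title=\"classroom\">D-101</font> <br> </div>";
        for (String field : fields(html)) {
            System.out.println(field);
        }
    }
}
```

The bare text is collected before the font values, which matches document order only because the sample happens to be laid out that way. If a single field is wanted, a selector such as font[title=teacher] picks it out directly.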
