Parse the HTML code or use regex with Java?

Question

I'm trying to extract values from this piece of HTML code:

<ul id="tree-dotlrn_class_instance">
<li>
      <a href="/dotlrn/classes/c033/13000/c12c033a13000gA/">**2011-12 Ampl.Arquit.Computadors Gr.A  (13000)**</a>
<ul>
    <li>
        <a href="/dotlrn/classes/c033/13022/c12c033a13022gA/c12c033a13022gAsT00/">**2011-12 Entorns d'Usuari Gr.A  Sgr.T00 (13022)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13036/c12c033a13036gA/c12c033a13036gAsT00/">**2011-12 Eng.Serv.Telemàtics Gr.A  Sgr.T00 (13036)** </a>
    </li>
</ul>
</li>

<li>
      <a href="/dotlrn/classes/c033/13038/c12c033a13038gA/">**2011-12 Intel·lig.Artif.Enginyer.Coneixem. Gr.A  (13038)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/">**2011-12 Processad.Llenguatge Gr.A  (13048)**</a>
<ul>
    <li>
        <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsL01/">**2011-12 Processad.Llenguatge Gr.A  Sgr.L01 (13048)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsT00/">**2011-12 Processad.Llenguatge Gr.A  Sgr.T00 (13048)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13052/c12c033a13052gA/c12c033a13052gAsL02/">**2011-12 Sist.Basats Microprocessadors Gr.A  Sgr.L02 (13052)** </a>
    </li>
</ul>
</li>

<li>
      <a href="/dotlrn/classes/c033/13055/c12c033a13055gAA/">**2011-12 Sist.Informàtics Gr.AA (13055)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/14009/c12c033a14009gA/">**2011-12 Administrac. Gestió de Xarxes Gr.A  (14009)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/15656/c12c033a15656gA/">**2011-12 Transmissió de Dades Gr.A**  (15656)</a>        
</li>
</ul>

I want everything in strong black (that is, bold: the text between the ** markers), together with its href value, stored in a HashMap. I first tried the Jericho HTML Parser, but I found it too complicated. Then I tried a regex, but I don't know exactly how to write it. Can you help me?

Thanks!

Update: I'm trying this, but it's not the right way.

Source s = new Source(answer);
List<Element> Form1 = s.getAllElements(HTMLElementName.UL);
int tam1 = Form1.size();
for (int j = 0; j < tam1; j++) {
    Element e1 = Form1.get(j);
    if ("tree-dotlrn_class_instance".equals(e1.getAttributeValue("id"))) {
        List<Element> L1 = e1.getAllElements(HTMLElementName.UL);
        for (int k = 0; k < L1.size(); k++) {
            Element e2 = L1.get(k);
            System.out.println("Elemento de la lista L1: " + e2.getContent());
            List<Element> L2 = e2.getAllElements(HTMLElementName.LI);
            for (int m = 0; m < L2.size(); m++) {
                Element e3 = L2.get(m);
                System.out.println("Elemento de la lista L2: " + e3.getContent());
                asignaturas.add(e3.getContent().toString());
                System.out.println("Lista de asignaturas " + m + " " + asignaturas.get(0));
            }
        }
    }
}

[Never parse HTML/XML with regexes](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) – m0skit0, Jan 08 '13 at 16:41
There's nothing in **strong black** in your `**code block**`. Next, HTML is not [Regular](http://en.wikipedia.org/wiki/Regular_language) so you can't use *Regular* Expressions to parse it reliably. – Richard JP Le Guen, Jan 08 '13 at 16:42
Just read @m0skit0's link. But there is one valid, very narrow case: the HTML is generated on the fly by some application, and you either own the application or otherwise know when its output changes (or you only need it for a week and assume it won't change in that time). Then you can parse what will be well-formed HTML. It's a pretty selective case, but the example HTML seems to fall into it. It just depends. – Lee Meador, Jan 08 '13 at 17:06
Nothing appears in **strong black** because it's code, where that formatting doesn't apply, but I put the ** around the parts of the code I want to extract. OK, I have now decided to use an HTML parser. How can I do it? I want every href together with the text associated with it in the browser view. – Carlos del Blanco, Jan 08 '13 at 17:26
@LeeMeador and you will keep updating the Java app each time the output of the other app changes? Poor foresight IMHO. Do it right the first time and you're done. – m0skit0, Jan 08 '13 at 22:02
Fair enough. I think I qualified my description of when it might be useful. It's pretty limited. If it changes much, as you point out, a regex would sign you up for a lot of ongoing work. – Lee Meador, Jan 08 '13 at 22:05

nicholas.hauschild · Answer 1 · 2013-01-08T16:48:26.927

score 5

Take a look at JSoup's selector syntax.

If you are looking for all `a` elements with an `href` attribute, you can find them like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String theHtmlInYourExample = "...";
Document doc = Jsoup.parse(theHtmlInYourExample);
Elements links = doc.select("a[href]");

From there, you should be able to extract the text of the element and the value of the href attribute to create your HashMap.
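
As a rough sketch of that last step, the selector can also be narrowed to the list from the question by its id, so that only links inside `ul#tree-dotlrn_class_instance` are collected. The class name `CourseLinks`, the method `extract`, and the choice of the href as the map key are assumptions made for illustration, and the sample HTML is a shortened copy of the question's markup without the ** markers. It needs the jsoup library on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CourseLinks {

    // Returns href -> visible link text for every <a href> found inside
    // <ul id="tree-dotlrn_class_instance">, nested sub-lists included.
    // Keying by href is an assumption; swap key and value if you prefer.
    static Map<String, String> extract(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> links = new LinkedHashMap<>();
        for (Element a : doc.select("ul#tree-dotlrn_class_instance a[href]")) {
            // text() returns the text with whitespace normalized and trimmed
            links.put(a.attr("href"), a.text());
        }
        return links;
    }

    public static void main(String[] args) {
        String html =
              "<ul id=\"tree-dotlrn_class_instance\">"
            + "<li><a href=\"/dotlrn/classes/c033/13000/c12c033a13000gA/\">"
            + "2011-12 Ampl.Arquit.Computadors Gr.A  (13000)</a>"
            + "<ul><li><a href=\"/dotlrn/classes/c033/13022/c12c033a13022gA/c12c033a13022gAsT00/\">"
            + "2011-12 Entorns d'Usuari Gr.A  Sgr.T00 (13022) </a></li></ul>"
            + "</li></ul>";

        for (Map.Entry<String, String> e : extract(html).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

A `LinkedHashMap` is used only to keep the links in document order; a plain `HashMap` works the same way if order does not matter.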

edited Jan 08 '13 at 16:48

answered Jan 08 '13 at 16:39

nicholas.hauschild

Not all the elements, only the elements in this list – Carlos del Blanco Jan 08 '13 at 17:30
The beautiful part of JSoup is that you can use the selector syntax to do just that! Take a look at the link, it should provide plenty of details to get you further than my small example. – nicholas.hauschild Jan 08 '13 at 17:31

score 0 · Answer 2 · answered Jan 08 '13 at 16:52

Regex:

\<a\s+href\s*\=\s*["']/dotlrn/classes/c033.+\>(.*)\(\d+\)\</a\>

Java String:

"\\<a\\s+href\\s*\\=\\s*[\"']/dotlrn/classes/c033.+\\>(.*)\\(\\d+\\)\\</a\\>"

It probably won't be reliable, but if the pages match what you supplied, the first capturing group will hold the link text that precedes the number in parentheses.

Here is a place to test Java regular expressions
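
For illustration only, here is a variant of the pattern above that captures the href as well as the link text and tolerates whitespace before `</a>`. It is a sketch under two assumptions: `href` is the first attribute of each `a` tag, and the link text contains no nested tags. The class name `RegexLinks` and the method `extract` are made up for the example. The caveats from the comments still apply: it breaks as soon as the generated markup changes shape.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinks {

    // Group 1: href value starting with /dotlrn/classes/.
    // Group 2: link text, assumed to contain no '<' (no nested tags).
    private static final Pattern LINK = Pattern.compile(
        "<a\\s+href\\s*=\\s*[\"'](/dotlrn/classes/[^\"']*)[\"']\\s*>([^<]*)</a>");

    // Returns href -> trimmed link text for every match in the input.
    static Map<String, String> extract(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.put(m.group(1), m.group(2).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        String html =
              "<li>\n"
            + "  <a href=\"/dotlrn/classes/c033/13038/c12c033a13038gA/\">"
            + "2011-12 Intel·lig.Artif.Enginyer.Coneixem. Gr.A  (13038)</a>\n"
            + "</li>";
        System.out.println(extract(html));
    }
}
```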

score 0 · Answer 3 · answered Jan 08 '13 at 16:54

Why not use the DOM API? You can get attributes and values fairly trivially with it.
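
A minimal sketch of what that could look like with the JDK's built-in `javax.xml.parsers` DOM parser. It only works if the input is well-formed XML (every tag closed, no HTML-only entities such as `&nbsp;`), which is an assumption about the page. The class name `DomLinks`, the method `extract`, and the `listId` parameter are hypothetical names for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DomLinks {

    // Returns href -> trimmed link text for each <a href> inside the <ul>
    // whose id attribute equals listId. The input must be well-formed XML.
    static Map<String, String> extract(String wellFormedHtml, String listId) {
        Document doc;
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(wellFormedHtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }

        Map<String, String> links = new LinkedHashMap<>();
        NodeList lists = doc.getElementsByTagName("ul");
        for (int i = 0; i < lists.getLength(); i++) {
            Element ul = (Element) lists.item(i);
            if (!listId.equals(ul.getAttribute("id"))) {
                continue; // not the list we are looking for
            }
            // getElementsByTagName on an element returns all descendants
            NodeList anchors = ul.getElementsByTagName("a");
            for (int j = 0; j < anchors.getLength(); j++) {
                Element a = (Element) anchors.item(j);
                if (a.hasAttribute("href")) {
                    links.put(a.getAttribute("href"), a.getTextContent().trim());
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html =
              "<ul id=\"tree-dotlrn_class_instance\"><li>"
            + "<a href=\"/dotlrn/classes/c033/13055/c12c033a13055gAA/\">"
            + "2011-12 Sist.Informàtics Gr.AA (13055)</a>"
            + "</li></ul>";
        System.out.println(extract(html, "tree-dotlrn_class_instance"));
    }
}
```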

ldam

score 0 · Answer 4 · answered Jan 08 '13 at 16:57

You can certainly try XML pull parsing or DOM, provided that the input HTML is well-formed.
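
To make the pull-parsing option concrete, here is a sketch that uses StAX (`javax.xml.stream`), the pull-style parser bundled with the JDK. StAX is a substitution on my part, since no specific pull-parsing library is named above. The class name `PullLinks` and the method `extract` are hypothetical, and the code assumes well-formed input in which each `a` element contains text only.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PullLinks {

    // Walks the document event by event and returns href -> trimmed link
    // text for every <a> element that carries an href attribute.
    static Map<String, String> extract(String wellFormedHtml) {
        Map<String, String> links = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(wellFormedHtml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "a".equals(reader.getLocalName())) {
                    // Read the attribute first: getElementText() moves the
                    // cursor to the closing </a> tag.
                    String href = reader.getAttributeValue(null, "href");
                    // Throws XMLStreamException if <a> has child elements.
                    String text = reader.getElementText().trim();
                    if (href != null) {
                        links.put(href, text);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html =
              "<ul id=\"tree-dotlrn_class_instance\"><li>"
            + "<a href=\"/dotlrn/classes/c033/14009/c12c033a14009gA/\">"
            + "2011-12 Administrac. Gestió de Xarxes Gr.A  (14009)</a>"
            + "</li></ul>";
        System.out.println(extract(html));
    }
}
```

Compared with DOM, this never builds the whole tree in memory, which only matters for large pages. For a list this small, either approach is fine.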

Waleed Almadanat
