2

I'm trying to extract values from this piece of HTML:

<ul id="tree-dotlrn_class_instance">
<li>
      <a href="/dotlrn/classes/c033/13000/c12c033a13000gA/">**2011-12 Ampl.Arquit.Computadors Gr.A  (13000)**</a>
<ul>
    <li>
        <a href="/dotlrn/classes/c033/13022/c12c033a13022gA/c12c033a13022gAsT00/">**2011-12 Entorns d'Usuari Gr.A  Sgr.T00 (13022)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13036/c12c033a13036gA/c12c033a13036gAsT00/">**2011-12 Eng.Serv.Telemàtics Gr.A  Sgr.T00 (13036)** </a>
    </li>
</ul>
</li>

<li>
      <a href="/dotlrn/classes/c033/13038/c12c033a13038gA/">**2011-12 Intel·lig.Artif.Enginyer.Coneixem. Gr.A  (13038)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/">**2011-12 Processad.Llenguatge Gr.A  (13048)**</a>
<ul>
    <li>
        <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsL01/">**2011-12 Processad.Llenguatge Gr.A  Sgr.L01 (13048)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsT00/">**2011-12 Processad.Llenguatge Gr.A  Sgr.T00 (13048)** </a>
    </li>
    <li>
      <a href="/dotlrn/classes/c033/13052/c12c033a13052gA/c12c033a13052gAsL02/">**2011-12 Sist.Basats Microprocessadors Gr.A  Sgr.L02 (13052)** </a>
    </li>
</ul>
</li>

<li>
      <a href="/dotlrn/classes/c033/13055/c12c033a13055gAA/">**2011-12 Sist.Informàtics Gr.AA (13055)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/14009/c12c033a14009gA/">**2011-12 Administrac. Gestió de Xarxes Gr.A  (14009)**</a>
</li>

<li>
      <a href="/dotlrn/classes/c033/15656/c12c033a15656gA/">**2011-12 Transmissió de Dades Gr.A**  (15656)</a>        
</li>
</ul>

I want to store everything shown in bold (between the `**` markers), together with its href value, in a HashMap. I first tried the Jericho HTML Parser, but I found it too complicated. Then I tried a regex, but I don't know exactly how to write it. Can you help me?

Thanks!

Update: This is what I'm trying now, but it isn't the right approach.

Source s = new Source(answer);
List<Element> Form1 = s.getAllElements(HTMLElementName.UL);
int tam1 = Form1.size();
for (int j = 0; j < tam1; j++) {
    Element e1 = Form1.get(j);
    if ("tree-dotlrn_class_instance".equals(e1.getAttributeValue("id"))) {
        List<Element> L1 = e1.getAllElements(HTMLElementName.UL);
        for (int k = 0; k < L1.size(); k++) {
            Element e2 = L1.get(k);
            System.out.println("Elemento de la lista L1: " + e2.getContent());
            List<Element> L2 = e2.getAllElements(HTMLElementName.LI);
            for (int m = 0; m < L2.size(); m++) {
                Element e3 = L2.get(m);
                System.out.println("Elemento de la lista L2: " + e3.getContent());
                asignaturas.add(e3.getContent().toString());
                System.out.println("Lista de asignaturas " + m + " " + asignaturas.get(0));
            }
        }
    }
}
  • 7
    [Never parse HTML/XML with regexes](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) – m0skit0 Jan 08 '13 at 16:41
  • There's nothing in **strong black** in your `**code block**`. Next, HTML is not [Regular](http://en.wikipedia.org/wiki/Regular_language) so you can't use *Regular* Expressions to parse it reliably. – Richard JP Le Guen Jan 08 '13 at 16:42
  • 1
    Just read @m0skit0's link. But there is a valid case, which is very narrow. That's the case where the HTML is generated on the fly by some application and you either own the application or otherwise know when it changes what it generates (or only need it for the week and assume it won't change this week). Then you can parse what will be well-formed HTML. It's a pretty selective case, but the example HTML seems to fall into it. It just depends. – Lee Meador Jan 08 '13 at 17:06
  • There is nothing in **strong black** because it's code, so that formatting doesn't apply, but I put the ** around the parts of the code I want to extract. OK, I've now decided to use an HTML parser. How can I do it? I want every href together with the text associated with it in the browser view. – Carlos del Blanco Jan 08 '13 at 17:26
  • @LeeMeador and you will keep updating the Java app each time the output of the other app changes? Poor foresight IMHO. Do it right the first time and you're done. – m0skit0 Jan 08 '13 at 22:02
  • Fair enough. I think I qualified my description of when it might be useful. It's pretty limited. If it changes much, as you point out, a regex would sign you up for a lot of ongoing work. – Lee Meador Jan 08 '13 at 22:05
  • See http://stackoverflow.com/a/1732454/218454 – nfechner Feb 05 '13 at 14:50

4 Answers

5

Take a look at JSoup's selector syntax.

If you are looking for all `a` elements with an `href` attribute, you can find them like this:

String theHtmlInYourExample = "...";
Document doc = Jsoup.parse(theHtmlInYourExample);
Elements links = doc.select("a[href]");

From there, you should be able to extract the text of the element and the value of the href attribute to create your HashMap.
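The remaining step could be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name `CourseLinksJsoup` and the helper `extractLinks` are made up for the example, the map is keyed by href on the assumption that hrefs are unique, and a `LinkedHashMap` is used only so that the entries keep document order. It requires the jsoup library on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class CourseLinksJsoup {

    // Hypothetical helper: maps each href to the visible text of its link.
    static Map<String, String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> links = new LinkedHashMap<>();
        // Only look at anchors inside the course tree from the question.
        for (Element a : doc.select("ul#tree-dotlrn_class_instance a[href]")) {
            // attr() returns the raw attribute value; text() returns trimmed text.
            links.put(a.attr("href"), a.text());
        }
        return links;
    }

    public static void main(String[] args) {
        // Shortened sample based on the HTML in the question.
        String html = "<ul id=\"tree-dotlrn_class_instance\"><li>"
                + "<a href=\"/dotlrn/classes/c033/13048/c12c033a13048gA/\">"
                + "2011-12 Processad.Llenguatge Gr.A (13048)</a>"
                + "<ul><li>"
                + "<a href=\"/dotlrn/classes/c033/13048/c12c033a13048gA/c12c033a13048gAsL01/\">"
                + "2011-12 Processad.Llenguatge Gr.A Sgr.L01 (13048) </a>"
                + "</li></ul></li></ul>";
        extractLinks(html).forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```

Because the selector matches descendants at any depth, the nested sub-group links are picked up together with the top-level ones.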

nicholas.hauschild
0

Regex:

\<a\s+href\s*\=\s*["']/dotlrn/classes/c033.+\>(.*)\(\d+\)\</a\>

Java String:

"\\<a\\s+href\\s*\\=\\s*[\"']/dotlrn/classes/c033.+\\>(.*)\\(\\d+\\)\\</a\\>"

You probably won't find it reliable, but if the pages match what you supplied, the first capturing group will hold your desired string.

Here is a place to test Java regular expressions
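For illustration, here is a minimal sketch of how a pattern like this might be applied with `java.util.regex`. It uses a variant of the pattern above rather than the exact one: it also captures the href, and it tolerates whitespace before `</a>`. The class name `CourseLinksRegex` and the helper `extractLinks` are hypothetical, and the same reliability caveats apply: it only works while the markup keeps this exact shape.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CourseLinksRegex {

    // Group 1: href value. Group 2: link text (lazy, so it stops at the first </a>).
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href\\s*=\\s*[\"'](/dotlrn/classes/[^\"']*)[\"'][^>]*>\\s*(.*?)\\s*</a>",
            Pattern.DOTALL);

    // Hypothetical helper: maps each href to its link text.
    static Map<String, String> extractLinks(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.put(m.group(1), m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Shortened sample based on the HTML in the question.
        String html = "<li>\n"
                + "  <a href=\"/dotlrn/classes/c033/13048/c12c033a13048gA/\">"
                + "2011-12 Processad.Llenguatge Gr.A  (13048)</a>\n"
                + "</li>";
        extractLinks(html).forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```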

Lee Meador
0

Why not use the DOM API? You can get attributes and values fairly trivially with it.
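As a rough sketch of what that could look like with the JDK's built-in DOM parser (`javax.xml.parsers`): this assumes the fragment is well-formed XML with a single root element, as the sample in the question appears to be, and it will fail on ordinary tag-soup HTML. The class name `CourseLinksDom` and the helper `extractLinks` are made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CourseLinksDom {

    // Hypothetical helper: maps each href to the trimmed text of its <a> element.
    static Map<String, String> extractLinks(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            Map<String, String> links = new LinkedHashMap<>();
            // getElementsByTagName returns all descendants, in document order.
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                if (a.hasAttribute("href")) {
                    links.put(a.getAttribute("href"), a.getTextContent().trim());
                }
            }
            return links;
        } catch (Exception e) {
            // Thrown, for example, when the input is not well-formed XML.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened sample based on the HTML in the question.
        String xhtml = "<ul id=\"tree-dotlrn_class_instance\"><li>"
                + "<a href=\"/dotlrn/classes/c033/13055/c12c033a13055gAA/\">"
                + "2011-12 Sist.Gr.AA (13055)</a></li></ul>";
        extractLinks(xhtml).forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```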

ldam
0

You could certainly try using XML pull parsing or DOM, provided the input HTML is well-formed.
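A pull-parsing version might look something like the sketch below. It uses StAX (`javax.xml.stream`), the pull-style API that ships with the JDK, which is an assumption on my part since the answer does not name a specific pull parser. The class name `CourseLinksStax` and the helper `extractLinks` are hypothetical, and the code only works if the input is well-formed XML with a single root element.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CourseLinksStax {

    // Hypothetical helper: maps each href to the trimmed text of its <a> element.
    static Map<String, String> extractLinks(String xhtml) {
        Map<String, String> links = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(xhtml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "a".equals(reader.getLocalName())) {
                    String href = reader.getAttributeValue(null, "href");
                    // getElementText() reads up to </a>; it throws if the
                    // anchor contains child elements instead of plain text.
                    String text = reader.getElementText().trim();
                    if (href != null) {
                        links.put(href, text);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Shortened sample based on the HTML in the question.
        String xhtml = "<ul id=\"tree-dotlrn_class_instance\"><li>"
                + "<a href=\"/dotlrn/classes/c033/14009/c12c033a14009gA/\">"
                + "2011-12 Administrac. Gr.A (14009)</a></li></ul>";
        extractLinks(xhtml).forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```

Unlike DOM, this reads the document as a stream of events and never builds a tree in memory, which matters little for a fragment this small but keeps the code short.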

Waleed Almadanat