1

I have this HTML node:

<li>
    <em>Description
    </em>
    <br>
    TEXT TEXT                
</li>

I want to extract the text `TEXT TEXT`.

I tried this:

 sel.xpath('//em[normalize-space(.) = "Description"]/following-sibling::*')

I got an empty result.

Why?

I need to check for "Description", so please don't suggest answers that change the check on the description.

Marco Dinatsoli

2 Answers

3

I found the solution myself:

'//li[contains(em,"Description")]/text()[last()]'
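The `*` node test matches element nodes only, so `following-sibling::*` can never return the text after the `<br>`; `text()` selects text nodes instead. Here is a minimal sketch of the difference, assuming `lxml` is installed and using its HTML parser directly rather than a Scrapy selector (the wrapping `<ul>` and the variable names are only for illustration):

```python
from lxml import html  # third-party; assumed to be installed

# The node from the question, wrapped in a <ul> so it is a complete fragment.
SNIPPET = """
<ul>
<li>
    <em>Description
    </em>
    <br>
    TEXT TEXT
</li>
</ul>
"""

tree = html.fromstring(SNIPPET)

# "*" matches elements only, so the text after <br> is never selected here.
elements = tree.xpath('//em[normalize-space(.) = "Description"]/following-sibling::*')
element_tags = [el.tag for el in elements]

# text() matches text nodes; discard the whitespace-only ones.
texts = tree.xpath('//em[normalize-space(.) = "Description"]/following-sibling::text()')
after_em = [t.strip() for t in texts if t.strip()]

# The XPath from this answer: the last text node child of the matching <li>.
last_text = tree.xpath('//li[contains(em, "Description")]/text()[last()]')
from_li = [t.strip() for t in last_text]

print(element_tags, after_em, from_li)
```

Both text-based expressions keep the check on "Description"; the returned strings still carry the surrounding whitespace, which is why the sketch strips them.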
Marco Dinatsoli
0

That's not valid XML. Where does the `<br>` close? If it's `<br/>`, then the result will be empty because the only following sibling element is the `<br/>`, which has no content.

helderdarocha
  • This is the HTML that I got from the webpage. I can't change it, and even if I could, I wouldn't want to. I want to extract the data from the website, not fix their HTML :) – Marco Dinatsoli Feb 06 '14 at 22:55
  • If the HTML is not well-formed XML it won't parse and you won't be able to use XPath. You'll have to use something else. Are you sure it's not a `<br/>` instead of `<br>`? – helderdarocha Feb 06 '14 at 22:57
  • If it is not well-formed, and if you can't fix it, you can try to convert HTML into well-formed XHTML using [JTidy](http://jtidy.sourceforge.net/) first. After that you can use XPath. – helderdarocha Feb 06 '14 at 22:59
  • Yes, it is `<br>`, and you should know that every problem has a solution, so please don't tell me that `it won't parse`, because it will be parsed :P – Marco Dinatsoli Feb 06 '14 at 23:00
  • Let's wait for an answer that doesn't use `xhtml`. I have a lot of pages, and my system will have performance problems if I convert my HTML for every item. – Marco Dinatsoli Feb 06 '14 at 23:01
  • Well, supposing something there closes the `<br>` for you to make it work with XPath, then you have to select the node after it: `//em[normalize-space(.) = "Description"]/following-sibling::node()[3]`, since `node()[2]` is the `<br>`. – helderdarocha Feb 06 '14 at 23:08
  • I found the solution and I wrote it in an answer – Marco Dinatsoli Feb 06 '14 at 23:13