69

I have this HTML:

 <tr class="even  expanded first">
   <td class="score-time status">
     <a href="/matches/2012/08/02/europe/uefa-cup/">

            16 : 00

     </a>
    </td>        
  </tr>

I want to extract the (16 : 00) string without the extra whitespace. Is this possible?

james.garriss
adellam
  • 3
    Using what implementation - PHP, or what? XPath is concerned with the retrieval of nodes, not string handling. Any removal of whitespace would need to be done separately *after* retrieval. – Mitya Aug 02 '12 at 12:04
  • I think there is an expression to get the desired text without spaces – adellam Aug 02 '12 at 12:06
  • If we're talking about PHP (which I've somehow assumed since it's about HTML), you can set preserveWhiteSpace to false on your DOMDocument object, resulting in the automatic removal of redundant white space. http://www.php.net/manual/de/class.domdocument.php#domdocument.props.preservewhitespace – inVader Aug 02 '12 at 12:12
  • 1
    As I say, XPath is not a string-handling mechanism; it cannot remove spaces. It is concerned solely with the retrieval of data. Anything you want to do TO that data must be done separately, and currently we don't know what language you're using to do that in. – Mitya Aug 02 '12 at 12:12
  • 3
    @Utkanos: the absolute statement about the string-handling capabilities of XPath is proven wrong -- by my answer. :) – Dimitre Novatchev Aug 03 '12 at 04:31
  • @adellam: There is no need to use any additional PHP functions, such as `trim()` -- the wanted string can be produced by evaluating a single, short XPath expression. – Dimitre Novatchev Aug 03 '12 at 04:33

5 Answers

150

I. Use this single XPath expression:

translate(normalize-space(/tr/td/a), ' ', '')

Explanation:

  1. normalize-space() produces a new string from its argument, in which any leading or trailing white-space (space, tab, NL or CR characters) is deleted and any intermediary white-space is replaced by a single space character.

  2. translate() takes the result produced by normalize-space() and produces a new string in which each of the remaining intermediary spaces is replaced by the empty string.
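
To make the two steps concrete, here is a rough plain-Python imitation of what the two functions do to the question's text. This is only a sketch, not an XPath engine: the helper names `normalize_space` and `translate` are mine, and the `raw` string is an assumed approximation of the `<a>` element's string value.

```python
import re

# XPath 1.0 treats only these four characters as white-space.
XPATH_WS = " \t\n\r"


def normalize_space(s: str) -> str:
    """Imitate XPath normalize-space(): collapse white-space runs, then trim."""
    return re.sub(f"[{XPATH_WS}]+", " ", s).strip(" ")


def translate(s: str, src: str, dst: str) -> str:
    """Imitate XPath translate(): map src[i] to dst[i]; delete chars with no counterpart."""
    table = {}
    for i, ch in enumerate(src):
        if ord(ch) not in table:  # the first occurrence in src wins, as in XPath
            table[ord(ch)] = dst[i] if i < len(dst) else None
    return s.translate(table)


raw = "\n\n            16 : 00\n\n     "  # assumed string value of the <a> element
step1 = normalize_space(raw)       # leading/trailing white-space gone, inner runs collapsed
step2 = translate(step1, " ", "")  # remaining single spaces deleted
print(repr(step1), repr(step2))
```

Note that step 1 alone already yields the text with only single inner spaces; step 2 additionally removes those.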


II. Alternatively:

translate(/tr/td/a, ' &#9;&#10;&#13;', '')
Dimitre Novatchev
  • Is there a shortest XPATH expression to get only the CDATA nodes though an XML file ? – Arup Rakshit Aug 02 '14 at 11:48
  • 1
    @ArupRakshit, There are no "CDATA nodes" in the XPath Data Model and thus it is *not* possible to distinguish CDATA as part of the text node that contains it. The same way as it is not possible to know if the short tag was used for an element without children, or if quotes or apostrophes were used as delimiters around an attribute value. – Dimitre Novatchev Aug 02 '14 at 15:00
  • @DimitreNovatchev Thanks for the reply. So that means I need to find it the way I search for regular nodes. – Arup Rakshit Aug 02 '14 at 16:33
  • @ArupRakshit, Yes, one can only select *whole* text nodes in XPath. You could filter these nodes with predicate(s) if you know something more (like a substring) for the text you are looking for – Dimitre Novatchev Aug 02 '14 at 16:56
29

Please try the XPath expression below:

//td[@class='score-time status']/a[normalize-space() = '16 : 00']
Rob
Eby
9

You can use XPath's normalize-space() as in //a[normalize-space()="16 : 00"]
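
Worth noting that a predicate like this selects the `<a>` element whose normalized text equals a known value; it does not return the trimmed string itself. A minimal sketch of the same idea using only Python's standard library, assuming the question's fragment is well-formed XML (the `FRAGMENT` name is mine). ElementTree's XPath subset has no `normalize-space()`, so that comparison is done in Python, and `str.split()` also splits on Unicode whitespace, which is slightly broader than XPath's definition.

```python
import xml.etree.ElementTree as ET

# The question's fragment as well-formed XML (FRAGMENT is a hypothetical name).
FRAGMENT = """<tr class="even  expanded first">
  <td class="score-time status">
    <a href="/matches/2012/08/02/europe/uefa-cup/">

            16 : 00

    </a>
  </td>
</tr>"""

root = ET.fromstring(FRAGMENT)

# ElementTree's XPath subset handles the attribute predicate ...
candidates = root.findall(".//td[@class='score-time status']/a")

# ... while the normalize-space() = '16 : 00' comparison is done in Python.
matches = [
    a for a in candidates
    if " ".join("".join(a.itertext()).split()) == "16 : 00"
]

for a in matches:
    print(a.get("href"))
```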

Udhav Sarvaiya
3

I came across this thread while having an issue of my own similar to the one above.

HTML

<div class="d-flex">
<h4 class="flex-auto min-width-0 pr-2 pb-1 commit-title">
  <a href="/nsomar/OAStackView/releases/tag/1.0.1">

    1.0.1
  </a>

XPath start command

tree.xpath('//div[@class="d-flex"]/h4/a/text()')

However this grabbed random whitespace and gave me the output of:

['\n          ', '\n        1.0.1\n      ']

Adding a normalize-space() predicate removed the first, whitespace-only text node and left me with just the one I wanted

tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')

['\n        1.0.1\n      ']

I could then grab the first element of the list, and use strip() to remove any further whitespace

XPath final command

tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')[0].strip()

Which left me with exactly what I required:

1.0.1
jerrythebum
1
  • you can check if text() nodes are empty.

    /path/text()[not(.='')]

it may be useful with axes like following-sibling:: if these are no containers, or with child::.

  • you can use string(), or the regular-expression functions of XPath 2.0 such as matches() and replace().

NOTE: some comments say that XPath cannot do string manipulation. Even if it's not really designed for that, you can do basic things: contains(), starts-with(), and, in XPath 2.0, replace().

If you want to check whitespace nodes it's much harder, as you will generally have a node-list result set, and most XPath functions, like matches() or replace(), only operate on a single node.

  • you can separate node and string manipulation

So you may use XPath to retrieve a container, or a list of text nodes, and then process it with another language (Java, PHP, Python, or Perl, for instance).
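
A minimal sketch of that split, using only Python's standard library. The markup here is a made-up, well-formed stand-in (the `FRAGMENT` name, the `href` value, and the `<span/>` child are all assumptions of mine), chosen so that the `<a>` contains more than one text node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in markup: an <a> with a child element,
# so that it contains more than one text node.
FRAGMENT = """<h4>
  <a href="/releases/tag/1.0.1">
    <span/>
    1.0.1
  </a>
</h4>"""

root = ET.fromstring(FRAGMENT)
link = root.find("./a")  # step 1: node retrieval

# Step 2: string handling in the host language.
text_nodes = list(link.itertext())                # includes whitespace-only pieces
non_blank = [t for t in text_nodes if t.strip()]  # similar in spirit to text()[normalize-space()]
cleaned = [t.strip() for t in non_blank]

print(text_nodes)
print(cleaned)
```

The retrieval step stays simple, and all of the trimming happens in ordinary string code that is easy to test on its own.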

Chris Noe
N4553R