What is the way to strip something from text() using XPath?

Question

I am using XPath in Python to parse a table from an HTML file, with this expression:

//td//text()

This gives me two strings as output:

['australia', '$3333.99']

The output I want:

['australia', '3333.99']

I want the $ sign to be stripped off. How can I do that in general using XPath? I have tried substring-after, but it does not work.

This is how I tried it:

//td//text()[substring-after(.,'$')]

But I got this output:

['$3333.99']

'australia' was missing from the result.

The expression you tried is fine in XPath 2.0 but not in XPath 1.0. You should specify which XPath version you are using. Though it doesn't do quite what you want: try `//td//text()/substring-after(.,'$')` — Michael Kay, Feb 03 '17 at 18:17

score 2 · Accepted Answer · edited May 23 '17 at 12:01

Aside from using translate() (as posted in the other answer), you can also use the substring() function and dynamically determine where the slice begins:

In [4]: [item.xpath("substring(., starts-with(., '$') + 1)") for item in root.xpath("//td")]
Out[4]: ['australia', '3333.99']

By the way, this approach is a bit safer than using translate(): here we strip only a single $ character at the beginning of the string, if it exists, whereas translate() would remove every occurrence of $ in each td text you are extracting, which may cause unwanted side effects.
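To make the difference concrete, here is a minimal sketch, assuming lxml and a made-up cell value with a second $ in the middle. The `ctx` node and the `price` string are hypothetical; the text is passed in as an XPath variable so both expressions see the same input.

```python
from lxml import etree

# Dummy context node; the expressions below only use the $s variable.
ctx = etree.fromstring("<x/>")

# Hypothetical cell text with a '$' that is not at the start.
price = "$1,000 (was $1,200)"

# translate() with an empty replacement deletes every '$'.
stripped_all = ctx.xpath("translate($s, '$', '')", s=price)

# substring() skips one character only when the string starts with '$'.
stripped_lead = ctx.xpath("substring($s, starts-with($s, '$') + 1)", s=price)

print(stripped_all)   # 1,000 (was 1,200)
print(stripped_lead)  # 1,000 (was $1,200)
```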

Note that you have to do it in two steps in any case: the translate() or substring() functions would not be applied to every node if used like translate(//td//text(), "$", ""), because an XPath 1.0 string function that is given a node-set uses only the string value of the first node.
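A small sketch of that behaviour, assuming lxml and a hypothetical one-row table shaped like the data in the question:

```python
from lxml import html

# Hypothetical table matching the question's two cells.
root = html.fromstring(
    "<table><tr><td>australia</td><td>$3333.99</td></tr></table>"
)

# One-shot call: the node-set is converted to the string value of its
# first node, so only a single string comes back.
one_shot = root.xpath("translate(//td//text(), '$', '')")

# Two steps: select the cells, then evaluate the function once per cell.
per_cell = [
    td.xpath("substring(., starts-with(., '$') + 1)")
    for td in root.xpath("//td")
]

print(one_shot)  # australia
print(per_cell)  # ['australia', '3333.99']
```

The second cell never reaches translate() in the one-shot form, which is why the extra pass over the cells is needed.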

Or, you can trim it using Python and .lstrip():

[item.lstrip("$") for item in root.xpath("//td//text()")]
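If the cost of the Python loop is the worry, it can be measured rather than guessed. Below is a rough sketch using only the standard library. The 200,000-string list is a stand-in for 400 tables of 500 cells each, and `strip_dollar` is a hypothetical helper name. Note that lstrip("$") removes all leading $ characters, not just one.

```python
import timeit

# Stand-in workload: 400 tables x 500 cells = 200,000 strings.
cells = ["australia", "$3333.99"] * 100_000


def strip_dollar(values):
    """Return a new list with leading '$' characters removed."""
    return [v.lstrip("$") for v in values]


cleaned = strip_dollar(cells)

# Average wall-clock time of one full pass over the list.
elapsed = timeit.timeit(lambda: strip_dollar(cells), number=5) / 5
print(f"{len(cells)} strings stripped in about {elapsed:.4f} s per pass")
```

The printed number depends on the machine, so no figure is claimed here; the point is only that the loop is easy to time before deciding it is the bottleneck.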

I am aware of this, but I don't want to loop over the list because I have more than 500 of them and it makes the function slow. I was looking for a way to do it in XPath, maybe using `translate`. – anekix, Feb 03 '17 at 16:35
@anekix I saw the translate variant and decided to post an alternative approach, check it out. – alecxe, Feb 03 '17 at 16:59
@anekix Also, I linked a discussion about why you cannot do it in one go and need an extra loop. – alecxe, Feb 03 '17 at 17:03

宏杰李 · Answer 2 · 2017-02-03T16:43:59.490

0

//td//text()[substring-after(.,'$')]

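This evaluates the predicate for each text() node in ['australia', '$3333.99']. A predicate only filters nodes; it does not change them. 'australia' does not contain $, so substring-after returns an empty string, which counts as false, and that node is dropped from the result. To remove the $ instead, apply translate() to each cell:

[td.xpath('translate(., "$", "")') for td in tree.xpath("//td")]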
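Here is a short sketch that puts the two expressions side by side, assuming lxml and a hypothetical one-row table built from the question's values (the variable names are illustrative only):

```python
from lxml import html

# Hypothetical table matching the question's two cells.
tree = html.fromstring(
    "<table><tr><td>australia</td><td>$3333.99</td></tr></table>"
)

# Predicate form: keeps only text nodes where substring-after() is non-empty.
filtered = tree.xpath("//td//text()[substring-after(., '$')]")

# Per-cell form: translate() deletes '$' and returns a string for every cell.
translated = [td.xpath('translate(., "$", "")') for td in tree.xpath("//td")]

print(filtered)    # ['$3333.99']
print(translated)  # ['australia', '3333.99']
```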

edited Feb 03 '17 at 16:43

answered Feb 03 '17 at 16:36

@anekix XPath is used to locate tags, not to modify them. Yes, XPath can do this task, but Python's `strip` is the better choice. – 宏杰李 Feb 03 '17 at 16:41
I have about 500 lists to apply lstrip on. It's not a big deal, I know, but I have to do this for 400 tables, so now it's 400 x 500. It's a costly loop, I think, and it slows down my application. – anekix Feb 03 '17 at 16:43
Isn't it the same? I mean, I still have to do a separate iteration over the list, right? – anekix Feb 03 '17 at 16:48
@anekix yes, it's inevitable – 宏杰李 Feb 03 '17 at 16:50
@anekix I think you should get rid of all `$` in your HTML file before you use XPath; this is much more efficient. – 宏杰李 Feb 03 '17 at 16:52
