I have this text in an HTML page:

<div class="phone-content">

                            ‪050 2836142‪

                    </div>

I am using XPath to extract the value inside that div like this:

normalize-space(.//div[@class='fieldset-content']/span[@class='listing-reply-phone']/div[@class='phone-content']/text())

I got this result:

"\u202a050 2836142\u202a"

Does anyone know how to tell XPath in Python to remove those Unicode characters?

Marco Dinatsoli
  • If it's only numbers, you can convert to ASCII. See this: http://stackoverflow.com/questions/1207457/convert-unicode-to-a-string-in-python-containing-extra-symbols – helderdarocha Feb 23 '14 at 00:13

1 Answer

If you're looking for an XPath solution: to remove all characters but those from a given set, you can use two nested translate(...) calls following this pattern:

translate($string, translate($string, ' 0123456789', ''), '')

This will remove all characters that are not the space character or a digit. You will have to replace both occurrences of $string with the complete XPath expression that fetches that string.
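To see why the nested pattern works, here is a minimal sketch that emulates the behaviour of XPath 1.0 `translate()` in plain Python. The helper `xpath_translate` is a hypothetical name written for illustration only and is not part of any XPath library. The sample string is the result shown in the question.

```python
def xpath_translate(value: str, from_chars: str, to_chars: str) -> str:
    """Illustrative emulation of XPath 1.0 translate(); not a library function."""
    result = []
    for ch in value:
        index = from_chars.find(ch)  # the first occurrence in from_chars wins
        if index == -1:
            result.append(ch)  # character is not listed: keep it unchanged
        elif index < len(to_chars):
            result.append(to_chars[index])  # listed with a replacement
        # listed without a replacement: the character is dropped
    return "".join(result)


raw = "\u202a050 2836142\u202a"
keep = " 0123456789"

# Inner call: delete everything we want to keep, leaving only the unwanted characters.
unwanted = xpath_translate(raw, keep, "")

# Outer call: delete exactly those unwanted characters from the original string.
cleaned = xpath_translate(raw, unwanted, "")
```

The inner call computes the set of characters to discard, and the outer call uses that set as its "from" argument with an empty "to" argument, so only spaces and digits survive.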

It might be more reasonable though to apply that outside XPath using more advanced string manipulation features. Those of XPath 1.0 are very limited.
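As one possible way to do this on the Python side, the sketch below drops Unicode format characters (general category `Cf`, which includes U+202A) from the string returned by the XPath query. The function name `strip_format_chars` is an assumption made for this example, and it uses only the standard library.

```python
import unicodedata


def strip_format_chars(text: str) -> str:
    """Remove Unicode format characters (category Cf) and trim surrounding whitespace."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf").strip()


# The value returned by the XPath expression in the question.
result = "\u202a050 2836142\u202a"
phone = strip_format_chars(result)
```

Filtering by category rather than by a fixed whitelist keeps any other visible characters (for example a leading `+`) intact, which may or may not be what you want for phone numbers.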

Jens Erat