I have the following code, which extracts data from a <div> tag.

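from bs4 import BeautifulSoup

# driver is an already-initialised browser driver that exposes page_source
s = BeautifulSoup(driver.page_source, "lxml")

# find_all is the current name of the older findAll alias
best_price_tags = s.find_all('div', "flt-subhead1 gws-flights-results__price gws-flights-results__cheapest-price")
best_prices = []
for tag in best_price_tags:
    best_prices.append(tag.text.replace('€', '').strip())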

The first element of the variable best_price_tags contains the following:

<div class="flt-subhead1 gws-flights-results__price gws-flights-results__cheapest-price">      1 820 €   </div>

I expect the above code to output only the value 1821.

The problem is that the code above instead outputs the following for best_price_tags[0]: '1\u202f821'.

I tried the following, but unfortunately it did not work for me:

for tag in best_price_tags:
    best_prices.append(int(tag.text.replace('€', '').strip()))

I am looking for an automated solution that does not use NLP modules.

NOTE: I have edited the question to show the exact value the <div> tag has. It was <div class='...'>1 820 €</div> and it is now <div class='...'> 1 820 € </div>.

Joe

1 Answer

The space in 1 821 seems to be a narrow no-break space (U+202F, which is what shows up as \u202f in the output). Try doing a replace on it too. By the way, I don't know where this character is on a keyboard, but copying and pasting it, or writing the escape '\u202f' in the string, should be enough.
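A minimal sketch of that replace, assuming the tag text looks like the value shown in the question. The sample string and the helper name `parse_price` are made up for illustration, and the extra replace on '\xa0' is only an assumption in case an ordinary no-break space sits before the currency sign:

```python
def parse_price(text):
    """Convert a scraped price such as '1\u202f820\xa0€' to an int."""
    cleaned = (
        text.replace('€', '')
            .replace('\u202f', '')  # narrow no-break space used as thousands separator
            .replace('\xa0', '')    # ordinary no-break space (assumed, may not be present)
            .strip()                # leading/trailing regular whitespace
    )
    return int(cleaned)


# Assumed sample, modelled on the <div> content in the question
sample = '      1\u202f820\xa0€   '
print(parse_price(sample))
```

Inside the loop this would become `best_prices.append(parse_price(tag.text))`.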

malmiteria
  • Thanks, @malmiteria. Your idea works, but I have to write this whole line in the `for` loop: `best_prices.append(tag.text.replace('\xa0€', '').replace('\u202f', '').strip())`, which I don't consider very pythonic. Is it? – Joe Feb 14 '20 at 21:34
  • Well, I don't know for sure, and I don't know how to do it otherwise. There is a discussion about it here: https://stackoverflow.com/questions/6116978/how-to-replace-multiple-substrings-of-a-string – malmiteria Feb 14 '20 at 21:40
  • Check my updated post concerning the exact value the `<div>` tag has. – Joe Feb 14 '20 at 21:41
  • It still needs to be checked whether we can remove all spaces in the value the `<div>` tag has, after which it only remains to write `int(tag.text.replace('€', ''))`. – Joe Feb 14 '20 at 21:50
  • Doesn't `.strip()` remove all the extra whitespace? – malmiteria Feb 14 '20 at 21:51
  • Yes, it does, but not the spaces between the digits. It only removes the spaces to the left and right of 1 820 €. – Joe Feb 14 '20 at 22:08
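
One way to drop every space, including the ones between the digits that `.strip()` leaves alone, is to keep only the digit characters. This is a rough sketch rather than a definitive fix. It assumes the prices are whole numbers with no decimal part (a value like "1 820,50 €" would come out wrong), and `digits_only` is a hypothetical helper name:

```python
import re


def digits_only(text):
    """Drop every non-digit character (spaces of any kind, the currency sign)
    and return what is left as an int."""
    return int(re.sub(r'\D', '', text))


# Assumed sample, modelled on the <div> content in the question
sample = '      1\u202f820\xa0€   '
print(digits_only(sample))
```

Because `\D` matches anything that is not a decimal digit, the no-break spaces and the euro sign are removed in one pass, so no chain of `.replace()` calls is needed.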