Parsing the DOM to extract data using Python

Question

I have the following code that extracts data from a <div> tag.

from bs4 import BeautifulSoup

# driver is assumed to be an existing Selenium WebDriver
s = BeautifulSoup(driver.page_source, "lxml")

best_price_tags = s.findAll('div', "flt-subhead1 gws-flights-results__price gws-flights-results__cheapest-price")
best_prices = []
for tag in best_price_tags:
    best_prices.append(tag.text.replace('€', '').strip())

The first element of the variable best_price_tags contains the following:

<div class="flt-subhead1 gws-flights-results__price gws-flights-results__cheapest-price">      1 820 €   </div>

I expect the above code to output only the value 1821.

The problem is that it instead outputs the following for best_price_tags[0]: '1\u202f821'.

I tried the following, but unfortunately it did not work for me.

for tag in best_price_tags:
    best_prices.append(int(tag.text.replace('€', '').strip()))

I am looking for an automated solution that does not use NLP modules.

NOTE: I have edited the question to show the exact value the <div> tag has. It was <div class='...'>1 820 €</div> and now it is <div class='...'> 1 820 € </div>.

\u202f is the Unicode escape for the small space between the 1 and the 8. Are you using Python 3? – Roy Zwambag, Feb 14 '20 at 21:24
Thanks @RoyZwambag, I am using Python 3 in a Jupyter notebook. – Joe, Feb 14 '20 at 21:26
https://www.fileformat.info/info/unicode/char/202f/index.htm – Peter Wood, Feb 14 '20 at 21:29

score 1 · Accepted Answer · answered Feb 14 '20 at 21:24

The space in 1 821 appears to be a narrow no-break space (U+202F), which is what shows up as \u202f in the output. Try doing a replace on that character too. By the way, I don't know where this character is on a keyboard, but copy/paste should be enough.
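
As a rough sketch of an alternative to chaining one replace per character, one could keep only the decimal digits of the tag text. The helper name `parse_price` is made up for this illustration, and the sample string is an assumption pieced together from the question (a narrow no-break space inside the number, and a no-break space before the euro sign). It also assumes whole-euro prices, since any decimal separator would be dropped as well.

```python
def parse_price(text: str) -> int:
    """Keep only the decimal digits of text and convert them to an int.

    Hypothetical helper for illustration; assumes whole-number prices.
    """
    digits = "".join(ch for ch in text if ch.isdecimal())
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


# Assumed sample of tag.text, based on the <div> shown in the question
raw = "      1\u202f820\xa0€   "
print(parse_price(raw))  # 1820
```

With this, the loop body would become `best_prices.append(parse_price(tag.text))`.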

malmiteria


Thanks, @malmiteria. Your idea works, but I have to write this whole line inside the `for` loop: `best_prices.append(tag.text.replace('\xa0€', '').replace('\u202f', '').strip())`, which I don't consider Pythonic. Or is it? – Joe Feb 14 '20 at 21:34
Well, I don't know for sure, and I don't know how to do it otherwise. There is a discussion about it here: https://stackoverflow.com/questions/6116978/how-to-replace-multiple-substrings-of-a-string – malmiteria Feb 14 '20 at 21:40
Check my updated post concerning the exact value the `<div>` tag has. – Joe Feb 14 '20 at 21:41
It still needs to be checked whether we can remove all spaces in the value the `<div>` tag has; then all that remains is to write `int(tag.text.replace('€', ''))`. – Joe Feb 14 '20 at 21:50
Doesn't `.strip()` remove all the extra whitespace? – malmiteria Feb 14 '20 at 21:51
Yes, it does, but not the space inside the number. It only removes the spaces to the left and right of 1 820 €. – Joe Feb 14 '20 at 22:08
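
To illustrate the point about `.strip()` from the last few comments, here is a small sketch. The sample string is again an assumption reconstructed from the question, not the real page content. `str.split()` with no arguments splits on any Unicode whitespace, which includes both `\u202f` and `\xa0`, so joining the pieces back together removes the inner spaces as well.

```python
# Assumed sample of tag.text, based on the <div> shown in the question
raw = "      1\u202f820\xa0€   "

# strip() only trims whitespace at the two ends of the string
stripped = raw.strip()
print(repr(stripped))  # '1\u202f820\xa0€'

# split() with no arguments splits on every run of Unicode whitespace,
# so joining the pieces drops the inner spaces too
compact = "".join(raw.split())
print(repr(compact))  # '1820€'

price = int(compact.replace("€", ""))
print(price)  # 1820
```

This gets close to the `int(tag.text.replace('€', ''))` form mentioned above, with the whitespace removed first.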
