I am scraping https://www.americanexpress.com/in/credit-cards/payback-card/ with Python and Beautiful Soup.

from urllib.request import urlopen
from bs4 import BeautifulSoup

link = 'https://www.americanexpress.com/in/credit-cards/payback-card/'
html = urlopen(link)
soup = BeautifulSoup(html, 'lxml')

details = []

for span in soup.select(".why-amex__subtitle span"):
    details.append(f'{span.get_text(strip=True)}: {span.find_next("span").get_text(strip=True)}')

print(details)

Output:

['EARN POINTS: Earn multiple Points from more than 50 PAYBACK partners2and 2 PAYBACK Points from American\xa0Express PAYBACK Credit\xa0Card for every Rs.\xa0100 spent', 'WELCOME GIFT: Get Flipkart voucher worth Rs. 7503on taking 3 transactions within 60 days of Cardmembership', 'MILESTONE BENEFITS: Flipkart vouchers4worth Rs. 7,000 on spending Rs. 2.5 lacs in a Cardmembership yearYou will earn a Flipkart voucher4worth Rs. 2,000 on spending Rs. 1.25 lacs in a Cardmembership year. Additionally, you will earn a Flipkart voucher4worth Rs. 5,000 on spending Rs. 2.5 lacs in a Cardmembership year.']

As you can see, the output contains \xa0 (non-breaking space) characters that I want to remove from the strings.

I tried calling replace inside the f-string, but that does not work, because the expression part of the f-string cannot contain a backslash:

details.append(f'{span.get_text(strip=True)}: {span.find_next("span").get_text(strip=True).replace("\xa0","")}')

Is there an alternative way to do this?

Any help is highly appreciated!

  • this is what you are looking for: https://stackoverflow.com/questions/10993612/how-to-remove-xa0-from-string-in-python – smitkpatel Feb 15 '21 at 17:16
  • Reopening. The supposed duplicate does not work inside an f-string and does not address f-strings. – user2357112 Feb 15 '21 at 17:28
  • @smitkpatel No, it does not answer my question; I was looking for a solution that keeps the existing f-string code. –  Feb 15 '21 at 17:30

2 Answers

As a temporary workaround, since .replace("\xa0","") does not work inside the f-string, do the replacement outside it first:

from urllib.request import urlopen
from bs4 import BeautifulSoup

link = 'https://www.americanexpress.com/in/credit-cards/payback-card/'
html = urlopen(link)
soup = BeautifulSoup(html, 'lxml')

details = []

for span in soup.select(".why-amex__subtitle span"):

    element = span.get_text(strip=True).replace("\xa0","")
    next_element = span.find_next("span").get_text(strip=True).replace("\xa0","")
    details.append(f'{element}: {next_element}')

print(details)
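
If the replacement really has to stay inside the f-string, one option is to bind the character to a name beforehand, because only the expression between the braces is subject to the backslash restriction (which applies to Python versions before 3.12). The following is a minimal sketch on a hard-coded sample string rather than the live page; the names NBSP, sample and cleaned are illustrative and not part of the original code. It replaces the character with an ordinary space instead of an empty string, on the assumption that words such as "American Express" should stay separated.

```python
# Illustrative names: NBSP, sample and cleaned are not from the original code.
NBSP = "\xa0"  # non-breaking space, defined outside the f-string

# Hard-coded stand-in for the text returned by get_text(strip=True).
sample = "American\xa0Express PAYBACK Credit\xa0Card for every Rs.\xa0100 spent"

# The expression inside the braces contains no backslash, so this also
# parses on Python versions before 3.12.
cleaned = f'EARN POINTS: {sample.replace(NBSP, " ")}'
print(cleaned)
```

Passing "" instead of " " as the second argument to replace would delete the character outright, which joins the neighbouring words together.
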
Utpal Dutt

You can use unicodedata to handle the \xa0 characters: NFKD normalization turns each one into an ordinary space. Applying it before building the f-string keeps the backslash out of the f-string entirely:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from unicodedata import normalize

link = 'https://www.americanexpress.com/in/credit-cards/payback-card/'
html = urlopen(link)
soup = BeautifulSoup(html, 'lxml')

details = []

for span in soup.select(".why-amex__subtitle span"):
    a = normalize('NFKD', span.get_text(strip=True))
    b = normalize('NFKD', span.find_next("span").get_text(strip=True))
    details.append(f'{a}: {b}')

print(details)
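
To see what the normalization does without fetching the page, here is a small sketch on hard-coded strings; the variable names are illustrative and not part of the original code. It also shows a side effect worth knowing about: NFKD decomposes other characters too, so an accented letter becomes a base letter followed by a combining mark.

```python
from unicodedata import normalize

# Hard-coded stand-in for scraped text; the names here are illustrative.
sample = "Credit\xa0Card for every Rs.\xa0100 spent"

# NFKD applies compatibility decomposition, which maps the non-breaking
# space U+00A0 to an ordinary space U+0020.
normalized = normalize("NFKD", sample)
print(repr(normalized))

# Side effect: a precomposed accented letter is split into a base letter
# plus a combining accent, so the string gets longer.
accented = "caf\u00e9"
decomposed = normalize("NFKD", accented)
print(len(accented), len(decomposed))
```

If the text may contain accented characters that should stay precomposed, "NFKC" is a possible alternative form, since it also maps the non-breaking space to an ordinary space but recomposes the letters afterwards.
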
RJ Adriaansen