I have tried solutions such as encode('ascii', errors='ignore'), but I am unable to remove these hex characters from the string using Python. Here is my code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
import datetime
import time

options = Options()
options.add_argument("--disable-gpu")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome("C:/Webdriver/chromedriver.exe",options=options)
driver.get('https://www.trustradius.com/products/oracle-analytics-cloud/reviews?f=0&o=recent')
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'lxml')

scripts = soup.find_all('script')[-8].string
script = scripts.split('=',1)[1]
for item in script.split("\n"):
    if "searchData" in item:
        item = item.replace('searchData: ','')
        line = item[0:500]
        line = line.encode('ascii', errors='ignore').decode("utf-8")
        print(line)

Please let me know if anyone has a solution for this. Thanks.

1 Answer

A simple fix is the code below: if the only unwanted sequence is the literal text \x20 (an escaped space), replace it with a space.

item = item.replace('\\x20', ' ')

I found the answers to a similar question on Stack Overflow helpful; please refer to them: How to remove \xa0 from string in Python?
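
If the text contains many different \xNN sequences rather than just \x20, the single replace above could be generalized with a regular expression. The following is only a minimal sketch, assuming the scraped text holds literal backslash-x escapes (a backslash, the letter x, and two hex digits) rather than real non-ASCII characters. The names decode_hex_escapes, HEX_ESCAPE, and sample are hypothetical and not part of the original code.

```python
import re

# Matches a literal backslash, 'x', and two hex digits, e.g. the four characters \x20
HEX_ESCAPE = re.compile(r'\\x([0-9a-fA-F]{2})')

def decode_hex_escapes(text):
    """Replace every literal \\xNN sequence with the character it encodes."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Hypothetical sample; the raw string keeps the backslashes as literal characters
sample = r'Oracle\x20Analytics\x20Cloud\x3A\x20reviews'
print(decode_hex_escapes(sample))
```

In the loop from the question, this would presumably be applied as item = decode_hex_escapes(item) before slicing the first 500 characters.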

I hope this solution fixes your issue!

LearnerLaksh
  • Using the above solution I can replace one character, but there are many such hex characters. How do I replace them all with the characters they stand for? – Nitesh Rao Jun 05 '21 at 13:59
  • @NiteshRao I have tried many approaches, such as unicodedata.normalize, re.sub, and decode/encode, and none of them seems to work. The only thing that worked was replacing with a blank. I am curious about the objective of this script; on the given website I found content containing text like \x20. Is the issue related to encoding? I don't know. – LearnerLaksh Jun 05 '21 at 17:53
  • @NiteshRao However, here are some links that deal with this kind of issue; please refer to them: https://nedbatchelder.com/text/unipain.html , https://docs.python.org/3/library/unicodedata.html , https://www.programcreek.com/python/example/470/unicodedata.normalize – LearnerLaksh Jun 05 '21 at 17:54
  • I tried the solutions given in the links above, but with no results. I wrote a separate small script with the item[0:500] string assigned to a variable, and there it works. I don't know why it does not work in the code above. – Nitesh Rao Jun 06 '21 at 05:51
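
One possible explanation for the last comment, assuming the test string was pasted into a normal (non-raw) Python string literal: Python itself converts \x20 in a literal into a real space, while the text scraped from the page keeps the backslash, x, 2, and 0 as four separate ASCII characters. The sketch below illustrates the difference; the variable names typed, scraped, and cleaned are hypothetical.

```python
typed = 'Oracle\x20Analytics'     # Python turns \x20 into one real space
scraped = r'Oracle\x20Analytics'  # raw string: backslash, x, 2, 0 stay as four characters

print(len(typed), len(scraped))   # the lengths differ
print(typed == scraped)           # False

# Every character of the literal escape is plain ASCII, so errors='ignore'
# has nothing to drop and the string comes back unchanged.
cleaned = scraped.encode('ascii', errors='ignore').decode('ascii')
print(cleaned == scraped)         # True
```

If this is what happened, it would explain why encode('ascii', errors='ignore') appears to do nothing on the scraped text.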