Removing UTF Data of the form "xbd5\xef\xbf\xbdFDK\xef\xbf\xbdCP\xef\xbf\xbdHP\xef\xbf\xbd\xef\xbf\xbd6N" in Python

Question

I have a list of URLs from which I need to scrape data using Python. I am using the code below to scrape the data:

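import urllib.request

from bs4 import BeautifulSoup


def extract_url_data1(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> elements so only visible text remains.
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    # Strip each line, split on double spaces, and drop empty chunks.
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = " ".join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))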

I am storing the returned data in a text file. The issue I am facing is that some URLs return data of the form "xbd5\xef\xbf\xbdFDK\xef\xbf\xbdCP\xef\xbf\xbdHP\xef\xbf\xbd\xef\xbf\xbd6N". I want only proper English words to be stored in the text file. Please advise how I can achieve this; I have already tried regular expressions such as the one below:

re.sub(r'[^\x00-\x7f]',r' ',text)

score 0 · Answer 1 · edited May 23 '17 at 12:28

If you want to remove non-English letters, here you go (on Python 3, re.ASCII is needed so that \w matches only ASCII letters, digits, and underscores):

In [1]: import re

In [2]: s = "xbd5\xef\xbf\xbdFDK\xef\xbf\xbdCP\xef\xbf\xbdHP\xef\xbf\xbd\xef\xbf\xbd6N"

In [3]: ' '.join(re.findall(r'\w+', s, flags=re.ASCII))
Out[3]: 'xbd5 FDK CP HP 6N'
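
For context, the bytes \xef\xbf\xbd are the UTF-8 encoding of U+FFFD, the Unicode replacement character, and they most likely end up as literal text because str() applied to a bytes object in Python 3 returns its repr. A minimal sketch, assuming the text is cleaned while it is still a str and before any encoding; the helper name clean_text and the sample string are made up for illustration:

```python
import re


def clean_text(text):
    """Replace every non-ASCII character with a space, then collapse whitespace."""
    ascii_only = re.sub(r'[^\x00-\x7f]', ' ', text)
    return ' '.join(ascii_only.split())


# Hypothetical sample: scraped text containing U+FFFD replacement characters.
sample = "xbd5\ufffdFDK\ufffdCP\ufffdHP\ufffd\ufffd6N"

# str() on bytes yields the repr, so the escape sequences become literal characters.
as_repr = str(sample.encode('utf-8'))

cleaned = clean_text(sample)
print(as_repr)
print(cleaned)
```

With this approach the cleaned str can be written directly to a file opened with open(path, 'w', encoding='utf-8'), so no manual encode() call is needed.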

However, if you want to keep only valid English words, you'll need to validate them. The question "How to check if a word is an English word with Python?" will help you.
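
As a rough sketch of that validation step, the tokens can be filtered against a set of known words. The tiny KNOWN_WORDS set and the keep_english_words helper below are placeholders invented for illustration; a real word list (a dictionary file or a spell-checking library) would take their place:

```python
import re

# Hypothetical, tiny vocabulary for illustration only.
KNOWN_WORDS = {"the", "quick", "brown", "fox", "price", "list"}


def keep_english_words(text, vocabulary=KNOWN_WORDS):
    """Return the ASCII-letter tokens of text that appear in vocabulary."""
    tokens = re.findall(r'[A-Za-z]+', text)
    # Compare case-insensitively but keep the original spelling in the output.
    return ' '.join(t for t in tokens if t.lower() in vocabulary)


result = keep_english_words("xbd5\ufffdFDK The quick\ufffdfox 6N price")
print(result)
```

Matching on [A-Za-z]+ discards digits and non-ASCII characters before the lookup, so fragments such as FDK are dropped simply because they are not in the vocabulary.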
