I have a list of URLs from which I need to scrape data using Python. I am using the code below to scrape the data:
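import urllib.request
from bs4 import BeautifulSoup

def extract_url_data1(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style elements so only visible text remains
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    # Strip each line, split it into phrases, and drop empty chunks
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    text = " ".join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))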
I am storing the returned data in a text file. The issue I am facing is that some URLs return data in the form "xbd5\xef\xbf\xbdFDK\xef\xbf\xbdCP\xef\xbf\xbdHP\xef\xbf\xbd\xef\xbf\xbd6N". I want only proper English words to be stored in the text file. Please advise how I can achieve this. I have already tried regular expressions such as the one below:
re.sub(r'[^\x00-\x7f]',r' ',text)
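
To make the symptom easier to reproduce, here is a minimal sketch, assuming Python 3. The sample string is made up, and keep_ascii_words is a hypothetical helper, not part of my original code. The bytes \xef\xbf\xbd are the UTF-8 encoding of U+FFFD, the replacement character, and calling str() on a bytes object returns its repr. That may be why the escapes end up as literal ASCII text, which the regex above cannot match.

```python
import re

# Made-up sample: U+FFFD is the replacement character that shows up when
# bytes could not be decoded; its UTF-8 encoding is EF BF BD.
sample = "5\ufffdFDK\ufffdCP hello world"

# str() on a bytes object returns its repr, so the escapes become literal ASCII text.
as_repr = str(sample.encode("utf-8"))

# The repr is already pure ASCII, so the non-ASCII regex has nothing to replace.
unchanged = re.sub(r"[^\x00-\x7f]", " ", as_repr)

# Applied to the decoded text instead, the same regex does remove the character.
cleaned = re.sub(r"[^\x00-\x7f]", " ", sample)


def keep_ascii_words(text):
    """Keep only whitespace-separated tokens made entirely of ASCII letters.

    This is an assumed, very rough definition of "proper English words".
    """
    return " ".join(tok for tok in text.split() if re.fullmatch(r"[A-Za-z]+", tok))


print(as_repr)
print(cleaned)
print(keep_ascii_words(cleaned))
```

If this is what is happening, returning text directly and opening the output file with encoding="utf-8" would avoid the escaped form, though it would still not filter out non-words such as "FDK".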