After scraping texts from various websites, I want to normalize this text in order to analyze it. One step I want to do is to replace multiple white spaces with a single white space.
I know this topic has been addressed frequently on Stack Overflow. However, the common approaches, such as:
string = ' '.join(string.split())
or
string = re.sub(' +', ' ', string)
do not appear to yield the expected result for every web page. Below is an extract of the code I use and an example of a SEC filing for which I cannot get rid of the multiple white spaces.
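For reference, both idioms do collapse ordinary runs of spaces; a minimal check (the sample string is made up) shows that `' '.join(s.split())` handles tabs, newlines and non-breaking spaces, while `re.sub(' +', ' ', s)` only collapses the ASCII space character:

```python
import re

s = "a  \t\n b\xa0\xa0c"        # made-up sample with mixed whitespace

joined = ' '.join(s.split())     # str.split() splits on any Unicode whitespace
subbed = re.sub(' +', ' ', s)    # only collapses runs of ASCII spaces

print(repr(joined))              # 'a b c'
print(repr(subbed))              # the tab, newline and \xa0 survive
```

So if non-space characters are what is left over, only the first idiom would be expected to remove them.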
import re
from selenium import webdriver
link = r"https://www.sec.gov/Archives/edgar/data/1800/000104746919001316/a2237648zdef14a.htm"
driver = webdriver.Chrome('./chromedriver')
driver.get(link)
x = driver.page_source
#Function to clean
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
#Cleaning
x = str(x).replace('<', ' <')
x = cleanhtml(x)
x = x.replace('<br>', ' ').replace('&nbsp;', ' ').replace('&amp;', '&').replace('/\s\s+/g', ' ').replace('•', ' ').replace("<", " ").replace("_", " ").replace("●", " ")
x = ' '.join(x.split())
#Result still contains multiple white spaces :-(
print(x)
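To see which characters actually make up the surviving gaps, I used a small diagnostic helper (my own sketch, not part of the scraping code) that lists every run of two or more whitespace-like characters and names them:

```python
import re
import unicodedata

def whitespace_runs(text):
    """Return every run of 2+ whitespace-like characters (incl. NBSP and zero-width space)."""
    return [m.group() for m in re.finditer(r'[\s\u200b]{2,}', text)]

def describe(run):
    """Map each character in a run to its Unicode name for inspection."""
    return [unicodedata.name(ch, hex(ord(ch))) for ch in run]

sample = 'foo  bar\xa0\xa0baz'   # made-up sample
for run in whitespace_runs(sample):
    print(describe(run))
```

Whatever names show up in the output (e.g. NO-BREAK SPACE) would tell me exactly which character my `.replace()` chain is still missing.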
Note: I just edited my question, as my prior example was inappropriate! Thanks for your answers so far!