
After scraping text from various websites, I want to normalize it in order to analyze it. One step is to replace multiple white spaces with a single white space.

I know this topic has been addressed frequently on Stack Overflow. However, using the common approaches, such as:

string = ' '.join(string.split())

or

string = re.sub(' +', ' ', string)

appears not to yield the expected results for every webpage. Please find below an extract of the code I use and an example of a SEC filing for which I cannot get rid of the multiple white spaces.
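As a side note, the two snippets above are not equivalent, which may explain inconsistent results: `str.split()` with no arguments splits on runs of *any* whitespace (tabs, newlines, etc.), while the regex `' +'` only matches runs of literal space characters. A minimal sketch:

```python
import re

s = "foo \t bar\n\nbaz"

# split() with no arguments splits on runs of ANY whitespace,
# so joining collapses tabs and newlines as well as spaces:
joined = ' '.join(s.split())   # -> 'foo bar baz'

# ' +' only matches runs of literal spaces, so tabs and
# newlines survive untouched:
subbed = re.sub(' +', ' ', s)  # -> 'foo \t bar\n\nbaz'
```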

import re
from selenium import webdriver

link = r"https://www.sec.gov/Archives/edgar/data/1800/000104746919001316/a2237648zdef14a.htm"
driver = webdriver.Chrome('./chromedriver')
driver.get(link)
x = driver.page_source

# Function to strip HTML tags
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

# Cleaning
x = str(x).replace('<', ' <')
x = cleanhtml(x)
x = (x.replace('<br>', ' ')
      .replace('&nbsp;', ' ')
      .replace('&amp;', '&')
      .replace('/\s\s+/g', ' ')  # note: a JavaScript regex pasted as a string literal; it never matches anything
      .replace('•', ' ')
      .replace('&lt', ' ')
      .replace('_', ' ')
      .replace('●', ' '))
x = ' '.join(x.split())

# Result still contains multiple white spaces :-(
print(x)

Note: I just edited my question, as my prior example was inappropriate! Thanks for your answers so far!

  • _white space_ != _spacebar_ ... \t or \n and others are also whitespaces. Which do you mean? Your regex e.g. only eliminates consecutive _spacebar_ thingies... – Patrick Artner Aug 07 '20 at 08:14
  • have you tried ```re.sub(r'\s+', ' ', string)``` (see comment by @PatrickArtner)? – mrxra Aug 07 '20 at 08:16
  • @PatrickArtner: Thanks. Basically I want to remove every whitespace that is redundant for reading, i.e. there may not be anything like "  " in my final string. – Michael Mü Aug 07 '20 at 08:17
  • Is there some other code that removes the &nbsp; markers and the HTML elements? What exactly is the input and the expected output? – Roy2012 Aug 07 '20 at 08:24
  • @mrxra: It still won't work... thanks though! – Michael Mü Aug 07 '20 at 08:52
  • @Roy2012: `replace('&nbsp;', ' ')` – I am intentionally converting the other HTML entities to spaces prior to removing multiple white spaces. – Michael Mü Aug 07 '20 at 08:52
  • @MichaelMü can you give an example of the _actual_ string value you are trying to clean? The example given contains HTML tags (e.g. comments). Also, if you are using an HTML parser, I suppose it would already handle `&nbsp;` for you... – mrxra Aug 07 '20 at 08:57
  • @mrxra: Thanks, I just answered below by providing a real example. – Michael Mü Aug 07 '20 at 10:08
  • @MichaelMü, what is the `cleanhtml()` function supposed to do, wiping out everything inside of the tags? – Don Foumare Aug 07 '20 at 10:37
  • @DonFoumare: Yes, that's what it's intended to do :-) – Michael Mü Aug 07 '20 at 10:40
  • @MichaelMü, just for clarification: you basically want only the text without any structure whatsoever? – Don Foumare Aug 07 '20 at 10:49
  • @DonFoumare: Indeed, it should be a "row" of words/numbers with a single white space as separator (of course things such as periods, commas, .. should be maintained). – Michael Mü Aug 07 '20 at 10:55
  • ...is there a particular reason you are not using an HTML parser (instead of regex)? – mrxra Aug 07 '20 at 11:51

3 Answers


Updated due to the changed problem description: you should use an HTML parser to handle tags and HTML entities. Once you retrieve the text, remove unwanted characters such as dashes, bullet points, underscores, and zero-width spaces, then collapse multiple whitespace characters:

import re
import bs4
from selenium import webdriver

link = r"https://www.sec.gov/Archives/edgar/data/1800/000104746919001316/a2237648zdef14a.htm"
driver = webdriver.Chrome('./chromedriver')
driver.get(link)
x = driver.page_source

soup = bs4.BeautifulSoup(x, 'html.parser')
text = soup.text

# you might also filter non-printable characters as explained here:
# https://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python
text = re.sub(r'[•●_—\u200B]+', ' ', text)

text = re.sub(r'\s+', ' ', text)
print(text)
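One detail worth noting: in Python 3, `\s` matches Unicode whitespace by default, which includes the non-breaking space (`U+00A0`) that an HTML parser produces when decoding `&nbsp;`. So the final `re.sub(r'\s+', ' ', text)` also collapses those, with no manual `&nbsp;` replacement needed:

```python
import re

# \xa0 is the non-breaking space an HTML parser decodes &nbsp; into;
# \s matches it in Python 3, so one substitution collapses everything:
text = "Total\xa0\xa0Revenue\n 2019"
print(re.sub(r'\s+', ' ', text))  # -> 'Total Revenue 2019'
```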
mrxra

I would try to do something like that:

clean = ' '.join([word.strip() for word in not_clean.strip().split()])

That way you not only split at spaces but at any whitespace, and every split word as well as the whole input gets stripped. (The explicit `strip()` calls are actually redundant, since `split()` with no arguments already discards leading and trailing whitespace, but they do no harm.)
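For example, given an input containing runs of spaces, tabs, and newlines:

```python
not_clean = "  multiple   spaces\tand\nnewlines  "

# split() breaks on any run of whitespace; join() reassembles
# the words with exactly one space between them
clean = ' '.join([word.strip() for word in not_clean.strip().split()])
print(clean)  # -> 'multiple spaces and newlines'
```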

Edit: since the OP edited their question, this answer no longer solves the problem.

Jenny
import re

sample = '''<font color="#952369" size="1"><b>


<!-- COMMAND= GRID_ADD,"background-color:#952369;" -->


 XXXXXXXXXXXXXXXXXXXXXXXXXXXXX&nbsp;&nbsp;</b></font>'''

def replace(match):
    # replacement callback: drop each whitespace run entirely
    return ''

sample = re.sub(r'\s+', replace, sample)

print(sample)
# Output:
# <fontcolor="#952369"size="1"><b><!--COMMAND=GRID_ADD,"background-color:#952369;"-->XXXXXXXXXXXXXXXXXXXXXXXXXXXXX&nbsp;&nbsp;</b></font>
Don Foumare
    .... ```print(re.sub('\s+', '', sample))``` generates the exact same output _without_ adding additional complexity by use of a function... – mrxra Aug 07 '20 at 10:10
  • @mrxra you are actually right, thanks for pointing that out! -.- I have to edit or delete my post anyways. – Don Foumare Aug 07 '20 at 10:16