
After scraping text from various websites, I want to normalize it in order to analyze it. One step is to replace multiple white spaces with a single white space.

I know this topic has been addressed frequently on Stack Overflow. However, using the common approaches, such as:

string = ' '.join(string.split())

or

string = re.sub(' +', ' ', string)

appears not to yield the expected results for every webpage. Please find below an extract of the code I use and an example of a SEC filing for which I cannot get rid of the multiple white spaces.
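As a side note, the two snippets above are not equivalent, which may explain inconsistent results: `str.split()` with no arguments splits on runs of *any* whitespace (tabs, newlines, etc.), while the regex `' +'` only matches runs of literal space characters. A minimal sketch:

```python
import re

s = "foo \t bar\n\nbaz"

# split() with no arguments splits on runs of ANY whitespace,
# so joining collapses tabs and newlines as well as spaces:
joined = ' '.join(s.split())   # -> 'foo bar baz'

# ' +' only matches runs of literal spaces, so tabs and
# newlines survive untouched:
subbed = re.sub(' +', ' ', s)  # -> 'foo \t bar\n\nbaz'
```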

import re
from selenium import webdriver

link = r"https://www.sec.gov/Archives/edgar/data/1800/000104746919001316/a2237648zdef14a.htm"
driver = webdriver.Chrome('./chromedriver')
driver.get(link)
x = driver.page_source

# Function to strip HTML tags
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

# Cleaning
x = str(x).replace('<', ' <')
x = cleanhtml(x)
x = (x.replace('<br>', ' ')
      .replace('&nbsp;', ' ')
      .replace('&amp;', '&')
      .replace('/\s\s+/g', ' ')  # note: a JavaScript regex pasted as a string literal; it never matches anything
      .replace('•', ' ')
      .replace('&lt', ' ')
      .replace('_', ' ')
      .replace('●', ' '))
x = ' '.join(x.split())

# Result still contains multiple white spaces :-(
print(x)

Note: I just edited my question, as my prior example was inappropriate! Thanks for your answers so far!

  • _white space_ != _spacebar_ ... \t or \n and others are also whitespaces. Which do you mean? Your regex e.g. only eliminates consecutive _spacebar_ thingies... – Patrick Artner Aug 07 '20 at 08:14
  • have you tried ```re.sub(r'\s+', ' ', string)``` (see comment by @PatrickArtner)? – mrxra Aug 07 '20 at 08:16
  • @PatrickArtner: Thanks. Basically I want to remove every whitespace that is redundant for reading, i.e. there may not be anything like "  " in my final string. – Michael Mü Aug 07 '20 at 08:17
  • Is there some other code that removes the &nbsp; markers and the HTML elements? What exactly is the input and the expected output? – Roy2012 Aug 07 '20 at 08:24
  • @mrxra: It still won't work... thanks though! – Michael Mü Aug 07 '20 at 08:52
  • @Roy2012: `replace('&nbsp;', ' ')` – I am intentionally converting the other HTML entities to spaces prior to removing multiple white spaces. – Michael Mü Aug 07 '20 at 08:52
  • @MichaelMü can you give an example of the _actual_ string value you are trying to clean? The example given contains HTML tags (e.g. comments). Also, if you are using an HTML parser, I suppose it would already handle `&nbsp;` for you... – mrxra Aug 07 '20 at 08:57
  • @mrxra: Thanks, I just answered below by providing a real example. – Michael Mü Aug 07 '20 at 10:08
  • @MichaelMü, what is the `cleanhtml()` function supposed to do, wiping out everything inside of the tags? – Don Foumare Aug 07 '20 at 10:37
  • @DonFoumare: Yes, that's what it's intended to do :-) – Michael Mü Aug 07 '20 at 10:40
  • @MichaelMü, just for clarification: you basically want only the text without any structure whatsoever? – Don Foumare Aug 07 '20 at 10:49
  • @DonFoumare: Indeed, it should be a "row" of words/numbers with a single white space as separator (of course things such as periods, commas, .. should be maintained). – Michael Mü Aug 07 '20 at 10:55
  • ...is there a particular reason you are not using an HTML parser (instead of regex)? – mrxra Aug 07 '20 at 11:51

3 Answers


Updated due to the changed problem description: you should use an HTML parser to handle tags and HTML entities. Once you retrieve the text, remove unwanted characters such as dashes, bullet points, underscores, and zero-width spaces, then collapse multiple whitespace characters:

import re
import bs4
from selenium import webdriver

link = r"https://www.sec.gov/Archives/edgar/data/1800/000104746919001316/a2237648zdef14a.htm"
driver = webdriver.Chrome('./chromedriver')
driver.get(link)
x = driver.page_source

soup = bs4.BeautifulSoup(x, 'html.parser')
text = soup.text

# you might also filter non-printable characters as explained here:
# https://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python
text = re.sub(r'[•●_—\u200B]+', ' ', text)

text = re.sub(r'\s+', ' ', text)
print(text)
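One detail worth noting: in Python 3, `\s` matches Unicode whitespace by default, which includes the non-breaking space (`U+00A0`) that an HTML parser produces when decoding `&nbsp;`. So the final `re.sub(r'\s+', ' ', text)` also collapses those, with no manual `&nbsp;` replacement needed:

```python
import re

# \xa0 is the non-breaking space an HTML parser decodes &nbsp; into;
# \s matches it in Python 3, so one substitution collapses everything:
text = "Total\xa0\xa0Revenue\n 2019"
print(re.sub(r'\s+', ' ', text))  # -> 'Total Revenue 2019'
```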
mrxra

I would try to do something like that:

clean = ' '.join([word.strip() for word in not_clean.strip().split()])

That way you not only split at spaces but at any whitespace, and every split word as well as the whole input gets stripped. (The explicit `strip()` calls are actually redundant, since `split()` with no arguments already discards leading and trailing whitespace, but they do no harm.)
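For example, given an input containing runs of spaces, tabs, and newlines:

```python
not_clean = "  multiple   spaces\tand\nnewlines  "

# split() breaks on any run of whitespace; join() reassembles
# the words with exactly one space between them
clean = ' '.join([word.strip() for word in not_clean.strip().split()])
print(clean)  # -> 'multiple spaces and newlines'
```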

Edit: since the OP edited their question, this answer no longer solves the problem.

Jenny
import re

sample = '''<font color="#952369" size="1"><b>


<!-- COMMAND= GRID_ADD,"background-color:#952369;" -->


 XXXXXXXXXXXXXXXXXXXXXXXXXXXXX&nbsp;&nbsp;</b></font>'''

def replace(match):
    # replacement callback: drop each whitespace run entirely
    return ''

sample = re.sub(r'\s+', replace, sample)

print(sample)
# Output:
# <fontcolor="#952369"size="1"><b><!--COMMAND=GRID_ADD,"background-color:#952369;"-->XXXXXXXXXXXXXXXXXXXXXXXXXXXXX&nbsp;&nbsp;</b></font>
Don Foumare
    .... ```print(re.sub('\s+', '', sample))``` generates the exact same output _without_ adding additional complexity by use of a function... – mrxra Aug 07 '20 at 10:10
  • @mrxra you are actually right, thanks for pointing that out! -.- I have to edit or delete my post anyways. – Don Foumare Aug 07 '20 at 10:16