14

I'm building a data extract using scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:

  Sapphire RX460 OC  2/4GB

Notice the two runs of two spaces: one preceding the string and one between OC and 2.

Python provides strip(), as described in How do I trim whitespace with Python?, but that won't handle the two spaces between OC and 2, which I need collapsed into a single space.

I've tried using normalize-space() from XPath while extracting data with my scrapy Selector, and that works, but the assignment is verbose, with strong rightward drift:

product_title = product.css('h3').xpath('normalize-space((text()))').extract_first()

Is there an elegant way to normalize whitespace using Python? If not a one-liner, is there a way I can break the above line into something easier to read without throwing an indentation error, e.g.

product_title = product.css('h3')
    .xpath('normalize-space((text()))')
    .extract_first()
vhs
  • In reply to your secondary question about how to format the Python code without throwing an indentation error: you can break a Python statement across multiple lines by using parentheses. – Christian Long Jul 28 '23 at 17:39
  • 1
    Here's an [example](https://stackoverflow.com/a/76790005/456550) of how to use parentheses to format Python code across several lines. – Christian Long Jul 28 '23 at 17:46

4 Answers

36

You can use:

" ".join(s.split())

where s is your string.
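As a minimal sketch of why this works, using the sample string from the question: `str.split()` with no arguments splits on any run of whitespace and ignores leading and trailing whitespace, so joining the pieces with a single space both trims and collapses. The function name `normalize_whitespace` is only illustrative.

```python
def normalize_whitespace(s: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends."""
    # split() with no arguments splits on any whitespace run (spaces, tabs,
    # newlines) and never yields empty strings for leading/trailing runs.
    return " ".join(s.split())


raw = "  Sapphire RX460 OC  2/4GB"
print(normalize_whitespace(raw))  # Sapphire RX460 OC 2/4GB
```

Note that this also collapses tabs and newlines, which is usually what you want for text scraped from HTML.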

Tom Karzes
  • 22,815
  • 2
  • 22
  • 41
4

Instead of using a regex for this, a more efficient solution is the join/split approach. Observe:

>>> import re, timeit
>>> timeit.Timer((lambda:' '.join(' Sapphire RX460 OC  2/4GB'.split()))).timeit()
0.7263979911804199

>>> def f():
...     return re.sub(" +", ' ', "  Sapphire RX460 OC  2/4GB").split()
...

>>> timeit.Timer(f).timeit()
4.163465976715088
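In the session above, the regex version also calls `.split()` on its result and so returns a list, which means the two timings do not cover quite the same work. A like-for-like comparison might look like the sketch below. The function names are hypothetical, the pattern is precompiled as a design choice, and no timings are shown because they depend on the machine and Python version.

```python
import re
import timeit

SAMPLE = "  Sapphire RX460 OC  2/4GB"
_SPACES = re.compile(r" +")  # precompiled once, outside the timed function


def with_split_join(s: str = SAMPLE) -> str:
    return " ".join(s.split())


def with_regex(s: str = SAMPLE) -> str:
    # The substitution leaves one leading space behind; strip() removes it.
    return _SPACES.sub(" ", s).strip()


# Both approaches should agree on the sample before timings are compared.
assert with_split_join() == with_regex()

if __name__ == "__main__":
    for fn in (with_split_join, with_regex):
        elapsed = timeit.timeit(fn, number=100_000)
        print(f"{fn.__name__}: {elapsed:.3f}s")
```

The two functions agree on this sample, but not in general: the regex here only matches literal spaces, while `split()` also treats tabs and newlines as separators.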
hd1
1

The accepted answer is the right way to normalize the whitespace. This is an answer to your secondary question about formatting.

You also asked about how to format Python code across multiple lines without throwing an indentation error. You can do that in Python using parentheses. Here's what the example code from your question would look like formatted across several lines for readability.

product_title = (
    product.css("h3")
    .xpath("normalize-space((text()))")
    .extract_first()
)

Note that these parentheses don't create a tuple because there is no comma. The outer parentheses are just for formatting purposes.

The multi-line code above is exactly equivalent to chaining all the method calls together on one line.

product_title = product.css("h3").xpath("normalize-space((text()))").extract_first()
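To try the same formatting without a scrapy response at hand, here is a small stand-alone sketch that chains plain `str` methods instead of selector calls. The variable names are illustrative only. It also shows the point about commas: parentheses alone only group, and a trailing comma is what creates a tuple.

```python
# A method chain wrapped in parentheses can span several lines.
title = (
    "  Sapphire RX460 OC  2/4GB"
    .strip()
    .upper()
)

# The same chain on a single line gives an identical result.
same_title = "  Sapphire RX460 OC  2/4GB".strip().upper()
assert title == same_title

not_a_tuple = ("h3")   # parentheses only group: this is a str
a_tuple = ("h3",)      # the trailing comma makes this a tuple

print(type(not_a_tuple).__name__, type(a_tuple).__name__)  # str tuple
```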
Christian Long
0

You can use a function like the one below, which uses a regular expression to find runs of two or more spaces and replace each with a single space:

import re

def clean_data(data):
    # Trim both ends, then collapse each run of two or more spaces into one.
    return re.sub(" {2,}", " ", data.strip())

product_title = clean_data(product.css('h3::text').extract_first())

You can then extend clean_data however you like.
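One possible extension, sketched under two assumptions: that tabs and newlines should be collapsed as well as spaces, and that a missing element should yield an empty string (`extract_first()` returns `None` when nothing matches, and `None` has no `strip()`). The name `clean_text` is hypothetical and not part of scrapy.

```python
import re
from typing import Optional

_WHITESPACE = re.compile(r"\s+")  # any run of spaces, tabs, or newlines


def clean_text(data: Optional[str]) -> str:
    """Collapse whitespace runs to one space; treat a missing value as empty."""
    if data is None:
        # Assumed policy: a selector that matched nothing becomes "".
        return ""
    return _WHITESPACE.sub(" ", data).strip()


print(clean_text("  Sapphire RX460 OC \t 2/4GB\n"))  # Sapphire RX460 OC 2/4GB
```

With scrapy this would be called as `clean_text(product.css('h3::text').extract_first())`.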

Tarun Lalwani