Extract substrings without any spaces from list of strings

Question

Say I have the following list:

l1 = ['apples', ' bananas' , '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  ']

What would be the best way of extracting each word and discarding the extra spaces?

The result I am after is:

l2 = ['apples', 'bananas', 'coconuts', 'dates', 'figs', 'guavas', 'lemons', 'mangoes']

What I've tried so far is:

    clean_l = []

    # Get rid of white spaces 
    for item in l1:
        clean = re.sub("(?m)^\s+", "", item)
        clean_l.append(clean)

but this returns the exact same thing as l1.

For one thing your regex explicitly only finds whitespace at the _start_ of a string. — jonrsharpe, Oct 25 '21 at 12:38
You might as well use [`" ".join(l1).split()`](https://ideone.com/Ak6Rp8). — Wiktor Stribiżew, Oct 25 '21 at 12:41
The easiest with regex might be `re.findall`: `[w for string in l1 for w in re.findall("\w+", string)]` — user2390182, Oct 25 '21 at 12:44
@DaniMesejo Output: `['apples', 'bananas', 'coconuts', 'dates', 'figs', 'guavas', 'lemons', 'mangoes']` — Wiktor Stribiżew, Oct 25 '21 at 12:52

Dani Mesejo · Accepted Answer · 2021-10-25T13:00:39.073

Use:

l1 = ['apples', ' bananas' , '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  ']
res = [ei for e in l1 for ei in e.strip().split()]
print(res)

Output

['apples', 'bananas', 'coconuts', 'dates', 'figs', 'guavas', 'lemons', 'mangoes']
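
As a side note, `str.split()` called with no arguments already ignores leading and trailing whitespace and treats any run of whitespace as a single separator, so the `strip()` call above is harmless but not strictly required. A minimal sketch of the same idea (the helper name `flatten_words` is made up for illustration):

```python
def flatten_words(strings):
    # split() with no arguments splits on runs of whitespace and
    # never yields empty strings, so no explicit strip() is needed.
    return [word for s in strings for word in s.split()]


l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  ']
print(flatten_words(l1))
```

Strings that are empty or contain only whitespace simply contribute nothing to the result.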

If you insist on using a regular expression, although I don't recommend it for this particular problem, use:

import re

l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  ']
res = [ei for e in l1 for ei in re.findall(r"\w+", e)]
print(res)

Output

['apples', 'bananas', 'coconuts', 'dates', 'figs', 'guavas', 'lemons', 'mangoes']
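
One caveat worth keeping in mind: `\w+` only matches letters, digits and underscores, so it agrees with `split()` on the fruit names above but not on tokens that contain punctuation. A small sketch with made-up sample strings (not from the question) shows the difference, and that `\S+` behaves like `split()`:

```python
import re

# Hypothetical inputs, chosen only to show the difference.
samples = ['  ice-cream cake', "don't  stop "]

# \w+ matches runs of word characters, so punctuation inside a token splits it.
by_word_chars = [w for s in samples for w in re.findall(r"\w+", s)]

# \S+ matches runs of non-whitespace, which mirrors what str.split() returns.
by_non_space = [w for s in samples for w in re.findall(r"\S+", s)]

print(by_word_chars)  # ['ice', 'cream', 'cake', 'don', 't', 'stop']
print(by_non_space)   # ['ice-cream', 'cake', "don't", 'stop']
```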

A third alternative (by @WiktorStribiżew) is to use:

res = " ".join(l1).split()
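
The separator in the `join` call matters here. The sketch below is only an illustration of why a space is used: it guarantees that neighbouring items stay apart even when neither of them carries whitespace at the boundary.

```python
l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  ']

# Joining with a space puts a separator between every pair of items;
# the single split() then discards all runs of whitespace in one pass.
res = " ".join(l1).split()

# Joining with an empty string fuses items that have no whitespace
# at the boundary between them.
fused = "".join(['apples', 'bananas']).split()

print(res)
print(fused)  # ['applesbananas']
```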

Timings

l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  '] * 1000
import re
%timeit [ei for e in l1 for ei in e.strip().split()]
1.76 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit " ".join(l1).split()
453 µs ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [ei for e in l1 for ei in re.findall(r"\w+", e)]
7.77 ms ± 59.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
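
The `%timeit` lines above need IPython. For a plain Python interpreter, a rough equivalent using the standard `timeit` module might look like the sketch below. The repeat counts are arbitrary choices, and absolute numbers will differ by machine and Python version.

```python
import re
import timeit

l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  '] * 1000

candidates = {
    "strip/split comprehension": lambda: [ei for e in l1 for ei in e.strip().split()],
    "join/split": lambda: " ".join(l1).split(),
    "re.findall": lambda: [ei for e in l1 for ei in re.findall(r"\w+", e)],
}

# Sanity check: all three approaches should produce the same list here.
results = {name: fn() for name, fn in candidates.items()}
assert len({tuple(r) for r in results.values()}) == 1

for name, fn in candidates.items():
    # Best of 3 repeats, 20 calls each; report the time of a single call.
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{name}: {best * 1e6:.0f} µs per call")
```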

@jonrsharpe already commented in the question, but I try to avoid regex when possible — Dani Mesejo, Oct 25 '21 at 12:40
