Say I have the following list:

    l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  ']

What would be the best way of extracting each word and discarding the extra spaces?

The result I am after is:

    l2 = ['apples', 'bananas', 'coconuts', 'dates', 'figs', 'guavas', 'lemons', 'mangoes']

What I've tried so far is:

    import re

    clean_l = []

    # Strip leading whitespace from each item
    for item in l1:
        clean = re.sub(r"(?m)^\s+", "", item)
        clean_l.append(clean)

but this returns the exact same thing as l1.

Wiktor Stribiżew
Zizzipupp

1 Answer

Use:

    l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  ']
    res = [ei for e in l1 for ei in e.strip().split()]
    print(res)

Output

    ['apples', 'bananas', 'coconuts', 'dates', 'figs', 'guavas', 'lemons', 'mangoes']
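
A side note on the snippet above: `str.split()` called with no arguments already ignores leading and trailing whitespace and treats any run of whitespace as one separator, so the `.strip()` call is harmless but not strictly needed. A minimal sketch to check this (the names `res_no_strip` and `res_with_strip` are only for illustration):

```python
l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  ']

# split() with no separator drops leading/trailing whitespace on its own
res_no_strip = [word for item in l1 for word in item.split()]
res_with_strip = [word for item in l1 for word in item.strip().split()]

print(res_no_strip == res_with_strip)  # True
```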

If you insist on using a regular expression (which I don't recommend for this particular problem), use:

    import re

    l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  ']
    res = [ei for e in l1 for ei in re.findall(r"\w+", e)]
    print(res)

Output

    ['apples', 'bananas', 'coconuts', 'dates', 'figs', 'guavas', 'lemons', 'mangoes']
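
One caveat with `\w+`, in case the real data is messier than the sample: `\w` matches only letters, digits and underscores, so an item containing a hyphen or an apostrophe gets cut at that character. If the aim is just to split on whitespace, `\S+` behaves like `str.split()`. The list `messy` below is made up for illustration and is not from the question:

```python
import re

# Hypothetical input with punctuation inside words
messy = [' passion-fruit ', "  kiwi's  lime"]

# \w+ stops at any non-word character, including '-' and "'"
word_chars = [w for item in messy for w in re.findall(r"\w+", item)]

# \S+ matches runs of non-whitespace, so punctuation stays attached
non_space = [w for item in messy for w in re.findall(r"\S+", item)]

print(word_chars)  # ['passion', 'fruit', 'kiwi', 's', 'lime']
print(non_space)   # ['passion-fruit', "kiwi's", 'lime']
```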

A third alternative (by @WiktorStribiżew) is to use:

    res = " ".join(l1).split()

Timings

    l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  '] * 1000
    import re
    %timeit [ei for e in l1 for ei in e.strip().split()]
    1.76 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit " ".join(l1).split()
    453 µs ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit [ei for e in l1 for ei in re.findall(r"\w+", e)]
    7.77 ms ± 59.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
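
The `%timeit` lines need IPython or Jupyter. To reproduce the comparison in a plain Python script, something like the following sketch should work with the standard `timeit` module. The helper names and the `number=100` repetition count are my own choices, and the absolute numbers will differ from machine to machine:

```python
import re
import timeit

l1 = ['apples', ' bananas', '  coconuts', '   dates figs guavas', 'lemons ', 'mangoes  '] * 1000

def with_split(items):
    # strip/split each item and flatten the result
    return [word for item in items for word in item.strip().split()]

def with_join(items):
    # join everything into one string, then split once
    return " ".join(items).split()

def with_regex(items):
    # collect runs of word characters from each item
    return [word for item in items for word in re.findall(r"\w+", item)]

# All three approaches should agree on this input before timing them
assert with_split(l1) == with_join(l1) == with_regex(l1)

for func in (with_split, with_join, with_regex):
    seconds = timeit.timeit(lambda: func(l1), number=100)
    print(f"{func.__name__}: {seconds / 100 * 1e3:.3f} ms per call")
```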
Dani Mesejo