Python: how to get correct list when split?

Question

In test.txt, I have two sentences, one per line:

The heart was made to be broken.
There is no surprise more magical than the surprise of being loved.

The code:

import re
file = open('/test.txt','r')#specify file to open
data = file.readlines()
file.close()
for line in data:
    line_split = re.split(r'[ \t\n\r, ]+',line)
    print line_split

Output from the code:

['The', 'heart', 'was', 'made', 'to', 'be', 'broken.', '']
['There', 'is', 'no', 'surprise', 'more', 'magical', 'than', 'the', 'surprise', 'of', 'being', 'loved.']

How can I get only the words printed, without the trailing empty string (see the first sentence)? Expected result:

['The', 'heart', 'was', 'made', 'to', 'be', 'broken.']
['There', 'is', 'no', 'surprise', 'more', 'magical', 'than', 'the', 'surprise', 'of', 'being', 'loved.']

Any advice?

It would be more productive if you told us what you're actually trying to do, instead of posting nearly identical questions. Also, `[ \t\n\r, ]` doesn't make sense, you want `[\s,]`. — georg, May 05 '12 at 21:22

Mark Byers · Accepted Answer · 2012-05-05T21:03:36.130

3

Instead of using split to match the delimiters, you can use findall with a negated character class to match the parts you want to keep:

line_split = re.findall(r'[^ \t\n\r., ]+',line)
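A minimal sketch comparing the two approaches on the first line from the question. Note one assumption: the character class here leaves out the period, so that `broken.` keeps its punctuation as in the asker's expected result. The pattern posted above also excludes `.` and would drop it.

```python
import re

line = "The heart was made to be broken.\n"

# Splitting on delimiters leaves an empty string when the line ends with one.
split_result = re.split(r'[ \t\n\r,]+', line)
print(split_result)

# Matching runs of non-delimiter characters never produces empty strings.
words = re.findall(r'[^ \t\n\r,]+', line)
print(words)
```

The trailing newline counts as a delimiter, which is why split reports an empty field after it while findall simply finds nothing left to match.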

See it working online: ideone

edited May 05 '12 at 21:03

answered May 05 '12 at 20:58

Mark Byers


OMG! It's my fault... I forgot the ^... I am so sorry. – ThanaDaray May 05 '12 at 21:03
@ThanaDaray: Don't worry. It's probably my fault for not making the answer clear enough. Sorry! – Mark Byers May 05 '12 at 21:09
+1, this is more elegant than my answer, but do take a look at mine for the other comments I made. – Gareth Latty May 05 '12 at 21:11

Gareth Latty · Answer 2 · 2012-05-05T21:22:57.087

1

To fix, with a few other changes, explained further on:

import re

with open("test.txt", "r") as file:
    for line in file:
        # list() is needed on Python 3, where filter() returns a lazy iterator
        line_split = list(filter(bool, re.split(r'[ \t\n\r, ]+', line)))
        print(line_split)

Here we use a filter() to remove any empty strings from the result.
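As a rough illustration of what the filter step does, here is a small sketch on a made-up line that ends with a comma (the sample string and variable names are assumptions for the example, not part of the question):

```python
import re

line = "a, b,\n"  # hypothetical input ending in delimiters

parts = re.split(r'[ \t\n\r,]+', line)
print(parts)  # the last element is an empty string

# bool('') is False, so filter(bool, ...) keeps only non-empty strings.
cleaned = list(filter(bool, parts))

# An equivalent list comprehension, if you prefer that style.
same = [p for p in parts if p]
print(cleaned, same)
```

Both forms give the same list; which one reads better is largely a matter of taste.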

Note my use of the with statement to open the file. This is more readable and handles closing the file for you, even on exceptions.

We also loop directly over the file - this is a better idea as it doesn't load the entire file into memory at once, which is not needed and could cause problems with big files.
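A self-contained sketch of looping over the file object directly. It writes the two sample sentences to a temporary file as a stand-in for test.txt (the temporary file and the `result` list are assumptions made so the example can run anywhere):

```python
import os
import re
import tempfile

# Create a stand-in for test.txt containing the two sample lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("The heart was made to be broken.\n")
    tmp.write("There is no surprise more magical than the surprise of being loved.\n")
    path = tmp.name

result = []
with open(path, "r") as f:
    for line in f:  # the file object yields one line at a time
        result.append([w for w in re.split(r'[\s,]+', line) if w])

os.remove(path)  # clean up the temporary file
print(result)
```

Only the current line is held while it is being split, although `result` here still accumulates every line's words for display.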

edited May 05 '12 at 21:22

answered May 05 '12 at 20:59

Gareth Latty



Except that this does not split on, and drop, commas. – May 05 '12 at 21:00
@delnan Ugh, should have read that more closely on my part, still, an easy fix. – Gareth Latty May 05 '12 at 21:01
Unfortunately not. `str.split()` skips over multiple consecutive whitespace characters, but if you give it explicit separators, consecutive separators result in empty strings in the output (i.e. `"foo, bar"` would be split into `['foo', '', 'bar']`). Yes, that's ugly. – May 05 '12 at 21:05
@delnan You know what, I was sure that was not the way it functions, apparently my memory is poor, I'll change it to a regex solution. Edit: Updated, and this time I tested it - definitely works. – Gareth Latty May 05 '12 at 21:07
This can still give empty "words" if the line ends with a comma: http://ideone.com/rci8S – Mark Byers May 05 '12 at 21:08
@MarkByers Of course it does. I'm doing well here - although I'm going to blame the asker a little for providing a test case that doesn't cover the wanted functionality. – Gareth Latty May 05 '12 at 21:09
Yes, I also forget it all the time. What's worse, when looking it up to be sure, I remembered `str.split` only accepts a single separator string, not a variety of characters which may each split. Meaning, your original code would only split on the 5-char sequence `\t\n\t, `. And they say Python makes string handling easy... well, easier than C I suppose. – May 05 '12 at 21:09
@delnan Yeah, I noticed that as soon as I tested it. As I say, I was sure I remember that being the functionality. Maybe that's PHP or something from the past, I don't know. To be fair, splitting on multiple character strings is probably something I use more than on a selection of characters. Anyway, this version should work. Probably not the most elegant, but I'll leave it up for the other advice. – Gareth Latty May 05 '12 at 21:11
+1 for fixing it (after a few attempts) and some other good pieces of advice. PS: you can also [remove empty strings](http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings) slightly more concisely using filter, though I'm not 100% convinced that it's better to use this trick rather than the more simple list comprehension you've used. – Mark Byers May 05 '12 at 21:14
@MarkByers Yeah, I've never been a functional programming man, so `map()` and `filter()` never really appeal to me. In this situation, you would need a `lambda` which pretty much kills any performance or readability gains, in my opinion. It's definitely an option here though, there isn't much in it, as `[x for x in ...]` is always a bit ugly. – Gareth Latty May 05 '12 at 21:17
@Lattyware: Actually if you read the top voted answer on the [question I linked to](http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings), you can see that you don't actually need a lambda because you can just use the trick `filter(None, x)`. – Mark Byers May 05 '12 at 21:20
@MarkByers Oh, I didn't see that link. That is considerably nicer, although I feel `filter(bool, xs)` is more obvious (and I should have thought of it earlier, not doing well on this one at all). – Gareth Latty May 05 '12 at 21:21

score 1 · Answer 3 · answered May 05 '12 at 21:38

1

words = re.compile(r"[\w']+").findall(yourString)

Demo

>>> import re
>>> yourString = "Mary's lamb was white as snow."
>>> re.compile(r"[\w']+").findall(yourString)
["Mary's", 'lamb', 'was', 'white', 'as', 'snow']

If you really do want periods, you can add those as [\w'\.]
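A short sketch of both variants side by side, using the demo sentence above (the variable names are just for illustration). Inside a character class the period does not need a backslash, so `[\w'.]` behaves the same as `[\w'\.]`:

```python
import re

text = "Mary's lamb was white as snow."

# Word characters and apostrophes only: the trailing period is dropped.
without_periods = re.findall(r"[\w']+", text)

# Adding '.' to the class keeps periods attached to their words.
with_periods = re.findall(r"[\w'.]+", text)

print(without_periods)
print(with_periods)
```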

answered May 05 '12 at 21:38

ninjagecko


score 0 · Answer 4 · answered May 06 '12 at 04:50

In [2]: with open('test.txt','r') as f:
   ...:     lines = f.readlines()
   ...:

In [3]: words = [l.split() for l in lines]

In [4]: words
Out[4]:
[['The', 'heart', 'was', 'made', 'to', 'be', 'broken.'],
 ['There',
  'is',
  'no',
  'surprise',
  'more',
  'magical',
  'than',
  'the',
  'surprise',
  'of',
  'being',
  'loved.']]
