How to split by newline and ignore blank lines using regex?

Question

Let's say I have this data:

data = '''a, b, c
d, e, f
g. h, i
  
j, k , l


'''

The 4th line contains a single space; the 6th and 7th lines contain no spaces at all, they are just blank lines.

Now when I split it using splitlines

data.splitlines()

I get

['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '', '']

However, I expected just

['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']

Is there a simple solution using regular expressions to do this?

Please note that I know the other way of doing this: filtering empty strings from the output of splitlines().

I am not sure whether the same can be achieved using regex.

When I use regex to split on new line, it gives me

import re
re.split("\n", data)

Output :

['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '', '', '']

Does this answer your question? [Remove empty strings from a list of strings](https://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings) — Pranav Hosangadi, Jun 17 '22 at 05:36
@PranavHosangadi as I said, I don't want to filter later, can you suggest me a regex kind of solution? — , Jun 17 '22 at 05:38
Why are you so insistent on using *re* when you've been offered a perfectly good answer that requires no additional imports? — DarkKnight, Jun 17 '22 at 05:44

crissal · Answer 1 · 2022-06-17T05:44:31.707

1

List comprehension approach

You can keep only the lines that are not empty or whitespace-only by adding a condition to a list comprehension.

A line that is still truthy after stripping whitespace contains at least one non-whitespace character, so it is added to the list.

filtered_data = [el for el in data.splitlines() if el.strip()]
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
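
A minimal, self-contained sketch of the same filter, for illustration. Note that `strip()` is only used as the test, so the lines that are kept retain their own leading and trailing whitespace. The extra indented line in `sample` is an assumption added here and is not part of the question's data.

```python
# The question's data plus one indented line (added for illustration only).
sample = "a, b, c\nd, e, f\ng. h, i\n \nj, k , l\n\n\n  m, n, o\n"

# Keep a line only if something is left after stripping whitespace.
# The line itself is stored unstripped, so its indentation survives.
filtered = [line for line in sample.splitlines() if line.strip()]

print(filtered)
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l', '  m, n, o']
```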

Regexp approach

import re
p = re.compile(r"^([^\s]+.+)", re.M)
p.findall(data)
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
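
One caveat worth checking before relying on this pattern: it requires each line to start with a non-whitespace character and to contain at least one more character after that. The sketch below uses a made-up `sample` string (an assumption, not the question's data) to show that indented lines and one-character lines are silently dropped.

```python
import re

# Same pattern as above; [^\s] is equivalent to \S.
p = re.compile(r"^([^\s]+.+)", re.M)

# Hypothetical input: an indented line and a single-character line.
sample = "a, b, c\n  indented\nx\n\nlast line\n"

matches = p.findall(sample)
print(matches)
# ['a, b, c', 'last line']
```

If lines like these can occur in the input, the list comprehension is the safer choice.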

edited Jun 17 '22 at 05:44

answered Jun 17 '22 at 05:35


As I said, I know I can filter later and I don't wanna do that. I was looking more of a simple approach – Jun 17 '22 at 05:36
Simpler than this? – crissal Jun 17 '22 at 05:37
It’s not simpler to use regex. Filtering is a one-liner. – Michael Dorner Jun 17 '22 at 05:38
okay, agreed that is simple, but can someone suggest me a regex solution as well, cuz I don't want to go over my elements again in the loop – Jun 17 '22 at 05:39
@GBDGBDA What do you mean by "go over my elements **again**"? This answer only iterates over the elements once – DarkKnight Jun 17 '22 at 05:41
@GBDGBDA I don't know why the regexp approach, but this should be fine. – crissal Jun 17 '22 at 05:45
May not matter, but the regex will delete lines starting with whitespace. – Mark Jun 17 '22 at 05:47
@Mark yes its deleting those lines, it does matter – Jun 17 '22 at 05:49
That's why the regex should not be used in the first place – crissal Jun 17 '22 at 05:52
@GBDGBDA if the line begins (or ends) with white space, do you want to preserve that whitespace in the results? – Mark Jun 17 '22 at 06:03
I don't have problem with those lines that actually contains some character other than whitespace, so yes – Jun 17 '22 at 06:23

Pranav Hosangadi · Accepted Answer · 2022-06-17T12:48:26.577

1

I disagree with your assessment that filtering is more complicated than using regular expressions. However, if you really want to use regex, you could split at multiple consecutive newlines like so:

>>> re.split(r"\n+", data)
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l', '']

Unfortunately, this leaves an empty string at the end of your list. To get around this, use re.findall to find everything that isn't a newline:

>>> re.findall(r"([^\n]+)", data)
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']

Neither of the patterns above handles lines that contain only spaces or tabs (such a line would be kept as an item), so here's one that does:

>>> re.findall(r"^([ \t]*\S.*)$", data, re.MULTILINE)
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']

Here's the explanation:

^([ \t]*\S.*)$
^            $   : Start of line and end of line
 (          )    : Capturing group
  [ \t]*         : Zero or more spaces or tabs (i.e. whitespace that isn't a newline)
        \S       : One non-whitespace character
          .*     : Zero or more of any character
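
Putting the explanation together, here is a small runnable sketch. The helper name `non_blank_lines` and the second test string are assumptions made for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer: optional spaces/tabs, then at least one
# non-whitespace character, then the rest of the line.
NON_BLANK_LINE = re.compile(r"^([ \t]*\S.*)$", re.MULTILINE)

def non_blank_lines(text):
    """Return every line holding at least one non-whitespace character
    (hypothetical helper, not part of the original answer)."""
    return NON_BLANK_LINE.findall(text)

# The question's data: line 4 is a single space, lines 6 and 7 are empty.
data = "a, b, c\nd, e, f\ng. h, i\n \nj, k , l\n\n\n"
print(non_blank_lines(data))
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']

# Indented lines and one-character lines are kept; a tab-only line is not.
print(non_blank_lines("  indented\n\t\nx\n"))
# ['  indented', 'x']
```

Compiling the pattern once is a design choice that only matters if the helper is called many times; `re.findall` with the same arguments behaves the same way.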

edited Jun 17 '22 at 12:48

answered Jun 17 '22 at 05:41


Yes I need something like your second approach, but your second one gives me `['a, b, c', 'd,e,f', 'g. h, i', ' ', 'j, k , l']` – Jun 17 '22 at 05:43
I think the OP's data has spaces in the empty lines. – Mark Jun 17 '22 at 05:43
@Mark yeah, somehow Spyder removed the extra spaces and I thought that regex worked. I've edited my answer. – Pranav Hosangadi Jun 17 '22 at 08:35
@GBDGBDA See my edits. – Pranav Hosangadi Jun 17 '22 at 08:35
