I am reading lines from an input file and splitting each line into a list. However, I ran into the following behaviour, which confused me.

This is my code:

import re

with open("filename") as in_file:
    for line in in_file:
        print(re.split(r'([\s,:()\[\]=|/\\{}\'\"<>]+)', line))

This is a sample of my input file:

PREREQUISITES

    CUDA 7.0 and a GPU of compute capability 3.0 or higher are required.


    Extract the cuDNN archive to a directory of your choice, referred to below as <installpath>.
    Then follow the platform-specific instructions as follows.

And this is the output I got:

['PREREQUISITES', '\n', '']
['', '\n', '']
['', '    ', 'CUDA', ' ', '7.0', ' ', 'and', ' ', 'a', ' ', 'GPU', ' ', 'of', ' ', 'compute', ' ', 'capability', ' ', '3.0', ' ', 'or', ' ', 'higher', ' ', 'are', ' ', 'required.', '\n', '']
['', '\n', '']
['', '\n', '']
['', '    ', 'Extract', ' ', 'the', ' ', 'cuDNN', ' ', 'archive', ' ', 'to', ' ', 'a', ' ', 'directory', ' ', 'of', ' ', 'your', ' ', 'choice', ', ', 'referred', ' ', 'to', ' ', 'below', ' ', 'as', ' <', 'installpath', '>', '.', '\n', '']
['', '    ', 'Then', ' ', 'follow', ' ', 'the', ' ', 'platform-specific', ' ', 'instructions', ' ', 'as', ' ', 'follows.', '\n', '']

My questions are:

Q1: At the end of each line, besides the character \n, there is another empty element ''. What is that?

Q2: Apart from the first line, all the other lines start with this empty element ''. Why is that?

Edit:

Added question Q3: I want the delimiters such as ' ' and '\n' kept in the results, but not this empty element ''. Is there any way to do this?

Answer to question Q1-2: here.

Answer to question Q3: here.

fluency03
  • [*If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string*](https://docs.python.org/2/library/re.html#re.split) – Wiktor Stribiżew Mar 16 '16 at 10:32
  • But why does the first line not have such an empty string? – fluency03 Mar 16 '16 at 10:33
  • `\s` matches `\n`, and splitting with `(\s)` will get you `'', '\n', ''`, which is normal – YOU Mar 16 '16 at 10:34
  • @YOU But why does the first line not start with such an empty string? – fluency03 Mar 16 '16 at 10:40
  • Because the first line doesn't start with \n? – YOU Mar 16 '16 at 10:49
  • The blank '' on the second line and onwards is because of the indentation, which is matched by \s too – YOU Mar 16 '16 at 10:55
  • @YOU I want the delimiters such as ' ' and '\n' kept in the results, but not this empty element ''. Is there any way to do this? – fluency03 Mar 16 '16 at 13:17
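
To make the rule quoted from the docs concrete, here is a small sketch that runs the question's pattern on three strings: one ending in a delimiter, one starting and ending in a delimiter, and one with no delimiter at either edge. The names `PATTERN`, `first`, `indented` and `plain` are just illustrative.

```python
import re

# The separator pattern from the question; the capturing group is what
# makes re.split keep the delimiters in the result.
PATTERN = r'([\s,:()\[\]=|/\\{}\'\"<>]+)'

# No delimiter at the start, '\n' at the end: only a trailing ''.
first = re.split(PATTERN, 'PREREQUISITES\n')

# Indentation at the start and '\n' at the end: '' on both sides.
indented = re.split(PATTERN, '    Then follow\n')

# No delimiter at either edge: no empty strings at all.
plain = re.split(PATTERN, 'Then follow')

print(first)     # ['PREREQUISITES', '\n', '']
print(indented)  # ['', '    ', 'Then', ' ', 'follow', '\n', '']
print(plain)     # ['Then', ' ', 'follow']
```

So the empty string is simply the (empty) text before a leading delimiter or after a trailing one. The first line of the file has no leading '' because it starts with a word rather than with whitespace.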

1 Answer

The empty string indicates that '\n' was matched as the last character in the line and there is no more data after it. That is:

>>> re.split(r'([\s]+)', 'hello world\n')
['hello', ' ', 'world', '\n', '']

Should produce a different result than:

>>> re.split(r'([\s]+)', 'hello world')
['hello', ' ', 'world']

You can either strip the line before splitting it (note that strip() removes leading whitespace as well, so any indentation is lost):

>>> re.split(r'([\s]+)', 'hello world\n'.strip())
['hello', ' ', 'world']

Or invert the regex and use findall instead. findall works differently in that it returns only the matching text, not the sequences between the matches.

>>> re.findall(r'([^\s]+)', 'hello world\n')
['hello', 'world']
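
If the delimiters should be kept as well, one possible variation on the findall idea is to match either a run of delimiter characters or a run of non-delimiter characters. This is only a sketch: it reuses the character class from the question, and `DELIMS`, `TOKEN_OR_DELIM` and `tokenize` are names made up for the example.

```python
import re

# Contents of the character class used in the question's pattern.
DELIMS = r'\s,:()\[\]=|/\\{}\'\"<>'

# Match a run of delimiters OR a run of non-delimiters. There is no
# capturing group, so findall returns the whole of each match.
TOKEN_OR_DELIM = re.compile('[' + DELIMS + ']+|[^' + DELIMS + ']+')

def tokenize(line):
    """Return words and delimiter runs in order, with no empty strings."""
    return TOKEN_OR_DELIM.findall(line)

tokens = tokenize('    as <installpath>.\n')
print(tokens)  # ['    ', 'as', ' <', 'installpath', '>', '.', '\n']
```

Because findall only reports non-empty matches here (both alternatives use `+`), no '' can appear at either end of the list.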
Dunes
  • I want the matching text (the delimiters in the `split`) such as `' '` and `'\n'` but not this empty element `''`. Is there any way to do this? – fluency03 Mar 16 '16 at 11:56
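
One simple way to get what this comment asks for, as a sketch rather than the only option, is to keep the original `re.split` call and filter out the empty strings afterwards. The helper name `split_keep_delims` is hypothetical.

```python
import re

# The separator pattern from the question.
PATTERN = r'([\s,:()\[\]=|/\\{}\'\"<>]+)'

def split_keep_delims(line):
    """Split on the pattern, keep the delimiters, drop empty strings."""
    # Only '' is falsy; ' ' and '\n' are non-empty and therefore kept.
    return [piece for piece in re.split(PATTERN, line) if piece]

pieces = split_keep_delims('    CUDA 7.0\n')
print(pieces)  # ['    ', 'CUDA', ' ', '7.0', '\n']
```

A blank line such as '\n' then yields just ['\n'] instead of ['', '\n', ''].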