I have a word list like this:

river
mississippi
water
spain
cairo


hellow
kind
words
sentences

The words are separated by varying numbers of '\n' characters.

What I want is to put words separated by a single '\n' into the same inner list, and to start a new inner list wherever words are separated by more than one '\n' (2, 3, or more), like this:

[['river', 'mississippi', 'water', 'spain', 'cairo'], ['hellow','kind','words','sentences']]

I tried

infile=open(test_sets_file,'r')
readed=infile.readlines()
newlist=[]
new_nestedlist=[]
for i in range(len(readed)):
    if readed[i]!='\n':
        new_nestedlist.append(readed[i].strip('\n'))
    else:
        newlist.append(new_nestedlist)
        new_nestedlist=[]
return newlist

It doesn't work. My code prints nothing when the input text is:

river
mississippi
water
spain
cairo

I know this is because I reset the list to an empty one whenever I meet a '\n'.

I also found another question, "Creating nested list from string data with two delimiters in Python", about creating a nested list using different separators, but it does not solve my problem.

Yiling Liu

3 Answers

You can first split on runs of multiple \n with a regex (\n\n+ matches 2 or more \n in a row), and then split each part on a single \n.

By the way, it's preferable to use with when working with files (it closes the file properly and handles the context management):

import re

def transform(data):
    # Split into groups on runs of 2+ newlines, then split each group into words
    return [x.split('\n') for x in re.split(r'\n\n+', data)]

# First example
data = 'river\nmississippi\nwater\nspain\ncairo\n\n\nhellow\nkind\nwords\nsentences'
print(transform(data))  # [['river', 'mississippi', 'water', 'spain', 'cairo'], ['hellow', 'kind', 'words', 'sentences']]

# Second example
data = 'river\nmississippi\nwater\nspain\ncairo'
print(transform(data))  # [['river', 'mississippi', 'water', 'spain', 'cairo']]

# With a real file, read the entire file into a single string first.
# strip() drops the trailing newline so no empty string ends up in the last group.
with open(test_sets_file, 'r') as infile:
    print(transform(infile.read().strip()))
Maor Refaeli

You can first split on multiple occurrences of \n using a regular expression. Assuming your input is in the variable string, we can do the following:

import re
first_split = re.compile('\n\n+').split(string)

Then you can further divide each individual string based on a single \n

second_split = [x.split('\n') for x in first_split]

This yields

[['river', 'mississippi', 'water', 'spain', 'cairo'], ['hellow', 'kind', 'words', 'sentences']]
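
One caveat with this two-step split: if the text ends with a newline (as most files do), the last inner list picks up an empty string. Below is a minimal sketch of one way around that, assuming it is acceptable to strip leading and trailing whitespace first. The helper name split_groups is hypothetical and not part of the code above.

```python
import re

def split_groups(text):
    # Assumption: leading/trailing whitespace carries no meaning, so strip it
    # to avoid an empty string at the end of the last group.
    text = text.strip()
    if not text:
        return []
    # Runs of 2+ newlines separate groups; a single newline separates words.
    return [part.split('\n') for part in re.split(r'\n\n+', text)]

print(split_groups('river\ncairo\n\n\n\nhellow\nkind\n'))
# [['river', 'cairo'], ['hellow', 'kind']]
```

Note that this sketch still treats a line containing only spaces as a word; if that matters, the line-by-line approach with rstrip is probably the safer choice.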
Thijs van Ede

You can do this with the str.splitlines method. We also use str.rstrip to clean up any blank spaces (or tabs) at the ends of lines. We don't have to worry about newlines, since .splitlines takes care of those.

The idea is that if there are any blank lines or lines just containing whitespace, they will get converted to empty strings by the combined action of .splitlines & .rstrip. So when we encounter an empty row, if we have data in the inner buffer we append it to the nested output buffer, and create a new empty inner buffer. Otherwise, we just append the current row to the inner buffer. When we get to the end of the data we also need to save any data from inner to nested.

data = '''\
river
mississippi
water
spain
cairo


hellow
kind
words
sentences
'''

nested = []
inner = []
for row in data.splitlines():
    # Remove any trailing whitespace
    row = row.rstrip()
    if row:
        inner.append(row)
    elif inner:
        nested.append(inner)
        inner = []
if inner:
    nested.append(inner)

print(nested)

output

[['river', 'mississippi', 'water', 'spain', 'cairo'], ['hellow', 'kind', 'words', 'sentences']]

Note that it's easy to adapt this code to read line by line directly from a file. You don't need to read the whole file into a list before you start working on it. For example:

nested = []
inner = []
with open("test_sets_file") as data:
    for row in data:
        # Remove any trailing whitespace, including newline
        row = row.rstrip()
        if row:
            inner.append(row)
        elif inner:
            nested.append(inner)
            inner = []
    if inner:
        nested.append(inner)

print(nested)
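
The same grouping can also be sketched with itertools.groupby, which is a different technique from the explicit inner/nested buffers above: consecutive lines are grouped by whether they are empty after stripping, and only the non-empty runs are kept. The function name group_words is hypothetical, chosen just for this illustration.

```python
from itertools import groupby

def group_words(lines):
    # Strip whitespace (including the newline) from every line, lazily.
    stripped = (line.strip() for line in lines)
    # bool('') is False, so groupby alternates between runs of blank lines
    # and runs of words; keep only the runs of words.
    return [list(group) for has_text, group in groupby(stripped, key=bool) if has_text]

sample = "river\nmississippi\n\n\nhellow\nkind\n"
print(group_words(sample.splitlines()))
# [['river', 'mississippi'], ['hellow', 'kind']]
```

Since lines can be any iterable of strings, an open file object should work here too, so the file does not have to be read in full first.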
PM 2Ring