Convert a text into a nested list

Question

I am trying to split a text into several lists. I have tried several ways, but I had no success.

Here is an example:

text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"

The result I would like to have is the following:

[['A-0', '100', '20', '10'], ['A-1', '100', '12', '6'], ['A-2', '100', '10', '5']]

I used regex to identify A- as a delimiter for the split. However, I am struggling to split the text. Maybe there is a better way to solve this?

This is just an example; the real use case is a PDF data extractor I managed to build.

You could just split the text at every whitespace (using `text_1.split()`) and then group each four items in one sublist. What exactly did you try and what was the problem with it? — mkrieger1, May 13 '19 at 20:09
See https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks — mkrieger1, May 13 '19 at 20:12

score 1 · Answer 1 · answered May 13 '19 at 20:10


If you know you'll always have groups of 4, you can use zip and iter:

x = iter(text_1.split())

Then

list(zip(*[x]*4)) # or list(zip(x,x,x,x))

Yields

[('A-0', '100', '20', '10'),
 ('A-1', '100', '12', '6'),
 ('A-2', '100', '10', '5')]
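A note on why this works, as a minimal sketch: `[x]*4` repeats the same iterator object four times, so each step of `zip` pulls four consecutive tokens from it. The result holds tuples rather than lists, and `zip` silently drops a trailing group of fewer than four items. The conversion to lists below is an addition for illustration, not part of the answer above; the names `tokens` and `groups` are made up here.

```python
text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"

# One shared iterator: each zip step consumes four consecutive tokens.
tokens = iter(text_1.split())

# Convert each 4-tuple to a list to match the format asked for in the question.
groups = [list(group) for group in zip(*[tokens] * 4)]
print(groups)
```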

answered May 13 '19 at 20:10

rafaelc


Tyler Roberts · Accepted Answer · 2019-05-14T17:11:26.740

0

I think it might be a bit easier to do with the builtin string method .split. With this, you can do the following:

# Add whitespace at the end of text_1 so that 
# the final split will be the same format as all other splits

text_1="A-0 100 20 10 A-1 100 12 6 A-2 100 10 5" + " "


step1 = text_1.split("A-")

# [1:] here because we want to ignore the first empty string from split
step2 = ["A-" + i for i in step1[1:]] 

# [:-1] here because we know the last element in the new split will always be empty 
# because of the whitespace before the next "A-"
final = [i.split(' ')[:-1] for i in step2]

Final will be:

[['A-0', '100', '20', '10'], 
['A-1', '100', '12', '6'], 
['A-2', '100', '10', '5']]

This should work for sublists of arbitrary size.
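One caveat: `split(' ')` yields empty strings when fields are separated by two spaces, as they are in the question's `text_1`. A possible variant, sketched under the assumption that "A-" never appears inside any other field, uses `split()` with no argument, so neither the padding nor the `[:-1]` slice is needed. The names `pieces` and `final` are illustrative only.

```python
text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"

# Split on the "A-" marker, skip the empty leading piece, and restore the marker.
pieces = ["A-" + piece for piece in text_1.split("A-")[1:]]

# str.split() with no argument collapses runs of whitespace and ignores
# leading/trailing whitespace, so no empty strings are produced.
final = [piece.split() for piece in pieces]
print(final)
```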

edited May 14 '19 at 17:11

answered May 13 '19 at 20:13

Tyler Roberts


What happened to the 5 at the end? – mkrieger1 May 13 '19 at 20:16
Ah, the [:-1] is removing that because there's no trailing whitespace at the end of the string as there is in each of the other splits. I suppose you could add a whitespace before doing the splits, with `text_1 + " "`. That being said, I'm sure there's a more optimal way to get around it while still making it work for arbitrarily sized lists. – Tyler Roberts May 13 '19 at 20:32
Hi, I tried this option to check whether it works with an arbitrary list. It did work (with some extra work to clean up the data, though). However, could you clarify the following: mid = ["A-" + i for i in split[1:]]. I don't quite understand "i for i in split[1:]" and "i.split('')[:-1] for i in mid". I get "each i in split", but why "i for i", and why "i.split('')[:-1] for i"? – Abumaru May 14 '19 at 13:36
The `[i for i in iterable]` syntax is called list comprehension. You can read about it [here](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions). Essentially I'm re-creating the list called `split`, and adding back in the "A-" that was removed from `text_1.split()`. The reason that I use i.split(" ")[:-1] for the final is that for each of the strings in `mid`, there is a trailing whitespace at the end which would create an extra element that is just a blank string in each sub-list in `final`. – Tyler Roberts May 14 '19 at 17:06
Changed variable names in my example to be less confusing. `split` is now `step1` `mid` is now `step2` – Tyler Roberts May 14 '19 at 17:11

score 0 · Answer 3 · answered May 13 '19 at 21:25

This is my solution:

text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"
# split text by space
text_array = text_1.split()
# result: ['A-0', '100', '20', '10', 'A-1', '100', '12', '6', 'A-2', '100', '10', '5']

# get array length
text_array_size = len(text_array)
# which is 12 in this case
formatted_text_array = []

# loop once per group of four (3 times here) and slice the array 4 items at a time
for i in range(text_array_size // 4):
    formatted_text_array.append(text_array[i*4:i*4+4])

print(formatted_text_array)
# result: [['A-0', '100', '20', '10'], ['A-1', '100', '12', '6'], ['A-2', '100', '10', '5']]
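The same chunking can be sketched more compactly with a stepped `range` and a list comprehension. The names `tokens`, `chunk_size`, and `chunks` are illustrative. Unlike the loop above, this version keeps a shorter final chunk when the number of tokens is not a multiple of four, which may or may not be what you want.

```python
text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"

tokens = text_1.split()
chunk_size = 4

# Step through the token list in strides of chunk_size and slice out each group.
chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
print(chunks)
```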

BigMoose · Answer 4 · 2019-05-13T22:24:50.223

0

If you want to use a regex (regexes are cool) and have a dynamic number of items in each sublist, try this:

import re
text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"
my_list = re.findall(r'A-[^A]*', text_1)
for i in range(len(my_list)):
    my_list[i] = my_list[i].split()
print(my_list)
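Note that the pattern `A-[^A]*` stops at any capital "A", so a field that itself contains an "A" would cut a group short. A possible alternative, sketched under the assumption that every record starts with "A-" preceded by whitespace, splits on whitespace that is followed by "A-" using a lookahead. The names `records` and `nested` are illustrative.

```python
import re

text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"

# Split only on whitespace that is immediately followed by "A-";
# the lookahead (?=A-) keeps the marker as part of the next record.
records = re.split(r"\s+(?=A-)", text_1)

# Break each record into its whitespace-separated fields.
nested = [record.split() for record in records]
print(nested)
```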

edited May 13 '19 at 22:24

answered May 13 '19 at 22:11

BigMoose


score 0 · Answer 5 · answered May 13 '19 at 23:18

A regex-based approach, since you're already using regex for your solution:

Code

from re import split

def split_lst(regex, string):
  return filter(lambda x: x.strip(), split(regex, string))

text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"

print(list(map(
  lambda x: list(split_lst(r"\s", x)), 
  split_lst(r"(A-\d+\s+\d+\s+\d+\s+\d+)", text_1)
)))

result

[['A-0', '100', '20', '10'], ['A-1', '100', '12', '6'], ['A-2', '100', '10', '5']]
