
I am trying to split a text into several lists. I have tried several approaches without success.

Here is an example:

text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"

The result I would like to have is the following:

[['A-0', '100', '20', '10'], ['A-1', '100', '12', '6'], ['A-2', '100', '10', '5']]

I used regex to identify A- as a delimiter for the split. However, I am struggling to split the text. Maybe there is a better way to solve this?

This is just an example; the real use case is a PDF data extractor I managed to build.

mkrieger1
Abumaru
  • You could just split the text at every whitespace (using `text_1.split()`) and then group each four items in one sublist. What exactly did you try and what was the problem with it? – mkrieger1 May 13 '19 at 20:09
  • See https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks – mkrieger1 May 13 '19 at 20:12

5 Answers


If you know you'll always have groups of 4, you can play with zip and iter:

x = iter(text_1.split())

Then

list(zip(*[x]*4)) # or list(zip(x,x,x,x))

Yields

[('A-0', '100', '20', '10'),
 ('A-1', '100', '12', '6'),
 ('A-2', '100', '10', '5')]
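
A minimal sketch of why this works, assuming Python 3: `[x]*4` holds four references to the same iterator, so `zip` consumes four consecutive tokens for each tuple. The helper names `group_exact` and `group_padded` are hypothetical, introduced only for illustration. Note that `zip` silently drops an incomplete trailing group, while `itertools.zip_longest` pads it instead.

```python
from itertools import zip_longest


def group_exact(tokens, size):
    """Group tokens into lists of `size`; an incomplete last group is dropped."""
    it = iter(tokens)
    # The same iterator appears `size` times, so each tuple takes `size` items.
    return [list(group) for group in zip(*[it] * size)]


def group_padded(tokens, size, fill=None):
    """Like group_exact, but pad an incomplete last group with `fill`."""
    it = iter(tokens)
    return [list(group) for group in zip_longest(*[it] * size, fillvalue=fill)]


text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"
print(group_exact(text_1.split(), 4))
```

If the rows can have a different number of fields, fixed-size grouping like this will misalign them.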
rafaelc

I think it might be a bit easier to do with the built-in string method `.split`. With this, you can do the following:

# Add whitespace at the end of text_1 so that 
# the final split will be the same format as all other splits

text_1="A-0 100 20 10 A-1 100 12 6 A-2 100 10 5" + " "


step1 = text_1.split("A-")

# [1:] here because we want to ignore the first empty string from split
step2 = ["A-" + i for i in step1[1:]] 

# [:-1] here because we know the last element in the new split will always be empty 
# because of the whitespace before the next "A-"
final = [i.split(' ')[:-1] for i in step2]

Final will be:

[['A-0', '100', '20', '10'], 
['A-1', '100', '12', '6'], 
['A-2', '100', '10', '5']]

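This should work for sublists of arbitrary size, as long as the fields are separated by single spaces (with repeated spaces, `split(' ')` produces empty strings that need cleaning up).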
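
As a hedged variation on the same idea, the sketch below splits at whitespace that is immediately followed by the marker, using `re.split` with a lookahead, so no trailing space or `[:-1]` is needed and repeated spaces are tolerated. The function name `split_records` and its `prefix` parameter are hypothetical, and it assumes that no ordinary data field starts with the marker.

```python
import re


def split_records(text, prefix="A-"):
    """Split text into one token list per record that starts with `prefix`."""
    # The lookahead (?=...) matches the prefix without consuming it,
    # so each chunk keeps its own "A-" at the front.
    pattern = r"\s+(?=" + re.escape(prefix) + r")"
    chunks = re.split(pattern, text.strip())
    # str.split() with no argument collapses runs of whitespace.
    return [chunk.split() for chunk in chunks if chunk]


text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"
print(split_records(text_1))
```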

  • What happened to the 5 at the end? – mkrieger1 May 13 '19 at 20:16
  • Ah, the [:-1] is removing that because there's no trailing whitespace at the end of the string as there is in each of the other splits. I suppose you could add a whitespace before doing the splits, with `text_1 + " "`. That being said, I'm sure there's a more optimal way to get around it while still making it work for arbitrary sized lists. – Tyler Roberts May 13 '19 at 20:32
  • Hi, I tried this option to check whether it works with an arbitrary list. It did work (with some extra work to clean up the data, though). However, could you clarify the following: `mid = ["A-" + i for i in split[1:]]`? I don't quite understand `i for i in split[1:]` and `i.split(' ')[:-1] for i in mid`. I get "each i in split", but why `i for i`, and why `i.split(' ')[:-1] for i`? – Abumaru May 14 '19 at 13:36
  • The `[i for i in iterable]` syntax is called a list comprehension. You can read about it [here](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions). Essentially I'm re-creating the list called `split`, and adding back the "A-" that was removed by `text_1.split("A-")`. The reason I use `i.split(" ")[:-1]` for the final step is that each of the strings in `mid` has a trailing whitespace, which would otherwise create an extra blank-string element in each sub-list in `final`. – Tyler Roberts May 14 '19 at 17:06
  • Changed variable names in my example to be less confusing. `split` is now `step1` `mid` is now `step2` – Tyler Roberts May 14 '19 at 17:11

This is my solution:

text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"
# split text by space
text_array = text_1.split()
# result: ['A-0', '100', '20', '10', 'A-1', '100', '12', '6', 'A-2', '100', '10', '5']

# get array length
text_array_size = len(text_array)
# which is 12 in this case
formatted_text_array = []

# loop once per full group (3 times here) and slice the array four items at a time
for i in range(text_array_size // 4):
    formatted_text_array.append(text_array[i*4:i*4+4])

print(formatted_text_array)
# result: [['A-0', '100', '20', '10'], ['A-1', '100', '12', '6'], ['A-2', '100', '10', '5']]
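
The same slicing idea can be sketched as a small reusable helper. The name `chunk` is hypothetical. Unlike a loop that runs `len // 4` times, stepping through the indices with `range(0, len(items), size)` also keeps a shorter trailing group when the length is not a multiple of the group size; whether that is desirable depends on the data.

```python
def chunk(items, size):
    """Return consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Start a new slice every `size` positions; the last one may be shorter.
    return [items[i:i + size] for i in range(0, len(items), size)]


text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"
print(chunk(text_1.split(), 4))
```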
mh. bitarafan

If you want to use a regex (regexes are cool) and have a dynamic number of items in each sublist, try this:

import re
text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"
my_list = re.findall(r'A-[^A]*', text_1)
for i in range(len(my_list)):
    my_list[i] = my_list[i].split()
print(my_list)
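
One caveat with the pattern `A-[^A]*` is that a match stops at any capital `A`, so a field such as `ABC` would cut a record short. As a hedged alternative without a regex, the sketch below starts a new sublist whenever a token begins with the marker. The function name `group_by_marker` and its `marker` parameter are hypothetical, and it assumes that only record IDs start with `A-`.

```python
def group_by_marker(tokens, marker="A-"):
    """Start a new group at each token that begins with `marker`."""
    groups = []
    for token in tokens:
        # Open a new group on a marker token (or for the very first token).
        if token.startswith(marker) or not groups:
            groups.append([token])
        else:
            groups[-1].append(token)
    return groups


text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"
print(group_by_marker(text_1.split()))
```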
BigMoose

A regex-based approach, since you're already using regex for your solution:

Code

from re import split

def split_lst(regex, string):
  return filter(lambda x: x.strip(), split(regex, string))

text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"

print(list(map(
  lambda x: list(split_lst(r"\s", x)), 
  split_lst(r"(A-\d+\s+\d+\s+\d+\s+\d+)", text_1)
)))

result

[['A-0', '100', '20', '10'], ['A-1', '100', '12', '6'], ['A-2', '100', '10', '5']]
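
If every record really has the shape "ID followed by three numbers", a shorter sketch is to let `re.findall` return the capture groups directly. This is an assumption about the data, not a general solution: records with a different number of fields would simply not match. The name `parse_fixed_records` is hypothetical.

```python
import re

# One capture group per field: the "A-<digits>" ID, then three integers.
RECORD = re.compile(r"(A-\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_fixed_records(text):
    """Return one list of four fields per matching record."""
    # With several groups, findall yields one tuple of groups per match.
    return [list(fields) for fields in RECORD.findall(text)]


text_1 = "A-0  100  20  10  A-1  100  12  6  A-2  100  10  5"
print(parse_fixed_records(text_1))
```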


sidmishraw