2

I have below list which I have simplified :

my_list = ['select', 'fruit1', 'fruit2, 'fruit3', 'from', 'basket1',
           'select', 'fruit4', 'from', 'basket2',
           'select', 'fruit5', 'fruit6' 'from', 'basket3', ..... so on]

Note how my list has 'select' and 'from' statements.

The output I am trying to achieve is a DataFrame or let's say Excel output:

Fruit number      Basket number
fruit1            basket1
fruit2            basket1
fruit3            basket1
fruit4            basket2
fruit5            basket3
fruit6            basket3
.                 .
.                 .
.                 .
.                 .

is there a way to achieve this result? I have tried many things but it won't work.. :(

sam
  • 65
  • 5
  • 1
    What have you tried so far? I believe you would have to filter your list to remove the unwanted values (select, from). Then you should find a way to organize your list so that you'd have one fruit 'per line'. For instance `[['fruit1', 'basket1'], ['fruit2', 'basket1']]` – mquasar Dec 22 '21 at 14:01
  • In my opinion it was useful to construct an object first with my_list data like ```[ {'Fruit number':fruit1', 'Basket number':'basket1'},{'Fruit number':fruit2', 'Basket number':'basket1'}, ... ]```, and then convert it to a DataFrame. – Gui Dec 22 '21 at 14:05
  • Is there anything other than `select... from` lists in your data? – Stuart Dec 22 '21 at 14:06
  • @mquasar I have tried to put a while loop where index of "from" is greater than 2 so as to duplicate the basket.. eg. fruit 1, 2 and 3 are 3 elements but 'from' has only 1 element called basket1, so I was able to duplicate basket1 3 times and so on.. but it did not work quite well and failed if I changed the dataset.. – sam Dec 22 '21 at 14:06
  • @sam Put your code so we can see what's going on – mquasar Dec 22 '21 at 14:08
  • 1
    @Gui any suggestions will be appreciated because I have tried many things from my knowledge and on google, but nothing was able to achieve this output which I want.. :) and if you could put it in code, it will be easy for me to understand as well. thank you – sam Dec 22 '21 at 14:08
  • @Stuart there is nothing other than 'select', and 'from',. you can understand it as SQL query where there are multiple select and from statements given in a list. – sam Dec 22 '21 at 14:10
  • @sam, is your list example "real"? Is it a string list, or is like a list of string lists, like ```my_list = [ ['select', ... 'basket1'], ['select', ... 'basket2'] ]``` ? – Gui Dec 22 '21 at 14:17
  • @Gui unfortunately it is one long list. Not list in a list or nested list. It is exactly like how I have pasted in the question.. :( – sam Dec 22 '21 at 14:21

4 Answers4

2

something like the below (use a simple "state machine")

import pandas as pd
lst = ['select', 'fruit1', 'fruit2', 'fruit3', 'from', 'basket1',
       'select', 'fruit4', 'from', 'basket2',
       'select', 'fruit5', 'fruit6', 'from', 'basket3']

data = []
fruits = []
state = 'select'
for word in lst:
  if word == 'select':
    state = 'select'
    continue
  if word == 'from':
    state = 'basket'
    continue
  if state == 'select':
    fruits.append(word)
  if state == 'basket':
    for f in fruits:
      data.append({'fruit':f,'basket':word})
    fruits = []

df = pd.DataFrame(data)
print(df)

output

    fruit   basket
0  fruit1  basket1
1  fruit2  basket1
2  fruit3  basket1
3  fruit4  basket2
4  fruit5  basket3
5  fruit6  basket3
balderman
  • 22,927
  • 7
  • 34
  • 52
1

There are a lot of ways to do this. This approach gets the index of all the 'from', and splits 2 spaces ahead using np.split so that the start of each new array is a 'select'. The last one is empty, so we will drop that.

Then you can build a dict by slicing up each array, and make a dataframe out of it.

import numpy as np
import pandas as pd
my_list = ['select', 'fruit1', 'fruit2', 'fruit3', 'from', 'basket1',
           'select', 'fruit4', 'from', 'basket2',
          'select', 'fruit5', 'fruit6', 'from', 'basket3']

f = [i+2 for i, x in enumerate(my_list) if x == "from"][:-1]
s = np.split(my_list,f)

df = pd.DataFrame([{'basket':q[-1],'fruits':q[1:-2]} for q in s])
df = df.explode('fruits')

Output

    basket  fruits
0  basket1  fruit1
0  basket1  fruit2
0  basket1  fruit3
1  basket2  fruit4
2  basket3  fruit5
2  basket3  fruit6
Chris
  • 15,819
  • 3
  • 24
  • 37
0
data = {'Select' : {'Fruit_Number': 
['fruit1','fruit2','fruit3']},'From' : {'Basket_Number': 
['basket1','basket2','basket3']}}

data2 = data['Select']
data3 = data['From']

df2 = pd.DataFrame.from_dict(data2)
df3 = pd.DataFrame.from_dict(data3)

l = [df2,df3]
df_all = pd.concat(l,axis=1)


      Fruit_Number Basket_Number
0       fruit1       basket1
1       fruit2       basket2
2       fruit3       basket3
B.Quinn
  • 77
  • 1
  • 7
  • 1
    I believe the conversion to this format is what they want help with. – Chris Dec 22 '21 at 14:09
  • Agree!!! But it depends if the data source is easily manipulable... – Gui Dec 22 '21 at 14:11
  • @B.Quinn this method won't work because note how fruit 1, 2 and 3 should be called against basket1..then fruit 4 in basket2 and then...fruit 5, 6 in basket 3.. I am not able to find a workaround to this.. – sam Dec 22 '21 at 14:14
  • @Chris yes, correct.. number of fruits should be equal to number of baskets.. and then converted into DataFrame.. – sam Dec 22 '21 at 14:17
0

Make a generic and reusable split function like the ones in the answers to this question. Then it is easier to yield pairs from each split group.

def split(sequence, sep):
    group = []
    for item in sequence:
        if item == sep:
            yield group
            group = []
        else:
            group.append(item)
    yield group
    
def parse_select(tokens):
    for group in split(tokens, "select"):
        for item in group[:-2]:
            yield item, group[-1]
        
import pandas as pd
print(pd.DataFrame(parse_select(my_list)))

or alternatively:

def parse_select(tokens):
    for group in split(tokens, "select"):
        if group:
            items, (basket,) = split(group, "from")
            for item in items:
                yield item, basket
Stuart
  • 9,597
  • 1
  • 21
  • 30