I have my first serious question in Python.
I have a few nested lists that I need to convert to a pandas DataFrame. It seems easy, but here is what makes it challenging for me:
- the lists are huge (so the code needs to be fast)
- they are nested
- where they are nested, I need all combinations.
So having this input:
la = ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11,12,13], [4]]
lc = [[1], [2, 22], [3], [11,12,13], [4]]
I need the output below:
la lb lc
a 1 1
b 2 2
b 2 22
c 3 3
c 33 3
d 11 11
d 11 12
d 11 13
d 12 11
d 12 12
d 12 13
d 13 11
d 13 12
d 13 13
e 4 4
Note that I need all combinations (the Cartesian product of the values) whenever I have a nested list. At first I tried simply:
import pandas as pd
pd.DataFrame({'la' : [x for x in la],
'lb' : [x for x in lb],
'lc' : [x for x in lc]})
But finding the rows that need expanding and then actually expanding a (huge) DataFrame seemed harder than tinkering with the way I create the DataFrame.
I looked at some great posts about itertools (Flattening a shallow list in Python), the documentation (https://docs.python.org/3.6/library/itertools.html) and generators (What does the "yield" keyword do?), and came up with something like this:
import itertools
def f(la, lb, lc):
    tmp = len(la) == len(lb) == len(lc)
    if tmp:
        for item in range(len(la)):
            len_b = len(lb[item])
            len_c = len(lc[item])
            if (len_b > 1) or (len_c > 1):
                yield list(itertools.product(la[item], lb[item], lc[item]))
                ## above: list is not the result I need,
                ## without it it breaks (not an iterable)
            else:
                yield (la[item], lb[item], lc[item])
    else:
        print('error: unequal length')
which I test
my_gen = f(la, lb, lc)
pd.DataFrame.from_records(my_gen)
which... well... breaks when I yield the itertools.product object directly (it has no length), and creates the wrong data structure after I cast it to a list.
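For reference, here is a minimal sketch of what I think the generator should do instead: hand out one tuple per combination rather than one list per input row. It assumes Python 3.3+ (for `yield from`), and `expand_rows` is just a name I made up for illustration. The scalar from `la` is wrapped in a one-element list so that `itertools.product` does not iterate over the characters of a string.

```python
import itertools

def expand_rows(la, lb, lc):
    """Yield one (a, b, c) tuple per combination of the nested values."""
    if not (len(la) == len(lb) == len(lc)):
        raise ValueError("unequal length")
    for a, bs, cs in zip(la, lb, lc):
        # [a] keeps the label whole; bs and cs are the nested lists.
        # yield from emits each tuple separately instead of one big list.
        yield from itertools.product([a], bs, cs)

la = ['a', 'b', 'c', 'd', 'e']
lb = [[1], [2], [3, 33], [11, 12, 13], [4]]
lc = [[1], [2, 22], [3], [11, 12, 13], [4]]

rows = list(expand_rows(la, lb, lc))
```

With the rows materialised as a list of tuples, something like `pd.DataFrame(rows, columns=['la', 'lb', 'lc'])` should give the table I described, although I have not benchmarked it on large inputs.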
My questions are as follows:
- How can I fix that issue with yielding itertools?
- Is this efficient? In the real application I will be creating the lists by parsing a file, and they will be huge... Any performance tips or better solutions from more advanced colleagues? Right now it breaks/misbehaves, so I can't even benchmark...
- Would it make sense to generate the lists element by element and then use my f function?
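To make the last question concrete, this is a rough sketch of what I mean by generating the data element by element: expanding each record as it is parsed, without building the three big lists first. The file format here (`key;b1,b2;c1,c2`, one record per line) is purely hypothetical, as is the name `rows_from_lines`; my real parser looks different.

```python
import io
import itertools

def rows_from_lines(lines):
    """Yield expanded (la, lb, lc) tuples straight from an iterable of lines.

    Hypothetical format: key;b1,b2,...;c1,c2,... with one record per line.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, b_field, c_field = line.split(';')
        bs = [int(v) for v in b_field.split(',')]
        cs = [int(v) for v in c_field.split(',')]
        # Expand this record immediately; nothing else is kept in memory.
        yield from itertools.product([key], bs, cs)

# io.StringIO stands in for an open file object.
sample = io.StringIO("a;1;1\nb;2;2,22\nc;3,33;3\n")
rows = list(rows_from_lines(sample))
```

This way only one record is held in memory during parsing, but I do not know whether it is actually faster than building the lists first.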
Thank you in advance!