List of lists into dataframe in pandas

Question

I have a list of lists that I want to turn into a dataframe, keeping each sublist's index in the original list as well.

import pandas as pd

x = [["a", "b", "c"], ["A", "B"], ["AA", "BB", "CC"]]

I can do this with a for loop like this:

result = []
for id, row in enumerate(x):
    d = pd.DataFrame({"attr": row, "id": [id]*len(row)})
    result.append(d)
result = pd.concat(result, ignore_index=True)

Or the equivalent generator expression:

pd.concat((pd.DataFrame({"attr": row, "id": [id]*len(row)}) 
           for id, row in enumerate(x)), ignore_index=True)

Both work fine, producing a data frame like:

  attr  id
0    a   0
1    b   0
2    c   0
3    A   1
4    B   1
5   AA   2
6   BB   2
7   CC   2

But it feels like there should be a more 'panda-esque' way of doing it than with a list-loop-append pattern or the equivalent generator.

Can I create the dataframe above with a pandas call, i.e. without the for loop or Python comprehension?

(Preferably a faster solution too: on the 'genres' of the MovieLens data set at https://grouplens.org/datasets/movielens/, flattening the list of genres per movie this way takes >4 seconds, even though there are only 20k entries in total.)

Make sure you mark the best answer with the green check mark so it becomes the accepted answer. — A.Kot, Feb 13 '17 at 22:34

score 1 · Answer 1 · answered Feb 10 '17 at 02:17

I believe stack() is what you are looking for:

pd.DataFrame(x).stack().reset_index().drop('level_1', axis=1)
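
Note that the one-liner above leaves the default column names (`level_0` and `0`) rather than `attr` and `id`. Here is a minimal sketch of one way to get the named columns, assuming `x` is the list from the question. The `.dropna()` is a defensive addition for pandas versions where `stack()` keeps the padding values created for the shorter sublists. The second variant uses `Series.explode`, which only exists in pandas 0.25 and later, so it postdates this answer:

```python
import pandas as pd

x = [["a", "b", "c"], ["A", "B"], ["AA", "BB", "CC"]]

# Variant 1: stack, then name the pieces.
# pd.DataFrame(x) pads the short row with a missing value; dropna() removes it
# in case stack() has not already done so.
stacked = pd.DataFrame(x).stack().dropna()
df_stack = (
    stacked.reset_index(level=1, drop=True)  # drop the position-within-sublist level
    .rename_axis("id")                       # remaining index level is the sublist number
    .reset_index(name="attr")[["attr", "id"]]
)

# Variant 2 (assumes pandas >= 0.25): explode keeps the sublist number as the index.
df_explode = (
    pd.Series(x).explode()
    .rename_axis("id")
    .reset_index(name="attr")[["attr", "id"]]
)

print(df_stack)
```

Both variants should give the same `attr`/`id` layout as the loop in the question. Whether either is faster on the MovieLens data would need to be measured.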

A.Kot

score 0 · Answer 2 · edited May 23 '17 at 10:29

It seems to me that what you need is a fast way to flatten that x list and also create another list of ids. There is a well-read post on efficiently flattening lists.

You can just tweak the basic flattening list comprehension to quickly generate your ids.

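import pandas as pd

x = [["a", "b", "c"], ["A", "B"], ["AA", "BB", "CC"]]
# Flatten the nested list into one list of values
attr = [value for sublist in x for value in sublist]
# Build [0, 0, 0, 1, 1, ...]: each sublist's position, repeated once per element
ids = [sub_id for sublist in [[i]*len(r) for i, r in enumerate(x)] for sub_id in sublist]
df = pd.DataFrame({'attr': attr, 'id': ids})
print(df)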
  attr  id
0    a   0
1    b   0
2    c   0
3    A   1
4    B   1
5   AA   2
6   BB   2
7   CC   2
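
The ids can also be generated without building the intermediate list of lists. Here is a minimal sketch, assuming NumPy is available (the answer itself uses plain Python only); `lengths` is a helper name introduced just for this illustration:

```python
from itertools import chain

import numpy as np
import pandas as pd

x = [["a", "b", "c"], ["A", "B"], ["AA", "BB", "CC"]]

# Number of elements in each sublist
lengths = [len(sublist) for sublist in x]

df = pd.DataFrame({
    # chain.from_iterable flattens one level of nesting
    "attr": list(chain.from_iterable(x)),
    # np.repeat repeats each sublist position by that sublist's length
    "id": np.repeat(np.arange(len(x)), lengths),
})
print(df)
```

This is the same idea as the comprehension above, expressed with `np.repeat`. It is offered as an alternative to try, not as a measured improvement.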

# Testing the time to flatten 20k nested lists
import timeit

setup = '''
vals = [[1], [1,2], [1,2,3], [1,2,3,4]]*5000
lots_of_ids = [attr for sublist in [[i]*len(r) for i, r in enumerate(vals)] for attr in sublist]
'''

print(min(timeit.Timer(setup=setup).repeat(10)))
>>> 0.0471019744873
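
One caveat about the benchmark above: the comprehension is passed as `setup`, and `timeit.Timer` defaults its timed statement to `pass`, so the reported figure does not include the flattening itself. Here is a sketch of how the comprehension could be timed directly, assuming Python 3; the `repeat` and `number` values are arbitrary choices for illustration, and no timing is quoted because it depends on the machine:

```python
import timeit

# Only the input data is built in setup; it is excluded from the measurement
setup = "vals = [[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]] * 5000"

# The statement being timed: one id per element, taken from the sublist position
stmt = "ids = [i for i, sublist in enumerate(vals) for _ in sublist]"

number = 10  # executions of stmt per run
# repeat() returns one total time per run
timings = timeit.repeat(stmt=stmt, setup=setup, repeat=5, number=number)

best_per_call = min(timings) / number
print(best_per_call)
```

Taking the minimum over the runs follows the same convention as the original snippet.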
