
I have a pandas dataframe in which the values of one column are lists. Each list has several elements, and an element can occur in several rows. An example dataframe is:

X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])

X = 
   A          B
0  1  [a, b, c]
1  2     [a, b]
2  3     [c, d]

I want to find, for each element that appears in any of the lists, all the row indexes (i.e. dataframe indexes) where it occurs, and build a dictionary from that. Disregard column A here; column B is the one of interest. For example, element 'a' occurs in indexes 0 and 1, which gives {'a': [0, 1]}. The solution for this example dataframe is:

Y = {'a':[0,1],'b':[0,1],'c':[0,2],'d':[2]}

I have written code that works and produces the correct result. My problem is the speed of computation: my actual dataframe has about 350,000 rows, and the lists in column 'B' can contain up to 1,000 elements, yet the code has already been running for several hours. I suspect my solution is very inefficient. Any help with a faster, more efficient approach would be really appreciated! Here is my solution code:

import itertools
import pandas as pd
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
B_dict = []
for idx, val in X.iterrows():
    B = val['B']
    # map every element of this row's list to the current row index
    B_dict.append(dict(zip(B, [[idx]] * len(B))))
    # merge all per-row dicts into a single dict, concatenating the index lists and skipping None
    B_dict = [{k: list(itertools.chain.from_iterable(list(filter(None.__ne__, [d.get(k) for d in B_dict])))) for k in set().union(*B_dict)}]

print('Result:', B_dict[0])

Output

Result: {'d': [2], 'c': [0, 2], 'b': [0, 1], 'a': [0, 1]}

The code for the final line in the for loop was borrowed from here: Combine values of same keys in a list of dicts, and remove None value from a list without removing the 0 value

  • You're asking for code optimisation, which is part of Code Review and can be found [here](https://codereview.stackexchange.com/) – ZF007 Apr 11 '19 at 20:17
  • Thanks for the quick response! I have posted it on Code Review as well: https://codereview.stackexchange.com/questions/217288/fastest-way-to-find-dataframe-indexes-of-column-elements-that-exist-as-lists – network_coder Apr 11 '19 at 20:46

3 Answers


Explode your list column with the method from this answer: https://stackoverflow.com/a/46740682/9177877

then groupby and apply list:

import numpy as np

# repeat each positional row index once per element in that row's list
idx = np.arange(len(X)).repeat(X['B'].str.len(), 0)
# flatten the lists into a single Series of elements, keeping the original index labels
s = X.iloc[idx, ].assign(B=np.concatenate(X['B'].values))['B']
# group the original indexes by element and collect them into lists
d = s.to_frame().reset_index().groupby('B')['index'].apply(list).to_dict()

# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}
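
For reference, pandas 0.25+ (which postdates this answer) ships a built-in Series.explode that replaces the manual repeat-and-concatenate step; a minimal sketch, assuming that version is available:

import pandas as pd

X = pd.DataFrame([(1, ['a', 'b', 'c']), (2, ['a', 'b']), (3, ['c', 'd'])],
                 columns=['A', 'B'])

# explode repeats the index label once per list element
s = X['B'].explode()
d = s.reset_index().groupby('B')['index'].apply(list).to_dict()
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}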

It's pretty quick on 150,000 rows:

# sample data
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
df = pd.concat([X]*50000).reset_index(drop=True)

%%timeit
idx = np.arange(len(df)).repeat(df['B'].str.len(), 0)
s = df.iloc[idx, ].assign(B=np.concatenate(df['B'].values))['B']
d = s.to_frame().reset_index().groupby('B')['index'].apply(list).to_dict()

# 530 ms ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It_is_Chris
  • Thanks for the solution Chris! Your method is much, much faster than my original solution. But there is an even faster approach, the defaultdict approach shown by ALollz, so I have accepted that as the solution. Thanks! – network_coder Apr 13 '19 at 14:20

I think a defaultdict will work here in about 1 minute:

from collections import defaultdict
from itertools import chain

dd = defaultdict(list)
# pair every list element with its row index, repeated once per element
for k, v in zip(chain.from_iterable(df.B.ravel()), df.index.repeat(df.B.str.len()).tolist()):
    dd[k].append(v)

Output:

defaultdict(list, {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]})
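
If a plain dict is needed to match Y exactly, the defaultdict converts directly:

Y = dict(dd)
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}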

X = pd.DataFrame([(1, ['a', 'b', 'c']*300), (2, ['a', 'b']*50),
                  (3, ['c', 'd']*34)], columns=['A', 'B'])
df = pd.concat([X]*150000).reset_index(drop=True)

%%timeit
dd = defaultdict(list)
for k,v in zip(chain.from_iterable(df.B.ravel()), df.index.repeat(df.B.str.len()).tolist()):
    dd[k].append(v)
#38.1 s ± 238 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
idx = np.arange(len(df)).repeat(df['B'].str.len(), 0)
s = df.iloc[idx, ].assign(B=np.concatenate(df['B'].values))['B']
d = s.to_frame().reset_index().groupby('B')['index'].apply(list).to_dict()
#1min 24s ± 458 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
ALollz
  • Thanks a lot for the very insightful answer! The defaultdict method was much faster than the other methods suggested here, so I have marked it as the correct answer. Thanks! – network_coder Apr 13 '19 at 14:17
  • @network_coder. The answer on the code review that loops twice (sketched below) turns out to be even a bit faster than this, likely because the .str.len() call is slow. Though it's the same general principle. – ALollz Apr 13 '19 at 17:29
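
For context, a two-pass version along those lines would look roughly like this (a sketch; the actual Code Review answer may differ):

from collections import defaultdict

dd = defaultdict(list)
# plain nested Python loops: no .str.len() call needed
for idx, lst in zip(df.index, df['B']):
    for element in lst:
        dd[element].append(idx)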
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])

df = X['B'].apply(pd.Series).T.unstack().reset_index().drop(columns=['level_1']).dropna()
df.groupby(0)['level_0'].apply(list).to_dict()

I make your B column its own DataFrame, transpose it so the indexes become the columns, unstack it, then finish cleaning it up. It looks like:

df

   level_0  0
0        0  a
1        0  b
2        0  c
3        1  a
4        1  b
6        2  c
7        2  d

Then I group by column 0, make it a list, then a dict.
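
Split out step by step, with each intermediate named (a restatement of this answer's own chain, for readability):

import pandas as pd

X = pd.DataFrame([(1, ['a', 'b', 'c']), (2, ['a', 'b']), (3, ['c', 'd'])],
                 columns=['A', 'B'])

wide = X['B'].apply(pd.Series)   # one column per list position, NaN-padded
stacked = wide.T.unstack()       # Series with MultiIndex (row, position)
df = (stacked.reset_index()
             .drop(columns=['level_1'])  # drop the list-position level
             .dropna())                  # drop padding from shorter lists
d = df.groupby(0)['level_0'].apply(list).to_dict()
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}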

Ben Pap
  • Thanks Ben for the response. This was also useful, but as the size of the data increases this solution becomes slower. I posted the question here as well: https://codereview.stackexchange.com/questions/217288/fastest-way-to-find-dataframe-indexes-of-column-elements-that-exist-as-lists/217316#217316. There, your solution's computation efficiency is compared with the defaultdict solution, which I have accepted as the fastest in this case. Thanks! – network_coder Apr 13 '19 at 14:24