I have a pandas dataframe in which the column values exist as lists. Each list has several elements and one element can exist in several rows. An example dataframe is:
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
X =
A B
0 1 [a, b, c]
1 2 [a, b]
2 3 [c, d]
I want to find all the rows, i.e. dataframe indexes, corresponding to elements in the lists, and create a dictionary out of it. Disregard column A here, as column B is the one of interest! So element 'a' occurs in index 0,1, which gives {'a':[0,1]}. The solution for this example dataframe is:
Y = {'a':[0,1],'b':[0,1],'c':[0,2],'d':[2]}
I have written a code that works fine, and I can get a result. My problem is more to do with the speed of computation. My actual dataframe has about 350,000 rows and the lists in the column 'B' can contain up to 1,000 elements. But at present the code is running for several hours! I was wondering whether my solution is very inefficient. Any help with a faster more efficient way will be really appreciated! Here is my solution code:
import itertools
import pandas as pd
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
B_dict = []
for idx,val in X.iterrows():
B = val['B']
B_dict.append(dict(zip(B,[[idx]]*len(B))))
B_dict = [{k: list(itertools.chain.from_iterable(list(filter(None.__ne__, [d.get(k) for d in B_dict])))) for k in set().union(*B_dict)}]
print ('Result:',B_dict[0])
Output
Result: {'d': [2], 'c': [0, 2], 'b': [0, 1], 'a': [0, 1]}
The code for the final line in the for loop was borrowed from here: Combine values of same keys in a list of dicts, and remove None value from a list without removing the 0 value