1

I'm trying to convert some categorical values from a defaultdict(list) into columns of a pandas dataframe. For example, here is the dict I have:

{"user1": ["id1", "id2"], "user2": ["id2", "id3"]} 

The expected output has user1 and user2 as rows and id1, id2, id3 as columns, where a value is 1 if that id appears in the user's list and 0 otherwise.

I currently build the output with a nested for loop over the unique users and ids, but this is really slow. Is there a more efficient way of doing this?

pault
Keyang Zhang

3 Answers

0

Why not use the Pandas built-in from_dict?

import pandas as pd

data = {"user1": ["id1", "id2"], "user2": ["id2", "id3"]}

df = pd.DataFrame.from_dict(data)

df
  user1 user2
0   id1   id2
1   id2   id3

Or if you want rows:

data = {"user1": ["id1", "id2"], "user2": ["id2", "id3"]}

df = pd.DataFrame.from_dict(data, orient='index')

df
         0    1
user2  id2  id3
user1  id1  id2
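
The row-oriented frame above still holds the ids as cell values rather than as column names. As a rough sketch of one way to get from there to the 0/1 layout the question asks for (the names `wide`, `long` and `indicators` are illustrative and not part of the original answer), the frame can be melted into long form and one-hot encoded:

```python
import pandas as pd

data = {"user1": ["id1", "id2"], "user2": ["id2", "id3"]}

# One row per user; the columns are list positions 0, 1, ...
wide = pd.DataFrame.from_dict(data, orient="index")

# Long form: one (user, id) pair per row; drop cells left empty by shorter lists.
long = (
    wide.rename_axis("user")
    .reset_index()
    .melt(id_vars="user", value_name="id")
    .dropna(subset=["id"])
)

# One-hot encode the ids, then collapse to a single 0/1 row per user.
indicators = (
    pd.get_dummies(long.set_index("user")["id"])
    .groupby(level=0)
    .max()
    .astype(int)
)
print(indicators)
```

Taking `max` rather than `sum` keeps the values at 0 or 1 even if an id is repeated in a user's list.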
jxmorris12
  • Thanks, but I meant having id1, id2 and id3 as the column names. The resulting df should have two rows and three columns in this case. – Keyang Zhang May 06 '19 at 17:11
0

Please try this:

import pandas as pd

data = {"user1": ["id1", "id2"], "user2": ["id2", "id3"]}

# Collect the unique ids across all users; these become the (sorted) columns.
cols = sorted({v for val in data.values() for v in val})

# Build one single-row frame per user: the user's name where the id
# is present in that user's list, the string "nan" otherwise.
frames = []
for key, val in data.items():
    row = [key if col in val else "nan" for col in cols]
    frames.append(pd.DataFrame([row], columns=cols))

# DataFrame.append was removed in pandas 2.0, so concatenate the rows instead.
df = pd.concat(frames)
print(df)

output

     id1    id2    id3
0  user1  user1    nan
0    nan  user2  user2
nassim
0

Your desired output is not entirely clear, but as I understand it, here is a solution without loops, in pure pandas. If this is what you are after, I'd recommend inspecting the result of each step (the chain is split across lines so that individual steps are easy to comment out).

Based on the new information provided in the comments, for a dictionary with different length values (adapted from this question):

import pandas as pd

d = {"user1": ["id1", "id2", "id3"], "user2": ["id2", "id3"], "user3": ["id1"]}
df = pd.DataFrame.from_dict(d, orient='index')
df
         0     1     2
user1  id1   id2   id3
user2  id2   id3  None
user3  id1  None  None

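# Unstack into one long Series, one-hot encode the ids, then sum the
# indicators per user (level_1 holds the user labels after reset_index).
(
    pd.get_dummies(df.unstack())
    .reset_index()
    .drop('level_0', axis=1)
    .groupby('level_1')
    .sum()
)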

        id1 id2 id3
level_1         
user1   1   1   1
user2   0   1   1
user3   1   0   0
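
For comparison, here is a shorter sketch that skips the intermediate wide frame. It is not part of the original answer, and the names `s` and `table` are illustrative: the dictionary is turned into a Series of lists, exploded to one id per row, and then cross-tabulated.

```python
import pandas as pd

d = {"user1": ["id1", "id2", "id3"], "user2": ["id2", "id3"], "user3": ["id1"]}

# One row per (user, id) pair; the user label is repeated in the index.
s = pd.Series(d).explode()

# Count occurrences of each id per user, then cap at 1 so that
# an id repeated within one list still yields a 0/1 indicator.
table = pd.crosstab(
    s.index, s.to_numpy(), rownames=["user"], colnames=["id"]
).clip(upper=1)
print(table)
```

Lists of different lengths need no special handling here, because `explode` never creates padding cells in the first place.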
G. Anderson
  • Thanks, but I should have mentioned this in the question: the number of elements in each list might differ. For example: `{"user1": ["id1", "id2", "id3"], "user2": ["id2", "id3"]}` – Keyang Zhang May 06 '19 at 21:48