1

This is a very basic question but I am stuck. I have a dataframe that looks something like this:

  key  file
0 1234 abc.pdf
1 1235 ghi.pdf
2 1234 def.pdf
3 1235 jkl.pdf
4 1235 lmn.pdf

There are a variable number of documents associated with each key. I would like to transform this to something like this:

  key  doc_1      doc_2      doc_3
0 1234 abc.pdf    def.pdf    NaN
1 1235 ghi.pdf    jkl.pdf    lmn.pdf

If I try to use df.pivot, I get a new column named for each document name, which is not what I want. I'm at a loss to figure out the pandas-appropriate way to do this. I have reviewed Reshaping and pivot tables — pandas 1.1.0 documentation but haven't found an answer probably because I don't yet 'get' Pandas.

I can do it in raw Python, generating a dictionary that can be fed to pandas, but I'm positive this isn't the right way to go.

import csv 

data = []
temp_dict = {}
final_dict = {}

with open('output_records.csv') as f:
    csvreader = csv.DictReader(f)
    for row in csvreader:
        data.append(row)

for row in data:
    if row['key'] not in temp_dict:
        temp_dict[row['key']] = list([row['file']])
    else:
        temp_dict[row['key']].append(row['file'])

for item in temp_dict:
    value_dict = {}
    for counter, value in enumerate(temp_dict[item]):
        key = 'doc_' + str(counter)
        value_dict[key] = value
    final_dict[item] = value_dict

Thank you in advance for any suggestions.

Ben

BenC
  • 35
  • 4

4 Answers4

0

You can achieve something similar creating set (in my opinion creating new columns is not the best way to store these data in pandas.DataFrame).

import pandas as pd

keys = [1234, 1235, 1234, 1235, 1235]
files = ["abc.pdf","ghi.pdf","def.pdf","jkl.pdf","lmn.pdf"]

df = pd.DataFrame(zip(keys, files), columns=['key','file'])

def to_set(x):
    return set(x)

df.groupby('key').agg({'file': to_set})
key     file
1234    {abc.pdf, def.pdf}
1235    {lmn.pdf, jkl.pdf, ghi.pdf}

Here is explanation why you need to create to_set(): Pandas groupby and make set of items

  • Michael, thank you. I agree with you - but in this case, the end user requires the doc_1, doc_2, etc. column style for other reasons. – BenC Aug 20 '20 at 11:33
0

Try groupby.apply with unstack:

(df.groupby('key')['file']
  .apply(lambda x: pd.Series(x.tolist(),index=['doc_'+str(i+1) for i in range(x.shape[0])]))
  .unstack())
Space Impact
  • 13,085
  • 23
  • 48
0

Use Pandas str.split to get the output :

(df
 .groupby("key") 
 .agg(",".join)
 .file.str.split(",", expand=True)
 .rename(columns=lambda x: f"doc_{x + 1}")
 .reset_index())

    
     key    doc_1   doc_2   doc_3
0   1234    abc.pdf def.pdf None
1   1235    ghi.pdf jkl.pdf lmn.pdf
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
0

Groupby key, send the file to list to create a dataframe. drop unrequired level and rename the columns using add_prefix

df.groupby('key')['file'].apply(lambda x: pd.DataFrame(x.tolist()).T).reset_index(level=1, drop=True).add_prefix('doc_').reset_index()



 key    doc_0    doc_1    doc_2
0  1234  abc.pdf  def.pdf      NaN
1  1235  ghi.pdf  jkl.pdf  lmn.pdf
wwnde
  • 26,119
  • 6
  • 18
  • 32