0

I have two data frames.

df1:

filename|data
fileA|1
fileB|33
fileC|343

df2:

path|filesize|filetype
/tmp/fileA.csv|123|csv
/tmp/fileB.csv|123|csv
/tmp/fileC.csv|3534|csv
/tmp/fileD.csv|234|csv

I want the result to be

filename|data|path
fileA|1|/tmp/fileA.csv
fileB|33|/tmp/fileB.csv
fileC|343|/tmp/fileC.csv
fileD|3243|/tmp/fileD.csv

This seems extremely simple but I can't seem to get it to work with .assign(). I need to match each row that is in df1.filename with what is in df2.filepath and then add df1['filepath'] to df1.

I tried the following, but it complains that Series is not "hashable"

df1.assign(path = lambda x: df2[df2.path.str.contains(x.filename + ".csv")][path])

{TypeError}'Series' objects are mutable, thus they cannot be hashed

I tested to make sure my df1.assign() was correct by doing

df1.assign(path = lambda x: x.filename)

and it worked and just appended the filename on the df1 (which is what I would expect).

I'm assuming that the problem area is the `contains(x.filename + ".csv") being a "Series". If I change it to x.filename.values I then get

{TypeError}unhashable type: 'numpy.ndarray'. I don't understand what "x" is. I assume its a Series object, but no idea how to tell which "row" its associated with if it is.

I could brute-force this and just loop over df1 but df1 is 2M+ records and loops seem to be generally frowned upon for performance reasons with pandas. Can someone point me into what I'm doing wrong?

Kevin Vasko
  • 1,561
  • 3
  • 22
  • 45
  • 1
    Create a new col in df2 - `df2['filename']=df2['path'].apply(lambda x: x.split('.')[0].split('/')[-1])` then merge - `df3=df1.merge(df2,on='filename')` then you can subset out the columns – Vivek Kalyanarangan Jan 17 '18 at 04:55
  • @VivekKalyanarangan I would really like to know how to do an actual "search" based on df1 values and add the column if possible, as I'll have other "partial" matches where it won't be near as easy to split (in fact in this case I cant simply split, it was just for an example). – Kevin Vasko Jan 17 '18 at 05:02
  • Look at this - https://stackoverflow.com/a/13680953/4098013 – Vivek Kalyanarangan Jan 17 '18 at 05:32

1 Answers1

0

IIUC, I think you want to use str accessor and extract with a regex to pull filename from path and merge on filename:

df2.assign(filename=df2.path.str.extract(r'(\w+)\.csv', expand=True))\
   .merge(df1, on='filename')

Output:

             path  filesize filetype filename  data
0  /tmp/fileA.csv       123      csv    fileA     1
1  /tmp/fileB.csv       123      csv    fileB    33
2  /tmp/fileC.csv      3534      csv    fileC   343
Scott Boston
  • 147,308
  • 15
  • 139
  • 187