How can I do a string object comparison between two csv files in Pandas (Python 2.7) and generate a new dataframe or csv file

Question

I am reading two .csv files, "CSV_1" & "CSV_2". For the most part, they have different columns and data besides the columns I will be using as my string key.

The elements per row in the four columns (basically my key) need to be identical in both in order to concatenate the data from CSV_2 to the end of each row that matches in CSV_1.

The four columns in both csv files are "Date" (format: Month/Day/Year), "Hour" (format: 0-23), "Make", & "Model" and are str obj datatypes.

It would basically read the 1st row in csv_1 and take those four factors, then look through CSV_2 and find any instance the elements of those four columns. Once it finds a match in both, I want to take all the columns in CSV_1 and CSV_2 and concatenate the columns into a new dataframe without duplicating the "Date", "Hour", "Car", & "Model" columns since they would be identical in both. Isuppose I could delete the duplicate columns from the new dataframe.

I am sure there will be a few instances where there wont be any matches, I would still need that data as well, so I guess I would need to fillna or something to just generate blank cells at the end of the row from CSV_1 for each column added from CSV_2 before adding the row to the new dataframe.

I generated fake data for the example, but it should do something like this with the outputs (except the data is read from the two csv files, I cannot provide a snippet for the actual code)

CSV_1:

import pandas as pd
from pandas import DataFrame

date = ['5/10/2012', '10/17/2012', '1/2/2013', '5/3/2014']
hr = ['1', '0', '23', '13']
make = ['Honda', 'Toyota', 'Chevy', 'Honda']
model = ['Accord', 'Camry', 'Sonic', 'Civic']
gas = ['9', '9', '7','8']
safe = ['8', '10', '6','7']


dataSet = zip(date, hr, make, model, gas, safe)
df = pd.DataFrame(data = dataSet, columns=['Date', 'Hour', 'Make', 'Model', 'Gas Rating', 'Safety Rating'])
>>>df

CSV_2:

make2 = ['Honda', 'Toyota', 'Honda']
model2 = ['Accord', 'Camry', 'Civic']
mile = ['10', '10','9']
speed = ['7', '7', '6']

dataSet2 = zip(date, hr, make2, model2, mile, speed)
df2 = pd.DataFrame(data = dataSet2, columns=['Date', 'Hour', 'Make', 'Model', 'Mileage Rating', 'Speed Rating'])
>>>df2

Here is where the string key comparison should come into play and basically give me the output of the following code (The headers will always be the same, but the amount of data will not, there are actually close to 100+ columns in both)

Final_df:

date = ['5/10/2012', '10/17/2012', '1/2/2013', '5/3/2014']
hr = ['1', '0', '23', '13']
make = ['Honda', 'Toyota', 'Chevy', 'Honda']
model = ['Accord', 'Camry', 'Sonic', 'Civic']
gas = ['9', '9', '7', '8']
safe = ['8', '10', '6', '7']
mile = ['10', '10', ' ','9']
speed = ['7', '7', ' ', '6']

dataSet3 = zip(date, hr, make, model, gas, safe, mile, speed)
df3 = pd.DataFrame(data = dataSet3, columns=['Date', 'Hour', 'Make', 'Model', 'Gas Rating', 'Safety Rating', 'Mileage Rating', 'Speed Rating'])
>>>df3

kennes · Accepted Answer · 2015-07-28T15:23:20.990

1

~~Initially, you requested joining on 4 different columns (Make, Model, Date, Hour), but second table really only had two matching (Make, Model).~~

Outer join will still apply, but the original answer is correct:

In [8]: df
Out[8]: 
         Date Hour    Make   Model Gas Rating Safety Rating
0   5/10/2012    1   Honda  Accord          9             8
1  10/17/2012    0  Toyota   Camry          9            10
2    1/2/2013   23   Chevy   Sonic          7             6
3    5/3/2014   13   Honda   Civic          8             7

In [9]: df2
Out[9]: 
         Date Hour    Make   Model Mileage Rating Speed Rating
0   5/10/2012    1   Honda  Accord             10            7
1  10/17/2012    0  Toyota   Camry             10            7
2    1/2/2013   23   Honda   Civic              9            6

In [10]: final = pd.merge(df,df2, how='outer', on=['Date', 'Hour', 'Make', 'Model'])

In [11]: final
Out[11]: 
         Date Hour    Make   Model Gas Rating Safety Rating Mileage Rating  \
0   5/10/2012    1   Honda  Accord          9             8             10   
1  10/17/2012    0  Toyota   Camry          9            10             10   
2    1/2/2013   23   Chevy   Sonic          7             6            NaN   
3    5/3/2014   13   Honda   Civic          8             7            NaN   
4    1/2/2013   23   Honda   Civic        NaN           NaN              9   

  Speed Rating  
0            7  
1            7  
2          NaN  
3          NaN  
4            6  

In [12]: final.fillna(0, inplace=True)

In [13]: final
Out[13]: 
         Date Hour    Make   Model Gas Rating Safety Rating Mileage Rating  \
0   5/10/2012    1   Honda  Accord          9             8             10   
1  10/17/2012    0  Toyota   Camry          9            10             10   
2    1/2/2013   23   Chevy   Sonic          7             6              0   
3    5/3/2014   13   Honda   Civic          8             7              0   
4    1/2/2013   23   Honda   Civic          0             0              9   

  Speed Rating  
0            7  
1            7  
2            0  
3            0  
4            6

edited Jul 28 '15 at 15:23

answered Jul 27 '15 at 18:25

kennes

2,065
17
20

Did you use `.fillna(0, inplace=True)` on the final table? – kennes Jul 27 '15 at 20:04
Im not sure why it did not work for my code. It may be because my data is being read from csv files. When I ran the code, it basically added all the data from df2 to the bottom of df, but inserted nan values across the rows for each column added and for each column in df that was not in df2 – rsotommx Jul 27 '15 at 20:09
It will be hard for me to answer correctly without a complete example. Is the above output I produced correct for your example? – kennes Jul 27 '15 at 20:12
Yes, the above output is exactly what I am trying to do. Basically I have two csv files generated by different parties. For the most part, most data should match up between the two with the four columns set as the key. the only difference is my dataframes were not created, but I don't see why that would create an issue – rsotommx Jul 27 '15 at 20:28
I suggest you verify you're reading the csv's into `DataFrame` objects correctly. Use `pandas.read_csv()`, print out the columns (`df.columns`), the top 50 results (`df.head(50)`) and see that both `DataFrames` come out as expected. – kennes Jul 27 '15 at 20:56
@rsotommx: I fixed my answer to produce the result you requested. – kennes Jul 27 '15 at 21:31
I apologize, I removed my comment when I saw my error, your initial code was correct, it did exactly what it was supposed to do. I also realized I was given the csv file therefore wrong data. So it found no matches because the data had no match in dates. I am awaiting the right data, but I am pretty sure your initial code should work on my code as well. – rsotommx Jul 27 '15 at 23:07
Okay, let me know if the original answer works and I will update. – kennes Jul 28 '15 at 03:52
Your original answer does exactly what I am trying to get my data to do, which is basically add all the data in one row corresponding to any matches found, as well as entering empty cells for anything that had no match since I need all the data in both dataframes. I think for some reason the keys (which are the column headers spelled the same in both dataframes) do not seem to understand and no matches are being found. I verified there are mathces. Is there a way to create keys for the columns in both dataframes individually before merging them? – rsotommx Jul 28 '15 at 15:02
@rsotommx: Can you post your progress in the question above? If you can't share data, test your code on the small example you provided. – kennes Jul 28 '15 at 15:12
So I have narrowed down the errors and I think I have realized where my bug is. Still not sure how to fix it, but I will post a different question addressing it. I will add the link if you would like to look at it, but your answer works for this querstion – rsotommx Jul 29 '15 at 19:11
Here is a link to my new question if you would like to see my update: http://stackoverflow.com/questions/31710688/pandas-left-merging-date-keys-with-different-date-formats-not-timestamps – rsotommx Jul 29 '15 at 20:43

How can I do a string object comparison between two csv files in Pandas (Python 2.7) and generate a new dataframe or csv file

1 Answers1