I am reading two .csv files, "CSV_1" & "CSV_2". For the most part, they have different columns and data besides the columns I will be using as my string key.
The elements per row in the four columns (basically my key) need to be identical in both in order to concatenate the data from CSV_2 to the end of each row that matches in CSV_1.
The four columns in both csv files are "Date" (format: Month/Day/Year), "Hour" (format: 0-23), "Make", & "Model" and are str obj datatypes.
It would basically read the 1st row in csv_1 and take those four factors, then look through CSV_2 and find any instance the elements of those four columns. Once it finds a match in both, I want to take all the columns in CSV_1 and CSV_2 and concatenate the columns into a new dataframe without duplicating the "Date", "Hour", "Car", & "Model" columns since they would be identical in both. Isuppose I could delete the duplicate columns from the new dataframe.
I am sure there will be a few instances where there wont be any matches, I would still need that data as well, so I guess I would need to fillna or something to just generate blank cells at the end of the row from CSV_1 for each column added from CSV_2 before adding the row to the new dataframe.
I generated fake data for the example, but it should do something like this with the outputs (except the data is read from the two csv files, I cannot provide a snippet for the actual code)
CSV_1:
import pandas as pd
from pandas import DataFrame
date = ['5/10/2012', '10/17/2012', '1/2/2013', '5/3/2014']
hr = ['1', '0', '23', '13']
make = ['Honda', 'Toyota', 'Chevy', 'Honda']
model = ['Accord', 'Camry', 'Sonic', 'Civic']
gas = ['9', '9', '7','8']
safe = ['8', '10', '6','7']
dataSet = zip(date, hr, make, model, gas, safe)
df = pd.DataFrame(data = dataSet, columns=['Date', 'Hour', 'Make', 'Model', 'Gas Rating', 'Safety Rating'])
>>>df
CSV_2:
make2 = ['Honda', 'Toyota', 'Honda']
model2 = ['Accord', 'Camry', 'Civic']
mile = ['10', '10','9']
speed = ['7', '7', '6']
dataSet2 = zip(date, hr, make2, model2, mile, speed)
df2 = pd.DataFrame(data = dataSet2, columns=['Date', 'Hour', 'Make', 'Model', 'Mileage Rating', 'Speed Rating'])
>>>df2
Here is where the string key comparison should come into play and basically give me the output of the following code (The headers will always be the same, but the amount of data will not, there are actually close to 100+ columns in both)
Final_df:
date = ['5/10/2012', '10/17/2012', '1/2/2013', '5/3/2014']
hr = ['1', '0', '23', '13']
make = ['Honda', 'Toyota', 'Chevy', 'Honda']
model = ['Accord', 'Camry', 'Sonic', 'Civic']
gas = ['9', '9', '7', '8']
safe = ['8', '10', '6', '7']
mile = ['10', '10', ' ','9']
speed = ['7', '7', ' ', '6']
dataSet3 = zip(date, hr, make, model, gas, safe, mile, speed)
df3 = pd.DataFrame(data = dataSet3, columns=['Date', 'Hour', 'Make', 'Model', 'Gas Rating', 'Safety Rating', 'Mileage Rating', 'Speed Rating'])
>>>df3