1

I have this Data Science problem where I need to create a test set using info provided in two csv files.

Problem

data1.csv

cat,In1,In2
aaa, 0, 1
aaa, 2, 1
aaa, 2, 0
aab, 3, 2
aab, 1, 2

data2.csv

cat,index,attribute1,attribute2
aaa, 0, 150, 450
aaa, 1, 250, 670
aaa, 2, 30, 250
aab, 0, 60, 650
aab, 1, 50, 30
aab, 2, 20, 680
aab, 3, 380, 250

From these two files what I need is a updated data1.csv file. Where in place of In1 and In2, I need the attributes of the specific indices(In1 and In2), under a specific category (cat).

Note: All the indices in a specific category (cat) have their own attributes.

Result should look like this,

updated_data1.csv

cat,In1a1,In1a2,In2a1,In2a2
aaa, 150, 450, 250, 670
aaa, 30, 250, 250, 670
aaa, 30, 250, 150, 450
aab, 380, 250, 20, 680
aab, 50, 30, 20, 680

I need an approach to tackle this problem using pandas in python. So far I have loaded the csv files in to my jupyter notebook. And I have no clue where to start.

Please note this is my first week using python for data manipulation and I have a very little knowledge on python. Also pardon me for ugly formatting. I'm using the mobile phone to type this question.

slhulk
  • 123
  • 4
  • 2
    Look into and research ```pd.merge()``` and the different types of merges in general. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html – Ben Pap Jun 05 '19 at 17:50

1 Answers1

0

As others have suggested, you can use pd.merge. In this case, you need to merge on multiple columns. Basically you need to define which columns from the left DataFrame (here data1) map to which columns from the right DataFrame (here data2). Also see pandas merging 101.

# Read the csvs
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
# DataFrame with the in1 columns
df1 = pd.merge(left=data1, right=data2, left_on = ['cat','In1'], right_on = ['cat', 'index'])
df1 = df1[['cat','attribute1','attribute2']].set_index('cat')
# DataFrame with the in2 columns
df2 = pd.merge(left=data1, right=data2, left_on = ['cat','In2'], right_on = ['cat', 'index'])
df2 = df2[['cat','attribute1','attribute2']].set_index('cat')
# Join the two dataframes together.
df = pd.concat([df1, df2], axis=1)
# Name the columns as desired
df.columns = ['in1a1', 'in1a2', 'in2a1', 'in2a2']

One should generally try to avoid iterating through DataFrames, because it's not very efficient. But it's definitely a possible solution here.

# Read the csvs
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
# This list will be the data for the resulting DataFrame
rows = []
# Iterate through data1, unpacking values in each row to variables
for idx, cat, in1, in2 in data1.itertuples():
    # Create a dictionary for each row where the keys are the column headers of the future DataFrame
    row = {}
    row['cat'] = cat
    # Pick the correct row from data2
    in1 = (data2['index'] == in1) & (data2['cat'] == cat)
    in2 = (data2['index'] == in2) & (data2['cat'] == cat)
    # Assign the correct values to the keys in the dictionary 
    row['in1a1'] = data2.loc[in1, 'attribute1'].values[0]
    row['in1a2'] = data2.loc[in1, 'attribute2'].values[0]
    row['in2a1'] = data2.loc[in2, 'attribute1'].values[0]
    row['in2a2'] = data2.loc[in2, 'attribute2'].values[0]
    # Append the dictionary to the list
    rows.append(row)
# Construct a DataFrame from the list of dictionaries
df = pd.DataFrame(rows)
pnovotnyq
  • 547
  • 3
  • 12
  • Thank you for the lengthy explanation. However when I try to code this to my original problem an error returns saying "too many values to unpack (expected 4)" .. whats the reason behind this? Also, what does **idx** refer to in your code? – slhulk Jun 05 '19 at 21:16
  • @user6237796 `df.itertuples()` iterates through the DataFrame and creates a tuple that contains the index and all the values for each row. That error means that the tuples from your `data1` have more than 4 values, which is the number of variables that you tried to unpack to... I used `idx` to refer to the numeric index generated by pandas when I read the csv (because I didn't pass `index_col`), but it's not important here (I don't use it anywhere). I could have used a dummy variable or indexed `data1` by `cat`, for example. – pnovotnyq Jun 05 '19 at 21:36
  • 1
    Sure! [No need to say thanks](https://stackoverflow.com/help/someone-answers). Accepting the answer is enough to say it worked for you. Later, you will also be able to [upvote](https://stackoverflow.com/help/privileges/vote-up) helpful questions and answers to push content higher. – pnovotnyq Jun 05 '19 at 21:49