-1

I have two data frames:

Dataframe 1 (project data)

project_data = pd.DataFrame({'record_id' : ('1','2','3'), 'account_number' : ('123','124','125'), 'city' :('Vancouver','Miami','Dublin'),'salary' :(70000,80000,65000)})

Dataframe 2 (project data dictionary)

data_dict = pd.DataFrame({'variable_name' : ('record_id','account_number','city','salary'), 'form_name' : ('crosswalk_form','crosswalk_form','demographics','demographics')})

I want to be able to bring in the form_name variable from Dataframe 2 (data dictionary) but don't know how to go about this as there is no common column to join between the data frames.

I was hoping to ask is there a way to join from the column headers in Dataframe 1 (project data) to the column variable_name in Dataframe 2 (data dictionary). or would I need to reshape one of the data frames (possibly Dataframe 1)?

Possible output

merged_data = pd.DataFrame({'record_id' : ('1','2','3'), 'account_number' : ('123','124','125'), 'city' :('Vancouver','Miami','Dublin'),'salary' :(70000,80000,65000),'form_name' : ('crosswalk_form','crosswalk_form','demographics')})

I want this form_name column to categorize the variables for some descriptive statistics I want to do further along. Thanks.

Eoin Vaughan
  • 121
  • 1
  • 10
  • Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – noah Nov 24 '20 at 17:42
  • The `form_name` column in `merged_data` is logically WRONG because `form_name` should be associated with specific columns, not rows. This question thus appears to be unsalvageable through editing. The OP must rethink the use case and define a logically feasible output spec. – Bill Huang Nov 24 '20 at 17:47
  • @BillHuang yes this is correct, I don't have a common key across data frames. I was just wondering was there a way to reshape the data (Dataframe 1) so I could maybe do this merge on the reshaped dataframe and rename the column of the reshaped data frame to be the same as the other data frame (Dataframe 2) so I could do it this way. Thanks. – Eoin Vaughan Nov 24 '20 at 17:55
  • Where did the last row in DF 2 ( `salary demographics`) go in your possible output? Vanished for no reason? – Bill Huang Nov 24 '20 at 18:05
  • @BillHuang DF 2 is not data but in fact metadata describing the project data were collecting. I just wanted to know was there a way we could leverage this data to help categorize fields based on `form_name`. Thnaks for the solution below. I must test whether I can use the new `MultiIndex`. I'm new to this type of work, Multindex is useful for what in particular, examples from your work? grouping columns/rows together? – Eoin Vaughan Nov 24 '20 at 18:18
  • That will depend on your use case in a broader scheme -- but getting the right picture of your business/work logic is unfortunately out of the scope of SO. You may post a separate question with the new scenario when you're done. – Bill Huang Nov 24 '20 at 18:36

1 Answers1

0

A new MultiIndex can be constructed to replace the original columns directly. (Assume the columns are arranged in the same order of the second dataframe).

project_data.columns = pd.MultiIndex.from_tuples([
    (tup.form_name, tup.variable_name) for tup in data_dict.itertuples(index=False)
])

Output

print(project_data)
  crosswalk_form                demographics       
       record_id account_number         city salary
0              1            123    Vancouver  70000
1              2            124        Miami  80000
2              3            125       Dublin  65000

This literally accomplished what the OP asked for. However, whether such replacement can serve the actual use case well or not remains to be clarified.

Bill Huang
  • 4,491
  • 2
  • 13
  • 31