I have two dataframes. The first has only two columns and N rows, where N is in the hundreds to thousands. Each column holds a molecule's name, so it is a dataframe of pairs of molecules.
The second dataframe has 1600 columns and M rows, with M < N. Each row is a molecule and each column is one descriptor, so each molecule has 1600 descriptors.
Given these two dataframes, I want to create a third dataframe that has 3200 columns (1600*2) and N rows. For each pair of molecules, I want the 1600 descriptors of the first molecule, followed (concatenated) by the 1600 descriptors of the second molecule.
So, I will have a new dataframe with 3200 descriptors for each pair of molecules.
Is there a pandas way to combine columns from different DataFrames? My MWE only works for my little example.
I have a MWE; however, when I try using it on the real dataframes, I get this error (diclofenac is the name of the molecule, the equivalent of a, b, or c in the MWE):
Traceback (most recent call last):
  File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'diclofenac'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ml_script.py", line 232, in <module>
    matrix.append(pd.concat([cof_df.loc[row['cof1']], cof_df[row['cof2']]], axis=0))
  File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'diclofenac'
Here is the MWE:
import numpy as np
import pandas as pd
# Dataframe with each molecule's descriptors (real and binary values allowed)
df1 = pd.DataFrame([['a',1,True,3,4], ['b',55,False,76,87],['c',9,True,11,12]], columns=["name", "d1", "d2", "d3", "d4"])
df1 = df1.set_index("name")
# dataframe of pairs of molecules
df2 = pd.DataFrame({'cof1':['a', 'a','c','b'], 'cof2':['c','b','a','c']})
matrix = []
for index, rows in df2.iterrows():
    matrix.append(pd.concat([df1.loc[rows['cof1']], df1.loc[rows['cof2']]], axis=0))
matrix = np.asarray(matrix)
df3 = pd.DataFrame(matrix)
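For comparison, the same paired table can be sketched without the row loop. This is only a minimal sketch on the toy data above, not a tested solution for the real 1600-descriptor frame. It assumes the molecule names in df1's index are unique, and the names `first`, `second`, `paired` and the `_1`/`_2` column suffixes are made up for illustration.

```python
import pandas as pd

# Toy data copied from the MWE above.
df1 = pd.DataFrame(
    [["a", 1, True, 3, 4], ["b", 55, False, 76, 87], ["c", 9, True, 11, 12]],
    columns=["name", "d1", "d2", "d3", "d4"],
).set_index("name")
df2 = pd.DataFrame({"cof1": ["a", "a", "c", "b"], "cof2": ["c", "b", "a", "c"]})

# Look up one descriptor row per pair member. reindex() gives a row of NaN
# for a name that is missing from df1's index instead of raising KeyError.
# The suffixes (hypothetical names) keep the two descriptor blocks distinct.
first = df1.reindex(df2["cof1"]).reset_index(drop=True).add_suffix("_1")
second = df1.reindex(df2["cof2"]).reset_index(drop=True).add_suffix("_2")

# Put the two blocks side by side: N rows, 2 * (number of descriptors) columns.
paired = pd.concat([first, second], axis=1)
print(paired.shape)
```

Because `reindex` does not raise on unknown names, checking `paired.isna().any(axis=1)` afterwards would show which pairs refer to a molecule that has no descriptor row.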
The thing I don't get is that it will successfully print df1.loc[rows['cof1']] to the screen, so it has no issues with the key in this call.
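One detail in the traceback may be relevant here: the failing line in ml_script.py uses cof_df.loc[...] for cof1 but plain cof_df[...] for cof2, and the second traceback goes through `__getitem__` and `self.columns.get_loc`. A minimal sketch of that difference on toy data (a cut-down version of the MWE frame, with the hypothetical names `row`, `col`, and `raised`) would be:

```python
import pandas as pd

# Small frame indexed by molecule name, as in the MWE.
df1 = pd.DataFrame(
    [["a", 1, True, 3, 4], ["b", 55, False, 76, 87]],
    columns=["name", "d1", "d2", "d3", "d4"],
).set_index("name")

row = df1.loc["a"]  # .loc[...] looks the label up in the row index
col = df1["d1"]     # plain [...] looks the label up in the columns

try:
    df1["a"]        # "a" is a row label, not a column name
    raised = False
except KeyError:
    raised = True

print(raised)
```

If this is what happens in the real script, it would explain why printing the .loc lookup for cof1 works while the append line still raises KeyError.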