0

I'm currently trying to write a program that takes a chemical compound's identifier (something called the CID number) and then gives back the compound's properties by using the pubchempy documentation.

However, I keep getting an error when I try to merge the data values that I get from pubchempy to the initial database.

This is the code I've written for now:

import pandas as pd
import pubchempy
import numpy as np

df = pd.read_csv("Data.tsv.txt", sep="\t")

from pubchempy import get_properties

df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))

df = df.drop(df[df.CID=='nan'].index)

df = df.drop(labels='reference', axis=1)
df = df.drop(labels='group', axis=1)
df = df.drop(labels='comments', axis=1)
df = df.drop(labels='compound_name', axis=1)

props = ['HBondDonorCount', 'RotatableBondCount', 'MolecularWeight', 'HBondAcceptorCount']
df2 = pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props))

df = df.merge(df2)

print(df)

However, I get an error message that says,

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

Does anyone know how to fix this?


Few lines of text file (datafile):

NO. compound_name   IUPAC_name  SMILES  CID     Inchi   threshold   reference   group   comments
1   sulphasalazine  2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]diazenyl]benzoic acid O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O    5339    InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18(24)25)21-20-12-4-7-14(8-5-12)28(26,27)22-17-3-1-2-10-19-17/h1-11,23H,(H,19,22)(H,24,25)      R2|R2|R25|R46|  A   
2   moxalactam  7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]-7-methoxy-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-oxa-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid   COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)O)=C(CSc3nnnn3C)COC21 3889    InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8-10-7-35-18-20(34-2,17(33)26(18)13(10)16(31)32)21-14(28)12(15(29)30)9-3-5-11(27)6-4-9/h3-6,12,18,27H,7-8H2,1-2H3,(H,21,28)(H,29,30)(H,31,32)      R25|    A   
3   clioquinol  5-chloro-7-iodoquinolin-8-ol    Oc1c(I)cc(Cl)c2cccnc12  2788    InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1-3-12-8/h1-4,13H      R18|R26|R27|    A   

Few lines of df2 output:

        CID MolecularWeight  HBondDonorCount  HBondAcceptorCount  RotatableBondCount
0      5339           398.4                3                   9                   6
1      3889           520.5                4                  13                   9
2      2788          305.50                1                   2                   0
3   1422517           440.5                0                   8                   4
4  18595497           461.5                5                  10                   3
wjandrea
  • 28,235
  • 9
  • 60
  • 81
  • What are the indexes of the two dataframes? By Default it merges on indexes of the two DF. you can specify the columns to merge on. Refer https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html – Naveed Jun 04 '22 at 15:29
  • Beside the point, but why do you do multiple `df.drop(labels=..., axis=1)`? You could do them all together, plus you could use *columns* instead: `df = df.drop(columns=['reference', 'group', 'comments', 'compound_name'])` – wjandrea Jun 04 '22 at 15:40
  • 1
    What is the desired output, for the few lines of input shown? (If that does not make sense, please try to construct a small dataset that can be used to test the code.) How do you intend for the merge to work? – Karl Knechtel Jun 04 '22 at 15:45
  • The formatting of your TSV has gotten messed up since Stack Overflow converts tabs to spaces. It'd be better to post a [reproducible pandas example](/q/20109391/4518341). (See also [mre].) – wjandrea Jun 04 '22 at 15:46

1 Answers1

0

It seems you want to merge two dataframes on CID column. CID column type of df2 is int, you need change it to object to match the type of CID in df

df = df.merge(df2.astype({'CID': str}), on='CID')
Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52