This is the code I have so far:

import pandas as pd
import pubchempy
import numpy as np

df = pd.read_csv("Data.tsv.txt", sep="\t")

from pubchempy import get_properties

df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))

df = df.drop(df[df.CID=='nan'].index)

df = df.drop( df.index.to_list()[150:] ,axis = 0 )


df['CID']= df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight') if float(x) > 0 else pd.NA)

print(df)

The output that I'm getting under the 'CID' column is this:

CID

[{'CID': 5339, 'MolecularWeight': '398.4'}]

What can I do so that I get only the numerical 'MolecularWeight' value in the 'CID' column (e.g. 398.4 in column one, etc.)?
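
For illustration, here is a minimal sketch of one way to pull the number out, assuming the result always has the shape shown above (a list holding one dict whose 'MolecularWeight' value is a string). The helper name `extract_weight` is hypothetical and not part of pubchempy; the sample list stands in for a real `get_properties` call so the snippet runs without network access.

```python
def extract_weight(records):
    """Return the molecular weight as a float from a get_properties-style result.

    `records` is assumed to look like [{'CID': 5339, 'MolecularWeight': '398.4'}].
    Returns None when the list is empty or the key is missing.
    """
    if not records:
        return None
    # The result is a list, so take its first element before using the dict key.
    value = records[0].get('MolecularWeight')
    return float(value) if value is not None else None


# Stand-in for the value returned by get_properties (hard-coded sample).
sample = [{'CID': 5339, 'MolecularWeight': '398.4'}]
print(extract_weight(sample))  # 398.4
```

Under the same assumption, the lambda in the code above could wrap its call as `extract_weight(get_properties(identifier=x, properties='MolecularWeight'))`, and writing the result to a new column rather than back into 'CID' would keep the identifiers intact.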

  • Can you add the output of `print(df)`? – mujjiga May 19 '22 at 21:50
  • ```df['CID']= df['CID'].map(lambda x: float(get_properties(identifier=x, properties='MolecularWeight')['MolecularWeight']) if float(x) > 0 else pd.NA) ``` – mujjiga May 19 '22 at 21:54
  • I am a bit confused. You don't need a `lambda` to `replace`, and you can chain the calls like `str.replace('.0','').replace('0','')`. BTW, if there is a zero anywhere in your CID, it will just be removed. Also, you are comparing to `"nan"` when it is much easier and faster to compare to `nan`. **Is [{'CID': 5339, 'MolecularWeight': '398.4'}] one of the entries of `df["CID"]` after read_csv?** – Zaero Divide May 19 '22 at 22:26
  • No, [{'CID': 5339, 'MolecularWeight': '398.4'}] is one of the entries after df['CID']= df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight') if float(x) > 0 else pd.NA). I used lambda because, in order to use str.replace('.0','').replace('0',''), I would have to provide a known string value. My code uses pubchempy.get_properties to search PubChem and return the molecular weight of a compound given a specific identification value (the values in the 'CID' column). – New_to_coding May 19 '22 at 22:49
  • `df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))` + `df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))` is actually the same as doing `df['CID'] = df['CID'].astype(str).str.replace('(.)?0',"")`. BTW, if you don't @ us, we don't see that you replied – Zaero Divide May 19 '22 at 22:57
  • @ZaeroDivide Oh damn lol, didn't even know you could @ someone, thank you. Wait, so are you saying I could combine the pubchempy.get_properties function with the df['CID'] = df['CID'].astype(str).str.replace('(.)?0',"") function? – New_to_coding May 19 '22 at 23:33
  • @mujjiga I tried to add an image of the output, but my reputation isn't high enough unfortunately... But the column output is the same as what I posted. The code you suggested also didn't work; the error says, "TypeError: list indices must be integers or slices, not str". – New_to_coding May 20 '22 at 02:40
  • @New_to_coding Give the full dataframe, or at least a part of it. Use `df.to_dict()` and copy/paste the result here so that we can reproduce the error in our IDE. – Drakax May 27 '22 at 22:02
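
The comments above point out that replacing every '0' also strips zeros that belong to the CID. A minimal sketch of one alternative follows, assuming the CIDs arrive as floats or strings such as 5339.0 or '1050.0' with NaN for missing rows. The helper name `clean_cid` is hypothetical, and this is not the approach any commenter proposed.

```python
import math


def clean_cid(value):
    """Normalize one CID value: 5339.0 -> '5339'; NaN or blank -> None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    if text == '' or text.lower() == 'nan':
        return None
    # Converting via float and int drops a trailing '.0' but keeps all other zeros.
    return str(int(float(text)))


print(clean_cid(5339.0))    # 5339
print(clean_cid('1050.0'))  # 1050
```

With pandas this could be applied as `df['CID'] = df['CID'].map(clean_cid)`, followed by `df = df.dropna(subset=['CID'])` to discard rows without an identifier.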
