0

I have a csv file that looks like this:

CCC1=NNC(CC)=C1CC(=O)N[C@H]1C[C@@](C)(O)C1  Z3670357625 M
COC(=O)C1(CNC(=O)NCC2CC2(C)C)CCCC1  Z1366118392 M
CC1(C)CN(CCF)CCN1S(=O)(=O)C1=CNC=N1 Z3847735140 S
CC1=CC(N2CCC(C#N)(C3CCC3)CC2)=NC=C1C#N  Z4174841496 M
CN(CC1=NN=C2C=CC=CN12)C(=O)C1=CC=CC=C1Br    Z979185880  S
CC1=CC(C(NCC(C)N(C)C2CC2)C2CC2)=CC=C1F  Z1586839055 S
CN1N=NC2=CC(CSCC3=CC=C(Br)C=C3)=CC=C21  Z1449657345 S
CCC(C)(C)C(=O)N(C)C1=CC=CC(F)=C1OC  Z4888674994 S
CCC(NC1CC2CCCCN2C1=O)C1=NOC(C)=C1   Z4196015222 M
COCC1(CO)CCN(CC2=CC=CC(C(=O)O)=C2Cl)C1  Z4251400790 M
O=C(C[C@H]1CCNC1)N1CCC(OC2=CC=C(Cl)C=C2)C1  Z4062618130 S
O=C(C[C@@H]1CCNC1)N1CCC(OC2=CC=C(Cl)C=C2)C1 Z4062618465 S
CC1(C)CN(C2CCN(C(=O)OC3COC3)C2)C1   Z4000118072 M
CCC(CC)C(C)C(=O)OCC1=CC(=O)NC=C1    Z2861190319 S
COC1=CC(Br)=C(CN[C@@H]2C[C@@H]3C[C@H](O)[C@H]2C3)C=C1OC |&1:9,13,&2:11,15|  Z3808914614 S

The last line contains characters within ||, which leaves it with four columns instead of the three that the other rows have. How do I delete just |&1:9,13,&2:11,15|? I saw this post: Deleting part of a string pandas DataFrame but I need to delete just this selection, not the entire row starting from this point. I feel this would be easier than trying to just retain the three columns I need throughout the csv, but am open to other ideas. I will need to do this for a number of lines in a very large file. Thanks!

Christian Seitz
  • 728
  • 6
  • 15

2 Answers2

1

Assuming that your content is in a csv file called csvfile.csv try the following:

  1. Code to replace the offending string
#read input file
fin = open("csvfile.csv", "rt")
#read file contents to string
data = fin.read()
#replace all occurrences of the required string
data = data.replace('|&1:9,13,&2:11,15|', '')
#close the input file
fin.close()
#open the input file in write mode
fin = open("csvfile.csv", "wt")
#overrite the input file with the resulting data
fin.write(data)
#close the file
fin.close()

after this you should be able to do:

  1. Code to simply read the csv
import pandas as pd
df = pd.read_csv("csvfile.csv")
lytseeker
  • 305
  • 4
  • 10
0

A much faster way is to skip python altogether. From your terminal:

sed 's/|&1:9,13,&2:11,15|//g' path_to_csv > path_to_modified_csv

This will remove exclusively the undesired part, e.g. (last line):

COC1=CC(Br)=C(CN[C@@H]2C[C@@H]3C[C@H](O)[C@H]2C3)C=C1OC   Z3808914614 S
fsl
  • 3,250
  • 1
  • 10
  • 20