I have a dataset of 500k rows, and its formatting is quite inconsistent. I'm using pandas (in Spyder) to clean it.
One column contains either numbers or strings. I would like to delete the entire row if the cell in that column is a string.
Below is my code, with some adjustments because the data is confidential.
import pandas as pd
import csv
mydataset = pd.read_csv('test.txt', error_bad_lines=False,
                        engine='python',
                        index_col=False, header=None, quoting=csv.QUOTE_NONE,
                        sep=r"[\s|,|/]", names=["1","2","3","4","a","b","c",
                        "h","i","j","k","l","m","n","o","p","f","g",
                        "q","r","s","t","u","v","w","x","y","z",
                        "5","6","7","8","9","10","11","12","13","14"])
print (mydataset.shape)
columns =['3','4','h','a','b','c','i','j','k','l','m','n','f','g']
mydataset.drop(columns,inplace=True,axis=1)
print (mydataset.shape)
# "2" is not a valid attribute name, so that column needs bracket indexing
mydataset = mydataset[(mydataset.q.notnull()) & (mydataset.r.notnull()) &
                      (mydataset.s.notnull()) & (mydataset["2"].notnull()) &
                      (mydataset["2"] != "@")]
Pardon the naming convention of the header.
Example of the data:
1 2 3 4 <--header
abc 123 123 bcd <--Data
123 123 123 bcd <--Data
I would like to detect the "abc" in column 1 and remove the whole row.
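To make the goal concrete, here is a minimal sketch of the kind of filter being asked for. It is one possible approach, not necessarily the best one for 500k rows. The small DataFrame `df` is a hypothetical stand-in for the real data, and it assumes the column to check is the one named "1". It uses `pd.to_numeric` with `errors="coerce"`, which turns anything that cannot be parsed as a number into NaN, and then keeps only the rows where the conversion succeeded.

```python
import pandas as pd

# Hypothetical sample mirroring the example rows above (not the real data).
df = pd.DataFrame({
    "1": ["abc", "123"],
    "2": ["123", "123"],
    "3": ["123", "123"],
    "4": ["bcd", "bcd"],
})

# Try to convert column "1" to numbers; non-numeric cells become NaN.
numeric = pd.to_numeric(df["1"], errors="coerce")

# Keep only the rows where column "1" could be parsed as a number.
cleaned = df[numeric.notna()].reset_index(drop=True)

print(cleaned)
```

Note that cells that were already empty (NaN) in that column would also be dropped by this filter, which may or may not be wanted.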
Please advise!