I have a large Excel file to clean, around 200,000 rows. I'm using pandas to drop unwanted rows when certain conditions are met, but it takes a long time to run.
My current code looks like this:
import phonenumbers as pn
from tqdm import tqdm

def cleanNumbers(number):  # checks whether number is a valid phone number
    valid = True
    try:
        num = pn.parse('+' + str(number), None)
        if not pn.is_valid_number(num):
            valid = False
    except pn.NumberParseException:
        valid = False
    return valid

for UncleanNum in tqdm(TeleNum):
    valid = cleanNumbers(UncleanNum)  # calling the cleanNumbers function
    if not valid:
        # drop the row if the number is not a valid phone number
        df = df.drop(df[df.telephone == UncleanNum].index)
It takes around 30 minutes for this loop to finish. Is there a more efficient way to drop rows with pandas? If not, can I get the same output with numpy?
I'm not that acquainted with pandas or numpy, so any tips you can share would be helpful.
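One alternative I've been wondering about (just a sketch, I haven't benchmarked it) is building a boolean mask once with apply and filtering in a single step, reusing the cleanNumbers function above, instead of calling drop inside the loop:

mask = df['telephone'].apply(cleanNumbers)  # True where the number parses as valid
df = df[mask]  # keep only the rows with valid numbers, in one pass

Would something like that be faster than dropping row by row?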
Edit:
I'm using the phonenumbers lib to check whether the telephone number is valid. If it's not a valid phone number, I drop the row that number is on.
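For reference, the check itself boils down to something like this (the sample number is the US one from the phonenumbers docs, not from my data):

import phonenumbers as pn

num = pn.parse('+16502530000', None)  # region can be None when the string starts with '+'
print(pn.is_valid_number(num))        # prints True for this well-formed US number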
Example data
address     name  surname    telephone
Street St.  Bill  Billinson  7398673456897   <-- let's say this is wrong
Street St.  Nick  Nick       324523452345
Street St.  Sam   Sammy      234523452345
Street St.  Bob   Bob        32452345234534  <-- and this too
Street St.  John  Greg       234523452345
Output
address     name  surname  telephone
Street St.  Nick  Nick     324523452345
Street St.  Sam   Sammy    234523452345
Street St.  John  Greg     234523452345
This is what my code does, but it's slow.