1

I have tried a number of things and just can't seem to get something that works. Basically I have an XLSX file that has 3 columns (Name, Email, Phone Number) and then a number of rows. The phone numbers start with different country codes "+1.xxxxxxxxxx" "+90.xxxxxxxxxx" "+34." etc. The first thing I want to do is delete all rows that don't start with "+1." and would like to further filter by deleting obviously fake and/or incomplete phone number, for instance, sometimes people use '+1.5555555555' or don't give me complete numbers, so I would get '+1.12345678' (not a full 10 digit number, 13 if you count '+1.'). Then after all that is done, write it to .csv.

My code thus far is as follows:

import pandas as pd

xl_file = pd.ExcelFile('testexcel.xlsx')
df = xl_file.parse('Sheet 1', index_col='Name', na_values=['NA'])
#df.drop(df.columns[[0]], axis=1, inplace=True)
df = df[df['Phone'] != '+1.*']
df.to_csv('testingpandas.csv')

The #df.drop... component was just me playing with df.dop and deleting columns as a whole. The next line is my attempt at using this answer to maybe get the result I wanted, but I just can't quite find what is right. And the '*' at the end was what I thought was a wildcard.

Would it be best to filter the '+1's and then write something that then deletes rows without a total string count of 13 in that column?

Community
  • 1
  • 1
Mxracer888
  • 341
  • 1
  • 4
  • 14

1 Answers1

2

You could use pandas indexing in order to filter out rows, or write a custom function.

For the sake of the example I will show the 2 ways, keeping only rows that begin with +1 and of length 13.

pandas indexing:

df = df[(df['Phone'].str.startswith('+1')) & (df['Phone'].str.len() == 13)]

The other way, a custom method:

def filter_by_beginning_and_len(phone):
    return str(phone).startswith('+1') and len(str(phone)) == 13

df = df[df['Phone'].apply(filter_by_beginning_and_len)]
DeepSpace
  • 78,697
  • 11
  • 109
  • 154
  • I tried both and they both error out. The top one seems to be a missing ')' as well as ']'. I think I got them put in the correct place, but then when it runs it just returns a blank document with only the header row (Name, Email, Phone). To confirm, here is the code: **df = df[(df['Phone'].str.startswith('+1') & (df['Phone']).str.len() == 13)]** I added the ']' after the & df['Phone' par, and the ')' after the 13 at the end, but before the closing ]. Just want to make sure that is correct. – Mxracer888 Nov 28 '15 at 22:22
  • The other one errors out saying "Name Error: global name 'x' is not defined. Unfortunately, I do not know how to fix that one. Sorry, I am still rather new at this :/ – Mxracer888 Nov 28 '15 at 22:24
  • Sorry, I indeed had some syntax errors. I fixed both of my examples so please try them again. If you still get an empty dataframe it means that apparently no row had a value in the Phone column that both starts with `+1` and is 13 chars long (including the `+1` and the `.`). – DeepSpace Nov 28 '15 at 22:27
  • Great answer. Works fine on python3 pandas. – Binita Bharati May 26 '22 at 06:42