I have a large Excel file to clean, around 200,000 rows. I'm using pandas to drop unwanted rows when certain conditions are met, but it takes a long time to run.

My current code looks like this

import pandas as pd
import phonenumbers as pn
from tqdm import tqdm

# df is the DataFrame read from the Excel file;
# TeleNum is the list of values in its telephone column.

def cleanNumbers(number):  # checks whether number is a valid phone number
    valid = True
    try:
        num = pn.parse('+' + str(number), None)
        if not pn.is_valid_number(num):
            valid = False
    except pn.NumberParseException:
        valid = False
    return valid

for UncleanNum in tqdm(TeleNum):
    valid = cleanNumbers(UncleanNum)  # calling the cleanNumbers function
    if valid is False:
        # drop the row if the number is not valid
        df = df.drop(df[df.telephone == UncleanNum].index)

It takes around 30 minutes for this loop to finish. Is there a more efficient way to drop rows with pandas? If not, can I use numpy to get the same output?

I'm not that acquainted with pandas or numpy, so any tips you can share would be helpful.

Edit:

I'm using the phonenumbers library to check whether the telephone number is valid. If it's not a valid phone number, I drop the row that number is on.
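
For anyone not familiar with the library, this is roughly what that check does for a single value. The helper name is_valid_telephone and the sample numbers below are just for illustration, not from my real data:

import phonenumbers as pn

def is_valid_telephone(number):
    # Parse as an international number (a leading '+' is added, as in my code)
    # and ask the library whether the result is valid for its region.
    try:
        parsed = pn.parse('+' + str(number), None)
        return pn.is_valid_number(parsed)
    except pn.NumberParseException:  # raised when the input cannot be parsed at all
        return False

print(is_valid_telephone('442083661177'))  # well-formed UK number, expected True
print(is_valid_telephone('123'))           # far too short, expected False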

Example data

address     name    surname     telephone
Street St.  Bill    Billinson   7398673456897   <-- let's say this is wrong
Street St.  Nick    Nick        324523452345
Street St.  Sam     Sammy       234523452345
Street St.  Bob     Bob         32452345234534  <-- and this too
Street St.  John    Greg        234523452345

Output

address     name    surname     telephone
Street St.  Nick    Nick        324523452345
Street St.  Sam     Sammy       234523452345
Street St.  John    Greg        234523452345

This is what my code does, but it's slow.

John Zapanta

1 Answer


In my opinion the main bottleneck here is not the drop, but the custom function being repeated for a large number of values.

Create a list of all the valid numbers and then filter by boolean indexing with Series.isin:

v = [UncleanNum for UncleanNum in tqdm(TeleNum) if cleanNumbers(UncleanNum)]

df = df[df.telephone.isin(v)]
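
For illustration, here is a small self-contained sketch of this pattern on toy data mirroring the question's sample; the 12-digit length test is only a stand-in for the real cleanNumbers check:

import pandas as pd

# Toy frame mirroring the question's sample data.
df = pd.DataFrame({
    'name': ['Bill', 'Nick', 'Sam', 'Bob', 'John'],
    'telephone': [7398673456897, 324523452345, 234523452345,
                  32452345234534, 234523452345],
})

# Stand-in validity check: keep only 12-digit values.
valid = [n for n in df['telephone'] if len(str(n)) == 12]

mask = df['telephone'].isin(valid)  # boolean Series: True where the value is in the list
print(df[mask])                     # Bill and Bob are dropped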

EDIT:

After some testing, the solution can be simplified, because the function returns a boolean:

df1 = df[df['telephone'].apply(cleanNumbers)]
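
Again purely for illustration, the same toy data filtered with apply and a stand-in predicate in place of cleanNumbers:

import pandas as pd

df = pd.DataFrame({
    'name': ['Bill', 'Nick', 'Sam', 'Bob', 'John'],
    'telephone': [7398673456897, 324523452345, 234523452345,
                  32452345234534, 234523452345],
})

# apply runs the predicate on every value and returns a boolean Series,
# which can be used directly for boolean indexing.
is_valid = df['telephone'].apply(lambda n: len(str(n)) == 12)  # stand-in for cleanNumbers
print(df[is_valid])  # same three rows as the isin version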
jezrael
  • Thanks, it works! Also thanks for telling me about boolean indexing and Series.isin; I never knew it could be coded that way. – John Zapanta Aug 08 '19 at 01:22
  • @JohnZapanta - You are welcome! Also added another solution. – jezrael Aug 08 '19 at 06:15
  • If it wouldn't trouble you, can you explain how the code works? Because 'it works', but I want to know how. – John Zapanta Aug 08 '19 at 06:49
  • @JohnZapanta - Do you mean the first or the second solution? – jezrael Aug 08 '19 at 06:50
  • Can you explain both? – John Zapanta Aug 08 '19 at 07:17
  • 1
  • @JohnZapanta - Sure. The first loops over `TeleNum` and filters for each value; maybe the best explanation is [here](https://www.pythonforbeginners.com/basics/list-comprehensions-in-python) or [this](https://stackoverflow.com/a/4406777). The second uses [`Series.apply`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) to run the function on all values of the column and return another Series filled with True and False, which makes filtering by [`boolean indexing`](http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing) possible. – jezrael Aug 08 '19 at 07:25
  • isin approach is FAST. Thanks – Allohvk Jun 08 '22 at 11:50