
I am working on a large dataset where I need to read Excel files and then find the valid phone numbers, but the task takes an enormous amount of time for only 500k rows. For validating the numbers I am using the phonenumbers library (a port of Google's libphonenumber). The processing can be done asynchronously, since the rows are independent.

import dask
import dask.dataframe as dd
import pandas as pd
import phonenumbers

parts = dask.delayed(pd.read_excel)('500k.xlsx')
data = dd.from_delayed(parts)
data['Valid'] = data['Mobile'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x)), meta=('Valid', 'object'))

For background:

phonenumbers.is_valid_number(phonenumbers.parse('+442083661177'))

returns True.

I expect this to finish in under 10 seconds, but it takes around 40 seconds.
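
A quick way to sanity-check the single-threaded baseline (a sketch, assuming the same validation call as in the background example above) is to time the parse-and-validate step on its own and scale it up to 500k rows:

import timeit

# time 1000 parse+validate calls and scale up to estimate the
# single-threaded cost of 500k rows
t = timeit.timeit(
    "phonenumbers.is_valid_number(phonenumbers.parse('+442083661177'))",
    setup="import phonenumbers",
    number=1000)
print(f"roughly {t * 500:.0f}s for 500k sequential calls")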

James Z
Vraj Shah
    Possible duplicate of [Faster way to read Excel files to pandas dataframe](https://stackoverflow.com/questions/28766133/faster-way-to-read-excel-files-to-pandas-dataframe) – emilanov Aug 20 '19 at 08:35
  • @emilanov Reading the dataframe is not the only concern; I also want to process the dataframe in parallel – Vraj Shah Aug 20 '19 at 09:23
  • I'm not sure how many people know "lac" as a suffix, writing it as 500k might make it more accessible – Sam Mason Aug 20 '19 at 09:25
  • Okay! @SamMason I have edited the code – Vraj Shah Aug 20 '19 at 09:28
  • I didn't notice it in the code, I was meaning in the description at the top :) – Sam Mason Aug 20 '19 at 09:30
  • depending on whether `phonelib` [releases the GIL](https://stackoverflow.com/q/1294382/1358308) you might be better off with a different dask scheduler https://docs.dask.org/en/latest/scheduling.html using something other than threads would bypass the GIL and might help – Sam Mason Aug 20 '19 at 09:50
  • @SamMason I tried but the speed is still slow, can you try working on it? Populate a CSV file with the same number, make it huge, and please tell me if you can make it fast – Vraj Shah Aug 20 '19 at 14:31
  • it takes my computer approx 100ms to run your parse and validate code 1000 times, which would imply 500k items would take at least 50 seconds. this is pretty similar to your 40 seconds, and makes me wonder if you're getting any parallelism out of dask at the moment? – Sam Mason Aug 20 '19 at 15:12
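
The scheduler suggestion in the comments above boils down to moving dask off the default threaded scheduler; a minimal sketch, reusing the code from the question (whether this helps depends on phonenumbers releasing the GIL and on the dataframe having more than one partition):

import dask
import dask.dataframe as dd
import pandas as pd
import phonenumbers

parts = dask.delayed(pd.read_excel)('500k.xlsx')
data = dd.from_delayed(parts)
data['Valid'] = data['Mobile'].apply(
    lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x)),
    meta=('Valid', 'object'))

# scheduler='processes' runs tasks in separate worker processes, so the GIL
# stops being a bottleneck; a single-partition dataframe still only yields
# one task, though, so real parallelism also needs multiple partitions
result = data.compute(scheduler='processes')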

1 Answer


I've just been playing with this; you might just need to repartition your dataframe to allow the computation to run in parallel.

I start by generating some data:

import csv
import random
with open('tmp.csv', 'w') as fd:
    out = csv.writer(fd)
    out.writerow(['id', 'number'])
    for i in range(500_000):
        a = random.randrange(1000, 2999)
        b = random.randrange(100_000, 899_999)
        out.writerow([i+1, f'+44 {a} {b}'])

Note that these are mostly valid UK numbers.

I then run something similar to your code:

from dask.distributed import Client
import dask.dataframe as dd
import phonenumbers

def fn(num):
    return phonenumbers.is_valid_number(phonenumbers.parse(num))

with Client(processes=True):
    df = dd.read_csv('tmp.csv')

    # repartition to increase parallelism
    df = df.repartition(npartitions=8)

    df['valid'] = df.number.apply(fn, meta=('valid', 'object'))

    out = df.compute()

This takes ~20 seconds to complete on my laptop (4 cores, 8 threads, Linux 5.2.8), which is only a bit more than double the performance of the plain loop. That suggests dask has quite a bit of runtime overhead, as I'd expect it to be much faster than that. If I remove the call to repartition it takes longer than I'm willing to wait, and top only shows a single process running.
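
An alternative to the explicit repartition, as a sketch (the blocksize value here is an arbitrary assumption, not something tuned): dd.read_csv can be asked to split the file into smaller partitions up front.

import dask.dataframe as dd

# a smaller blocksize makes read_csv produce several partitions straight away,
# so the apply can be spread across workers without a separate repartition step
df = dd.read_csv('tmp.csv', blocksize='4MB')
print(df.npartitions)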

Note that if I rewrite it to do the naive thing with multiprocessing I get much better results:

from multiprocessing import Pool
import pandas as pd

df = pd.read_csv('tmp.csv')

# fn is the same parse-and-validate helper defined above
with Pool(4) as pool:
    df['valid'] = pool.map(fn, df['number'])

This reduces the runtime to ~11 seconds and, as a bonus, is even less code.
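
One caveat worth flagging: on platforms that spawn rather than fork worker processes (Windows, recent macOS), the pool code needs the usual __main__ guard and a module-level helper. A minimal sketch of the same idea under that assumption (the chunksize value is an arbitrary choice):

from multiprocessing import Pool

import pandas as pd
import phonenumbers

def fn(num):
    # parse-and-validate helper; must live at module level so that
    # spawned worker processes can import it
    return phonenumbers.is_valid_number(phonenumbers.parse(num))

if __name__ == '__main__':
    df = pd.read_csv('tmp.csv')
    with Pool(4) as pool:
        # a larger chunksize cuts down on inter-process round trips
        df['valid'] = pool.map(fn, df['number'], chunksize=10_000)
    print(df['valid'].sum(), 'valid numbers')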

Sam Mason
  • Reading CSV is fast @SamMason, any way to speed up reading Excel files too? – Vraj Shah Aug 22 '19 at 10:18
  • note: I'm strongly biased against anything microsoft... why not just convert to CSV as soon as possible and "archive" the excel files, they're generally not very useful for anything apart from sharing with non-computer-literate people – Sam Mason Aug 22 '19 at 10:23
  • Well that is the task: since Excel is cumbersome I am supposed to convert it to CSV, but it takes more time than expected @SamMason, so it would be great if you could help me optimize it. Currently reading the CSV with pandas takes around 30s for 10L rows – Vraj Shah Aug 23 '19 at 09:05
  • is "10L rows" 1 million (i.e. 1e6) rows? people outside india don't really use lakh, you're better off sticking to [SI prefixes](https://en.wikipedia.org/wiki/Metric_prefix) on stackoverflow... reading a 1M row CSV file in pandas on my laptop takes approx 25ms per numeric column and 100ms per text column, i.e. 0.3 seconds for 4 numeric and 2 text columns. there are lots of articles about optimising Python/numeric code, if you can't find anything maybe post another question here or on https://codereview.stackexchange.com – Sam Mason Aug 23 '19 at 10:16
  • If your CSV file contains +99001166773 – Vraj Shah Sep 10 '19 at 14:30
  • The whole system hangs – Vraj Shah Sep 10 '19 at 14:30
  • I get a valid exception parsing that. maybe find some tutorials on exception handling in Python, or post another question... – Sam Mason Sep 11 '19 at 08:45
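
To make that last comment concrete, a minimal sketch (a suggestion, not something from the answer) of wrapping the helper in a try/except so an unparseable number yields False instead of an unhandled exception:

import phonenumbers
from phonenumbers import NumberParseException

def safe_is_valid(num):
    # return False rather than raising when phonenumbers cannot parse the
    # input, e.g. a string like '+99001166773' with an unrecognised country code
    try:
        return phonenumbers.is_valid_number(phonenumbers.parse(num))
    except NumberParseException:
        return False

Either of the apply calls above could then use safe_is_valid in place of the bare lambda or fn.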