I have a process that extracts a large amount of data (~1.5M rows) from multiple URLs. It produces correct results, but it is extremely slow (~4 minutes) and inefficient, so I'm looking for help speeding it up.

I extract the data by taking dates from a list and plugging them into a URL template. The following is an example; the actual list contains 52 elements.

dateList = ['121229', '121222', '121215', '121208', '121201', '121124', '121117', '121110']

The current code is as follows:

import pandas as pd

def dataPull(year):

    columnsName = ['C/A', 'UNIT', 'SCP', 'DATE1', 'TIME1', 'DESC1', 'ENTRIES1', 'EXITS1',
                   'DATE2', 'TIME2', 'DESC2', 'ENTRIES2', 'EXITS2', 'DATE3', 'TIME3', 'DESC3',
                   'ENTRIES3', 'EXITS3', 'DATE4', 'TIME4', 'DESC4', 'ENTRIES4', 'EXITS4',
                   'DATE5', 'TIME5', 'DESC5', 'ENTRIES5', 'EXITS5', 'DATE6', 'TIME6', 'DESC6',
                   'ENTRIES6', 'EXITS6', 'DATE7', 'TIME7', 'DESC7', 'ENTRIES7', 'EXITS7',
                   'DATE8', 'TIME8', 'DESC8', 'ENTRIES8', 'EXITS8']

    df = pd.DataFrame(columns=columnsName)

    for i in dateList:
        # match the two-digit year prefix of each date against the requested year
        if str(year)[2:4] == str(i)[:2]:
            tempUrl = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_' + str(i) + '.txt'
            tempDf = pd.read_csv(tempUrl, header=None, engine='python',
                                 error_bad_lines=False, warn_bad_lines=False)
            tempDf.columns = columnsName
            # the frame grows by concatenation on every iteration
            df = pd.concat([df, tempDf])

    return df

The output looks like this:

     C/A        UNIT    SCP     DATE1   TIME1   DESC1   ENTRIES1    EXITS1  DATE2   TIME2   DESC2   ENTRIES2    EXITS2  DATE3   TIME3   DESC3   ENTRIES3    EXITS3  DATE4   TIME4   DESC4   ENTRIES4    EXITS4  DATE5   TIME5   DESC5   ENTRIES5    EXITS5  DATE6   TIME6   DESC6   ENTRIES6    EXITS6  DATE7   TIME7   DESC7   ENTRIES7    EXITS7  DATE8   TIME8   DESC8   ENTRIES8    EXITS8
0   A002    R051    02-00-00    04-20-13    00:00:00    REGULAR 4084276 1405308 04-20-13    04:00:00    REGULAR 4084308.0   1405312.0   04-20-13    08:00:00    REGULAR 4084332.0   1405348.0   04-20-13    12:00:00    REGULAR 4084429.0   1405441.0   04-20-13    16:00:00    REGULAR 4084714.0   1405494.0   04-20-13    20:00:00    REGULAR 4085107.0   1405550.0   04-21-13    00:00:00    REGULAR 4085286.0   1405578.0   04-21-13    04:00:00    REGULAR 4085317.0   1405582.0
1   A002    R051    02-00-00    04-21-13    08:00:00    REGULAR 4085336 1405603 04-21-13    12:00:00    REGULAR 4085421.0   1405673.0   04-21-13    16:00:00    REGULAR 4085543.0   1405725.0   04-21-13    20:00:00    REGULAR 4085543.0   1405781.0   04-22-13    00:00:00    REGULAR 4085669.0   1405820.0   04-22-13    04:00:00    REGULAR 4085684.0   1405825.0   04-22-13    08:00:00    REGULAR 4085715.0   1405929.0   04-22-13    12:00:00    REGULAR 4085878.0   1406175.0
2   A002    R051    02-00-00    04-22-13    16:00:00    REGULAR 4086116 1406242 04-22-13    20:00:00    REGULAR 4086986.0   1406310.0   04-23-13    00:00:00    REGULAR 4087164.0   1406335.0   04-23-13    04:00:00    REGULAR 4087172.0   1406339.0   04-23-13    08:00:00    REGULAR 4087214.0   1406441.0   04-23-13    12:00:00    REGULAR 4087390.0   1406685.0   04-23-13    16:00:00    REGULAR 4087738.0   1406741.0   04-23-13    20:00:00    REGULAR 4088682.0   1406813.0
3   A002    R051    02-00-00    04-24-13    00:00:00    REGULAR 4088879 1406839 04-24-13    04:00:00    REGULAR 4088890.0   1406845.0   04-24-13    08:00:00    REGULAR 4088934.0   1406951.0   04-24-13    12:00:00    REGULAR 4089105.0   1407209.0   04-24-13    16:00:00    REGULAR 4089378.0   1407269.0   04-24-13    20:00:00    REGULAR 4090319.0   1407336.0   04-25-13    00:00:00    REGULAR 4090535.0   1407365.0   04-25-13    04:00:00    REGULAR 4090550.0   1407370.0
4   A002    R051    02-00-00    04-25-13    08:00:00    REGULAR 4090589 1407469 04-25-13    08:57:03    DOOR OPEN   4090629.0   1407591.0   04-25-13    08:58:01    LOGON   4090629.0   1407591.0   04-25-13    09:01:08    LGF-MAN 4090629.0   1407591.0   04-25-13    09:01:53    LOGON   4090629.0   1407591.0   04-25-13    09:02:02    DOOR CLOSE  4090629.0   1407591.0   04-25-13    09:02:04    DOOR OPEN   4090629.0   1407591.0   04-25-13    09:02:31    DOOR CLOSE  4090629.0   1407591.0
5   A002    R051    02-00-00    04-25-13    09:02:32    DOOR OPEN   4090629 1407591 04-25-13    09:07:21    LOGON   4090629.0   1407591.0   04-25-13    09:12:12    LGF-MAN 4090642.0   1407592.0   04-25-13    09:12:20    DOOR CLOSE  4090642.0   1407592.0   04-25-13    12:00:00    REGULAR 4090743.0   1407723.0   04-25-13    16:00:00    REGULAR 4091064.0   1407793.0   04-25-13    20:00:00    REGULAR 4092044.0   1407840.0   04-26-13    00:00:00    REGULAR 4092314.0   1407859.0
6   A002    R051    02-00-00    04-26-13    04:00:00    REGULAR 4092325 1407861 04-26-13    08:00:00    REGULAR 4092363.0   1407958.0   04-26-13    12:00:00    REGULAR 4092541.0   1408225.0   04-26-13    16:00:00    REGULAR 4092837.0   1408285.0   04-26-13    20:00:00    REGULAR 4093823.0   1408341.0   None    None    None    NaN NaN None    None    None    NaN NaN None    None    None    NaN NaN

Any help is greatly appreciated!

Gorlomi
  • Repeated concatenation will destroy performance. – AMC Feb 16 '20 at 02:41
  • @AMC Any recommendation of alternatives? – Gorlomi Feb 16 '20 at 02:41
  • 1
    Try creating a list of all the DataFrames, and then use a single `pandas.concat()`. Can you share what the output should look like? Also, your variable names aren't following PEP 8. – AMC Feb 16 '20 at 02:42
  • @AMC sure I'll update soon – Gorlomi Feb 16 '20 at 02:43
  • 1
    @AMC PEP 8 is not a requisite standard, purely preferential. – glotchimo Feb 16 '20 at 02:44
  • 1
    @th0nk- It's not the law, but it is a convention. There are perfectly valid reasons to do things differently. – AMC Feb 16 '20 at 02:45
  • 1
    Relevant question: https://stackoverflow.com/questions/36489576/why-does-concatenation-of-dataframes-get-exponentially-slower – AMC Feb 16 '20 at 02:49
  • 2
    A tiny thing: You can set column names in `pandas.read_csv()`, there's no need to assign to the `.columns` attribute after. – AMC Feb 16 '20 at 02:51
  • Also, are you still open to input/feedback on https://stackoverflow.com/questions/60241444/improve-efficiency-of-pandas-transformation-process ? If you can clarify what the input data is, I can take a look. – AMC Feb 16 '20 at 03:15

1 Answer


Don't build the URL with string concatenation; it is slow. Use one of two alternatives:

Use str.join, i.e.

"".join([str1, str2, str3])

In your case,

"".join(['http://web.mta.info/developers/data/nyct/turnstile/turnstile_', str(i), '.txt'])

That should improve your speed pretty significantly.

Look at this post for some context.

Or, since it compiles to even fewer bytecode instructions, use an f-string, like this:

f'http://web.mta.info/developers/data/nyct/turnstile/turnstile_{i}.txt'
glotchimo
  • Is string concatenation causing performance issues, more so than the 52 DataFrame concatenations? – AMC Feb 16 '20 at 02:42
  • `DataFrame` concatenation is *very* different from `str` concatenation. The Pandas toolchain is highly optimized for its own uses, and you can't do much about that. – glotchimo Feb 16 '20 at 02:43
  • @th0nk- This actually helped reduce the time by 10%! But it seems like the biggest problem is the multiple concats. – Gorlomi Feb 16 '20 at 02:48
  • @Gorlomi Can you try replacing the concat/join with an f-string? – AMC Feb 16 '20 at 02:50
  • Yes ^ if you look at the bytecode, an f-string takes fewer steps. Updating. – glotchimo Feb 16 '20 at 02:55