Is there a faster, more efficient way to process my data?

Question

Further to a post I made a couple of weeks ago, I'm reading rows from a spreadsheet (nearly 215,000) and attempting to match them with text files contained in in a sub-directory. On average the number of text files files contained in the sub-directory is 14000. Although my code is working, it is taking an inordinate amount of time to copy the matched files to a second sub-directory. At this rate it's going to be end of August before the job is complete (average processing time is six hours)

Is there a way to improve the efficiency of this algorithm, or indeed is there a better way? My code is below

regards

import glob
import os,sys
import csv
import shutil
import pandas as pd
import fnmatch
import string
import xlrd
from os import listdir
from os.path import isfile 

MDA_Path = 'D:/1994_QTR3' # contains Loughram and MacDonald 10-K files for QTR3
MDA_Path_2 = 'D:/1994_QTR4' # Contains L&M 10-K files for QTR4
MDA_Path_3 = 'D:/1995_QTR1'
MDA_Path_4 = 'D:/1995_QTR2'
MDA_Path_5 = 'D:/1995_QTR3'
MDA_Path_6 = 'D:/1995_QTR4'
MDA_Path_7 = 'D:/1996_QTR1'
MDA_Path_8 = 'D:/1996_QTR2'
MDA_Path_9 = 'D:/1996_QTR3'
MDA_Path_10 = 'D:/1996_QTR4'
MDA_Path_11 = 'D:/1997_QTR1'
MDA_Path_12 = 'D:/1997_QTR2'
MDA_Path_13 = 'D:/1997_QTR3'
MDA_Path_14 = 'D:/1997_QTR4'
MDA_Path_15 = 'D:/1998/QTR1'
MDA_Path_16 = 'D:/1998/QTR2'
MDA_Path_17 = 'D:/1998/QTR3'
MDA_Path_18 = 'D:/1998/QTR4'
MDA_Path_19 = 'D:/1999/QTR1'
MDA_Path_20 = 'D:/1999/QTR2'
MDA_Path_21 = 'D:/1999/QTR3'
MDA_Path_22 = 'D:/1999/QTR4'
MDA_Path_23 = 'D:/2000/QTR1'
MDA_Path_24 = 'D:/2000/QTR2'
MDA_Path_25 = 'D:/2000/QTR3'
MDA_Path_26 = 'D:/2000/QTR4'
MDA_Path_27 = 'D:/2001/QTR1'
MDA_Path_28 = 'D:/2001/QTR2'
MDA_Path_29 = 'D:/2001/QTR3'
MDA_Path_30 = 'D:/2001/QTR4'
MDA_Path_31 = 'D:/2002/QTR1'
MDA_Path_32 = 'D:/2002/QTR2'
MDA_Path_33 = 'D:/2002/QTR3'
MDA_Path_34 = 'D:/2002/QTR4'

MDA_Target_List = r'D:/PhD_Data/Wenrui_Filing_list' # stores wenruis data 

MDA_For_Parsing_1994_QTR3 = 'D:/Required_MDA_1994_QTR3' # will hold all 10-Ks from wenrui's spreadsheet once detected
MDA_For_Parsing_1994_QTR4 = 'D:/Required_MDA_1994_QTR4'
MDA_For_Parsing_1995_QTR1 = 'D:/Required_MDA_1995_QTR1'
MDA_For_Parsing_1995_QTR2 = 'D:/Required_MDA_1995_QTR2'
MDA_For_Parsing_1995_QTR3 = 'D:/Required_MDA_1995_QTR3'
MDA_For_Parsing_1995_QTR4 = 'D:/Required_MDA_1995_QTR4'
MDA_For_Parsing_1996_QTR1 = 'D:/Required_MDA_1996_QTR1'
MDA_For_Parsing_1996_QTR2 = 'D:/Required_MDA_1996_QTR2'
MDA_For_Parsing_1996_QTR3 = 'D:/Required_MDA_1996_QTR3'
MDA_For_Parsing_1996_QTR4 = 'D:/Required_MDA_1996_QTR4'
MDA_For_Parsing_1997_QTR1 = 'D:/Required_MDA_1997_QTR1'
MDA_For_Parsing_1997_QTR2 = 'D:/Required_MDA_1997_QTR2'
MDA_For_Parsing_1997_QTR3 = 'D:/Required_MDA_1997_QTR3'
MDA_For_Parsing_1997_QTR4 = 'D:/Required_MDA_1997_QTR4'
MDA_For_Parsing_1998_QTR1 = 'D:/Required_MDA_1998_QTR1'
MDA_For_Parsing_1998_QTR2 = 'D:/Required_MDA_1998_QTR2'
MDA_For_Parsing_1998_QTR3 = 'D:/Required_MDA_1998_QTR3'
MDA_For_Parsing_1998_QTR4 = 'D:/Required_MDA_1998_QTR4'
MDA_For_Parsing_1999_QTR1 = 'D:/Required_MDA_1999_QTR1'
MDA_For_Parsing_1999_QTR2 = 'D:/Required_MDA_1999_QTR2'
MDA_For_Parsing_1999_QTR3 = 'D:/Required_MDA_1999_QTR3'
MDA_For_Parsing_1999_QTR4 = 'D:/Required_MDA_1999_QTR4'
MDA_For_Parsing_2000_QTR1 = 'D:/Required_MDA_2000_QTR1'
MDA_For_Parsing_2000_QTR2 = 'D:/Required_MDA_2000_QTR2'
MDA_For_Parsing_2000_QTR3 = 'D:/Required_MDA_2000_QTR3'
MDA_For_Parsing_2000_QTR4 = 'D:/Required_MDA_2000_QTR4'
MDA_For_Parsing_2001_QTR1 = 'D:/Required_MDA_2001_QTR1'
MDA_For_Parsing_2001_QTR2 = 'D:/Required_MDA_2001_QTR2'
MDA_For_Parsing_2001_QTR3 = 'D:/Required_MDA_2001_QTR3'
MDA_For_Parsing_2001_QTR4 = 'D:/Required_MDA_2001_QTR4'
MDA_FOR_Parsing_2002_QTR1 = 'D:/Required_MDA_2002_QTR1'
MDA_FOR_Parsing_2002_QTR2 = 'D:/Required_MDA_2002_QTR2'
MDA_FOR_Parsing_2002_QTR3 = 'D:/Required_MDA_2002_QTR3'
MDA_FOR_Parsing_2002_QTR4 = 'D:/Required_MDA_2002_QTR4'


# open the csv file and extract the column containing the location of the text file(s)

datas = pd.read_excel(r'D:/PhD_Data/Wenrui_Filing_list/1994-2017filingslist_Wenrui_13Jul2020.xlsx')

df = pd.DataFrame(datas, columns = ['FILE_NAME'])  # extract the data contained in FILE_NAME column

df['FILE_NAME'] = df['FILE_NAME'].str[26:]  # remove the first 26 characters which contain the edgar drive info

df['FILE_NAME'] = df['FILE_NAME'].str.strip() # remove all leading and trailing


file_length = len(df) # count number of files in Wenrui's list (will need this later to loop through all occurrences)
dirs = os.listdir(MDA_Path_32)
# dirs1 = os.listdir(MDA_Path_3)

for x in range(file_length):

  for file in dirs:
 # if file == df['FILE_NAME'][x]:
   if df['FILE_NAME'][x] in file:
     print(file)
     shutil.copy(MDA_Path_32 + '/' + file, MDA_FOR_Parsing_2002_QTR2) # Move it to QTR  directory```

First of all, stop using variables this way. It says nothing, it's bad to work with and it's an overall mess to have in the source. Your real code is 10 lines, the rest is unreadable/unmanagable in the long run. If you need a long list of variables, use a `dict` i.e. `{"MDA_PATH_1": , "..._2": , ...}`. Start looking for PATTERNS everywhere. The code is here to make your life easier, not make you copy-paste stuff. For example: `{f"path_{year}_{n}": f"D:/{year}/QTR{n}" for year in range(1994, 2003) for n in range(1, 5)}`. Then loop. Then optimize Python. Then use C/C++/Cython. — Peter Badida, Aug 09 '20 at 11:56
Thanks for the advice. Unfortunately C/C++ is not an option with this project, we're committed to using Python. — Dale Addison, Aug 09 '20 at 12:01
Also, https://codereview.stackexchange.com might help you a lot when going through the advices for just random posts. — Peter Badida, Aug 09 '20 at 12:02
The proportion of files so far extracted ranges between 10-60%, more from QTR1 as there are more filings then. The spreadsheet contains the list of ALL the text files needed, whilst the sub-directories contain ALL of the files from a particular quarter. — Dale Addison, Aug 09 '20 at 12:32
The first answer here may be helpful for an idea for speeding up file copying [Python multiprocess/multithreading to speed up file copying](https://stackoverflow.com/questions/44320331/python-multiprocess-multithreading-to-speed-up-file-copying) — DarrylG, Aug 09 '20 at 12:37
and 2) is there any way you can get the year-q from the excel? so you can remove the inner loop and do `if os.path.exists(f'D:/{year}/QTR{q}/{file}'):` — RichieV, Aug 09 '20 at 12:43
RichieV I've just checked the spreadsheet and the text FILE_NAME variable contains the drive path which includes the Year and Quarter information so this is another option. At the moment I'm removing the first 26 characters which contain that information to access the name of the text file. — Dale Addison, Aug 09 '20 at 12:57
You might want to make a folder `D:/Required_MDA/` with subfolders `2002_QTR4/`... it is easier to process just one path and its subs as in this [answer](https://stackoverflow.com/a/1724723/6692898) — RichieV, Aug 09 '20 at 13:07
one last thing... profile your code, I suspect copying is taking all the time... if you only want to read the files and save some results in another file then build a list of paths or use `os.link` — RichieV, Aug 09 '20 at 13:28
You think it's the copy command, not the Panda which is being filled with the required records? — Dale Addison, Aug 09 '20 at 13:42
@Dale, if you implement the suggestions from all comments then it will spend most of the time copying — RichieV, Aug 09 '20 at 13:43
Uncertain at the moment. By posting the code here I hoped people might be able to pin it down to either reading the spreadsheet into a panda or the actual locating of the data or the copying. So far no one has made a suggestion as to why it is so slow. — Dale Addison, Aug 09 '20 at 18:35
see this post on [how to profile a script](https://stackoverflow.com/q/582336/6692898)... have you tried any of the suggestions above? — RichieV, Aug 09 '20 at 18:38
I'm working on it as I write this. The best course of action would appear to be to identify the first and last row of the spreadsheet corresponding to the first and last data items for that quarter, copying them to a new spreadsheet and creating a panda from that sheet. Then doing the locate and copy. — Dale Addison, Aug 09 '20 at 18:41
again, if you already know `year` and `Q` from excel then you don't need to `if df['FILE_NAME'][x] in file:`, just `shutil.copy(f'D:/{year}/QTR{Q}/{df['FILE_NAME'][x], 'D:/Required_MDA_{year}_QTR{Q}/')` — RichieV, Aug 09 '20 at 18:49
BTW, it is just a bit faster to loop df with `for fn in df.FILE_NAME:` ...but I am sure this is not the bottleneck — RichieV, Aug 09 '20 at 18:51
Just tried your version and it reports the following error "TypeError: unsupported operand type(s) for /: 'str' and 'set' " — Dale Addison, Aug 10 '20 at 15:14
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219551/discussion-between-richiev-and-dale-addison). — RichieV, Aug 10 '20 at 15:15

Is there a faster, more efficient way to process my data?

0 Answers0