I want to create a general monitoring program to see which input data is being used by the various models in our company.

Therefore, I want to loop through our (production) model folder, find all the .py or .ipynb files, open them, and read them as strings using glob (and os). For now, I made a loop that looks for all scripts containing the word csv (as a start):

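import glob
import os

path = directory  # root folder of the (production) models, defined elsewhere
search_word = 'csv'
# list to store files that contain the matching word
final_files = []
for folder_path, folders, files in os.walk(path):
    # IPYNB files (os.walk already recurses, so a plain glob per folder is enough)
    pattern = os.path.join(folder_path, '*.IPYNB')
    for filepath in glob.glob(pattern):
        try:
            with open(filepath, encoding='utf-8') as fp:
                # read the file as a string
                data = fp.read()
            if search_word in data:
                final_files.append(filepath)
        except (OSError, UnicodeDecodeError) as exc:
            print(f'Exception while reading file {filepath}: {exc}')
print(final_files)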

This returns all IPYNB files containing the word csv in the script, so I'm able to read within the files.

What I want is that, in the part where I'm now searching for 'csv', the program reads the file (as it does right now) and determines which input data (and, eventually, output) is being used.

For example, one file (.IPYNB) contains this script part (input used for a model):
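
One caveat when reading .ipynb files as plain text: a notebook is stored as JSON, so double quotes inside the code appear escaped and markdown cells are mixed in with the code. Below is a minimal sketch, assuming the nbformat 4 layout (a top-level `cells` list whose items carry `cell_type` and `source`), that pulls out only the code. The function name `notebook_code` and the tiny sample notebook are hypothetical and only there for illustration.

```python
import json


def notebook_code(raw_text):
    """Return the concatenated source of all code cells in a notebook.

    Assumes the nbformat 4 layout: a top-level "cells" list whose items
    have a "cell_type" and a "source" (a string or a list of strings).
    """
    notebook = json.loads(raw_text)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # "source" may be stored as a list of lines or as one string
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source)
    return "\n".join(chunks)


# Tiny hand-written notebook, used only for illustration
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Dataset 1"]},
        {"cell_type": "code",
         "source": ["import pandas as pd\n",
                    "df1 = pd.read_csv('Data.csv', sep=';')"]},
    ],
    "nbformat": 4,
})
code = notebook_code(example)
print(code)
```

In the loop above, `data = fp.read()` could then be passed through such a function before any further analysis, while .py files can be used as read.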

#Dataset 1
df1 = pd.read_csv('Data.csv', sep=';')

#dataset 2
sql_conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=X;DATABASE=X;Trusted_Connection=yes') 
query = "SELECT * FROM database.schema.data2"
df2 = pd.read_sql_query(query, sql_conn)

#dataset 3
sql_conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=X;DATABASE=X;Trusted_Connection=yes') 
query = "SELECT element1, element2 FROM database.schema.data3"
df3 = pd.read_sql_query(query, sql_conn)

How can I make the program extract the following facts:

  • Data.csv
  • database.schema.data2
  • database.schema.data3
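
One possible direction, sketched here under assumptions rather than as a finished solution, is to parse the script with Python's standard `ast` module instead of searching the raw text. That way "the first argument of `read_csv`" is well defined no matter how the line is written. The helper `find_inputs` and the pattern `TABLE_RE` are hypothetical names. The sketch only handles string literals, and it assumes the code is plain Python (IPython magics such as `%matplotlib` would make `ast.parse` fail).

```python
import ast
import re

# Hypothetical pattern: the identifier that follows FROM or JOIN in a SQL string
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.\[\]]+)", re.IGNORECASE)


def find_inputs(source):
    """Collect file names passed to read_csv and table names in SQL strings."""
    tree = ast.parse(source)
    hits = []  # (line number, value) pairs, sorted at the end
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            # Works for both pd.read_csv(...) and a bare read_csv(...)
            func = node.func
            name = getattr(func, "attr", None) or getattr(func, "id", None)
            if name == "read_csv" and node.args:
                first = node.args[0]
                # Only literal strings are handled, not variables or f-strings
                if isinstance(first, ast.Constant) and isinstance(first.value, str):
                    hits.append((first.lineno, first.value))
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            # Any string literal that contains FROM/JOIN is treated as SQL
            for table in TABLE_RE.findall(node.value):
                hits.append((node.lineno, table))
    return [value for _, value in sorted(hits)]


sample = '''
df1 = pd.read_csv('Data.csv', sep=';')
query = "SELECT * FROM database.schema.data2"
df2 = pd.read_sql_query(query, sql_conn)
query = "SELECT element1, element2 FROM database.schema.data3"
df3 = pd.read_sql_query(query, sql_conn)
'''
print(find_inputs(sample))
# ['Data.csv', 'database.schema.data2', 'database.schema.data3']
```

Paths built at runtime (variables, `os.path.join`, f-strings) and queries assembled from several pieces would not be caught by this sketch and would need extra handling.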

Does anyone have a good idea?

Thanks in advance!

Wiktor Stribiżew
  • This might be a duplicate of [Find all strings in python code files](https://stackoverflow.com/questions/585529/find-all-strings-in-python-code-files). But you need something extra (for example losing the `';'`, and keeping only part of the SQL statement), which makes this a very specific problem. – Luuk Jan 26 '23 at 13:44
  • Hi Luuk, thanks for your reply. Using StringIO I'm able to select only the line with, for example, CSV in it. Then the problem arises: how can I select only the words Data.csv from this? Regex might give problems here, given that I'm not sure everybody writes code in the same way. So it should be something like: select the first element (full word/string) passed to the read_csv function. – Joshua_1980 Jan 26 '23 at 15:51
  • A regex like: `['"].*['"]`, might get you started? see: https://regex101.com/r/2IGi8O/1, or maybe you should even use something like: `['"][^'"]*['"]` – Luuk Jan 26 '23 at 16:05
  • Thanks for this, Luuk. Somehow I always find it confusing how to use regex. – Joshua_1980 Jan 26 '23 at 16:29
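
To make the regex from the comments concrete, here is a small sketch applied to the `read_csv` line from the question. `QUOTED` is the second pattern suggested by Luuk. `READ_CSV` is a hypothetical, narrower variant that ties the match to the `read_csv(` call, so the `';'` separator is not picked up. It still assumes the file name is written as a literal string directly inside the call.

```python
import re

# Pattern from the comments: any run of non-quote characters between quotes
QUOTED = re.compile(r"""['"][^'"]*['"]""")

# Hypothetical narrower variant: first quoted string right after read_csv(
READ_CSV = re.compile(r"""read_csv\(\s*['"]([^'"]+)['"]""")

line = "df1 = pd.read_csv('Data.csv', sep=';')"

matches = QUOTED.findall(line)
print(matches)                  # ["'Data.csv'", "';'"]

# Keep the first quoted string and drop the surrounding quotes
first = matches[0].strip("'\"")
print(first)                    # Data.csv

print(READ_CSV.findall(line))   # ['Data.csv']
```

The generic pattern returns every quoted string on the line, which is why the separator shows up as well. Anchoring on the function name is less general but avoids that.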

0 Answers