I want to create a general monitoring program to see which input data is used by the various models in our company.
Therefore, I want to loop through our (production) model folder, find all the .py or .ipynb files, open them, and read them as strings using glob (and os). For now, I made a loop that looks for all scripts containing 'csv' (as a start):
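import glob
import os

path = directory  # root of the (production) model folder, defined elsewhere
search_word = 'csv'
# list to store files that contain the matching word
final_files = []
for folder_path, folders, files in os.walk(path):
    # IPYNB files in the current folder (os.walk already handles the recursion)
    pattern = os.path.join(folder_path, '*.ipynb')
    for filepath in glob.glob(pattern):
        try:
            with open(filepath, encoding='utf-8') as fp:
                # read the file as a string
                data = fp.read()
            if search_word in data:
                final_files.append(filepath)
        except (OSError, UnicodeDecodeError) as exc:
            print(f'Exception while reading file {filepath}: {exc}')
print(final_files)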
This returns all IPYNB files that contain the word csv, so I am able to read the contents of the files.
What I want is that, at the point where I now search for 'csv', the program reads the file (as it does now) and determines which input data (and, eventually, output) is being used.
For example, one file (.IPYNB) contains this part of a script (the input used for a model):
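One thing to keep in mind when reading notebooks as plain strings: an .ipynb file is JSON, so the code appears as escaped string fragments and not as plain Python. A minimal sketch of one way to get the actual code back, assuming notebooks in nbformat 4 (a top-level "cells" list); the helper name notebook_code is hypothetical:

```python
import json


def notebook_code(filepath):
    """Return the source of all code cells in a .ipynb file as one string.

    Assumes nbformat 4, where cells live in a top-level "cells" list.
    """
    with open(filepath, encoding='utf-8') as fp:
        nb = json.load(fp)
    parts = []
    for cell in nb.get('cells', []):
        if cell.get('cell_type') == 'code':
            source = cell.get('source', '')
            # "source" is either a single string or a list of lines
            parts.append(source if isinstance(source, str) else ''.join(source))
    return '\n'.join(parts)
```

The returned string can then be searched in the same way as the contents of a .py file.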
#Dataset 1
df1 = pd.read_csv('Data.csv', sep=';')
#dataset 2
sql_conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=X;DATABASE=X;Trusted_Connection=yes')
query = "SELECT * FROM database.schema.data2"
df2 = pd.read_sql_query(query, sql_conn)
#dataset 3
sql_conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=X;DATABASE=X;Trusted_Connection=yes')
query = "SELECT element1, element2 FROM database.schema.data3"
df3 = pd.read_sql_query(query, sql_conn)
How can I make the program extract the following:
- Data.csv
- database.schema.data2
- database.schema.data3
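To make the goal concrete, here is a rough sketch of the kind of extraction meant, using regular expressions. It is a heuristic under stated assumptions, not a complete solution: it assumes the file name is passed to read_csv as a string literal, and that SQL keywords are written in upper case (so Python's own lowercase "from ... import" lines are not picked up). The names READ_CSV_RE, SQL_TABLE_RE and find_inputs are hypothetical.

```python
import re

# pd.read_csv('Data.csv', ...): capture the first string-literal argument
READ_CSV_RE = re.compile(r"""read_csv\(\s*r?['"]([^'"]+)['"]""")
# "... FROM database.schema.data2": capture the dotted name after FROM or JOIN
# (upper case only, by assumption)
SQL_TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+(\w+(?:\.\w+)*)")


def find_inputs(code):
    """Return the input sources found in code, in order of first appearance."""
    hits = [(m.start(), m.group(1)) for m in READ_CSV_RE.finditer(code)]
    hits += [(m.start(), m.group(1)) for m in SQL_TABLE_RE.finditer(code)]
    seen, result = set(), []
    for _, name in sorted(hits):
        if name not in seen:
            seen.add(name)
            result.append(name)
    return result
```

File names or queries that are built dynamically (variables, string concatenation, f-strings) would not be caught by this approach.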
Does anyone have a good idea?
Thanks in advance!