0

I have downloaded ~100 stored procs as .txt files from SQL Server. From these txt files I am looking to record every iteration of a keyword beginning with "XXX". So every time the word occurs in the script, it is placed into a dataframe with the name of the file next to it.

For example:

File: fileone

Script: "AAA BBB CCC XXXA XXXB DDD"

Would return:

Keyword File
XXXA fileone
XXXB fileone

I have a dataframe of my keywords and would like to loop this across all of my files.

Ideally, resulting in an output that looks like:

Keyword File File File
XXXA fileone filetwo filethree
XXXB fileone filetwo null
XXXC null null filethree

Below is the code that I am using to return the keyword list: I am doing this by taking the combined script of all of my stored procs (copy and pasted into one txt file) and finding all of the keywords that being with "XXX".

with open(allprocs, 'r') as f:
for line in f:
    for word in line.split():
        if word.startswith('XXX.'):
            list.append(word)

new_List = pd.unique(list).tolist()
df1 = pd.DataFrame(new_List,
              columns = ['Tables'])
df1 = df1.drop_duplicates()

1 Answers1

0

As I don't have your data, I provide a solution that generates a dataset for a directory containing some python scripts, and I am looking for words starting with n.

First we need a list of all relevant files in that directory, so we can access them one-by-one and avoid manually copying and pasting contents.

import glob
files = glob.glob("/PATH/*.py")

Next we'll generate a tidy dataframe with a keyword-file mapping.

import pandas as pd
import nltk  # optional
collect = []
for file in files:
    with open(file, 'r') as file_handle:
        for line in file_handle:
            # for word in line.split():
            for word in nltk.word_tokenize(line): # optional
                if word.startswith('n'):
                    collect.append({'keyword': word, 'filename': file.split('/')[-1]})

words_files_tidy = pd.DataFrame.from_records(collect).drop_duplicates()

This gets us a dataframe you first described.

keyword filename
XXXA fileone
XXXC fileone
XXXB filetwo
XXXC filetwo
XXXA filethree

Finally, pivot the dataset to get the desired result.

final_df = words_files_tidy.pivot(index='keyword', columns='filename', values='filename').reset_index()

Which will get you

Keyword fileone filetwo filethree
XXXA fileone null filethree
XXXB fileone filetwo null
XXXC null null filethree

rename the columns if necessary.

Ufos
  • 3,083
  • 2
  • 32
  • 36
  • This is great and does exactly what I want it to. If I wanted to return just the filename and not the entire path, how would I do this? One way I'd think to do it would be to write something like "file_handle = file_handle[61:-20]" would this work? – smitty_werben_jagerm Aug 11 '21 at 12:53
  • Solved this by adding "filename = file[61:-20]" under the with open line and changing the collect.append line to have " 'filename': filename " which cut off the C:users...... and the .txt extensions. Thank you @ufos for the help – smitty_werben_jagerm Aug 11 '21 at 13:14
  • @smitty_werben_jagerm don't use any hacks like that! You can see my quick-and-dirty `file.split('/')[-1]`, which is more reliable. Generally, refer to this thread https://stackoverflow.com/questions/678236/how-to-get-the-filename-without-the-extension-from-a-path-in-python – Ufos Aug 11 '21 at 13:20