Find each keyword in a text file and record with file name

Question

I have downloaded ~100 stored procs as .txt files from SQL Server. From these txt files I am looking to record every iteration of a keyword beginning with "XXX". So every time the word occurs in the script, it is placed into a dataframe with the name of the file next to it.

For example:

File: fileone

Script: "AAA BBB CCC XXXA XXXB DDD"

Would return:

Keyword	File
XXXA	fileone
XXXB	fileone

I have a dataframe of my keywords and would like to loop this across all of my files.

Ideally, resulting in an output that looks like:

Keyword	File	File	File
XXXA	fileone	filetwo	filethree
XXXB	fileone	filetwo	null
XXXC	null	null	filethree

Below is the code that I am using to return the keyword list: I am doing this by taking the combined script of all of my stored procs (copy and pasted into one txt file) and finding all of the keywords that being with "XXX".

with open(allprocs, 'r') as f:
for line in f:
    for word in line.split():
        if word.startswith('XXX.'):
            list.append(word)

new_List = pd.unique(list).tolist()
df1 = pd.DataFrame(new_List,
              columns = ['Tables'])
df1 = df1.drop_duplicates()

score 0 · Accepted Answer · answered Aug 10 '21 at 17:28

As I don't have your data, I provide a solution that generates a dataset for a directory containing some python scripts, and I am looking for words starting with n.

First we need a list of all relevant files in that directory, so we can access them one-by-one and avoid manually copying and pasting contents.

import glob
files = glob.glob("/PATH/*.py")

Next we'll generate a tidy dataframe with a keyword-file mapping.

import pandas as pd
import nltk  # optional
collect = []
for file in files:
    with open(file, 'r') as file_handle:
        for line in file_handle:
            # for word in line.split():
            for word in nltk.word_tokenize(line): # optional
                if word.startswith('n'):
                    collect.append({'keyword': word, 'filename': file.split('/')[-1]})

words_files_tidy = pd.DataFrame.from_records(collect).drop_duplicates()

This gets us a dataframe you first described.

keyword	filename
XXXA	fileone
XXXC	fileone
XXXB	filetwo
XXXC	filetwo
XXXA	filethree

Finally, pivot the dataset to get the desired result.

final_df = words_files_tidy.pivot(index='keyword', columns='filename', values='filename').reset_index()

Which will get you

Keyword	fileone	filetwo	filethree
XXXA	fileone	null	filethree
XXXB	fileone	filetwo	null
XXXC	null	null	filethree

rename the columns if necessary.

This is great and does exactly what I want it to. If I wanted to return just the filename and not the entire path, how would I do this? One way I'd think to do it would be to write something like "file_handle = file_handle[61:-20]" would this work? — smitty_werben_jagerm, Aug 11 '21 at 12:53
Solved this by adding "filename = file[61:-20]" under the with open line and changing the collect.append line to have " 'filename': filename " which cut off the C:users...... and the .txt extensions. Thank you @ufos for the help — smitty_werben_jagerm, Aug 11 '21 at 13:14
@smitty_werben_jagerm don't use any hacks like that! You can see my quick-and-dirty `file.split('/')[-1]`, which is more reliable. Generally, refer to this thread https://stackoverflow.com/questions/678236/how-to-get-the-filename-without-the-extension-from-a-path-in-python — Ufos, Aug 11 '21 at 13:20

Find each keyword in a text file and record with file name

1 Answers1