I have a script that does the following:
- Extracts all text from a PowerPoint (all separated by a ":::")
- Compares each term in my search term list to the text and isolates just those lines of text that contain one or more of the terms
- Creates a DataFrame listing each term and the file in which that term appeared
- Iterates through each PowerPoint in the given folder
I am hoping to adjust this to also include the specific sentence in which each term appears (i.e. the entire content between the ::: before and the ::: after the term).
import os
import pandas as pd
from pptx import Presentation

end = r'C:\Users\xxx\Table Lookup.xlsx'
rfps = r'C:\Users\xxx\Folder1'
ls = os.listdir(rfps)
ppt = [s for s in ls if '.ppt' in s]
files = []
text = []
for p in ppt:
    try:
        # collect the text of every shape on every slide
        prs_text = []
        prs = Presentation(os.path.join(rfps, p))
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    prs_text.append(shape.text)
        prs_text = ':::'.join(prs_text)
        files.append(p)
        text.append(prs_text)
    except Exception:
        print("Failed: " + str(p))
agg = pd.DataFrame()
agg['File'] = files
agg['Unstructured'] = text
agg['Unstructured'] = agg['Unstructured'].str.lower()
terms = ['test','testing']
# one (file, text, term) tuple for every term found in a file's text
a = [(x, z, i) for x, z in zip(agg['File'], agg['Unstructured']) for i in terms if i in z]
#how do I also include the sentence where this term appears
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term']) #will need to add a column here
onepager = onepager.drop_duplicates(keep="first")
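One possible way to get the sentence column is to split each file's text on the ":::" separator and test every segment on its own, rather than testing the whole joined string. The following is only a sketch under assumptions: the `agg` contents are made-up stand-ins for the real data, and `matching_rows` and `hits` are hypothetical names that do not come from the script above.

```python
import pandas as pd

# Toy stand-in for the agg DataFrame (contents are invented for illustration).
agg = pd.DataFrame({
    "File": ["File1.pptx", "File2.pptx"],
    "Unstructured": [
        "internal data:::learn or test and improve:::keep testing models",
        "enter new markets:::launch new business models",
    ],
})
terms = ["test", "testing"]

def matching_rows(file, text, terms, sep=":::"):
    """Yield (file, term, segment) for every segment that contains a term."""
    for segment in text.split(sep):
        for term in terms:
            if term in segment:  # plain substring check, as in the original script
                yield file, term, segment.strip()

rows = [
    row
    for file, text in zip(agg["File"], agg["Unstructured"])
    for row in matching_rows(file, text, terms)
]
hits = pd.DataFrame(rows, columns=["File", "Term", "Sentence"]).drop_duplicates()
print(hits)
```

This yields one row per (file, term, segment) combination, so a term that appears in several segments of the same file produces several rows. The full `Unstructured` text is left out of the result here, but it could be carried along as a fourth tuple element if needed.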
One-row sample of agg:
File | Unstructured
File1.pptx | competitive offerings:::real-time insights and analyses for immediate use:::disruptive “moves”:::deeper strategic insights through analyses generated and assessed over time:::launch new business models:::enter new markets::::::::::::internal data:::external data:::advanced computing capabilities:::insights & applications::::::::::::::::::machine learning
write algorithms that continue to “learn” or test and improve themselves as they ingest data and identify patterns:::natural language processing
allow interactions between computers and human languages using voice and/or text. machines directly interact, analyze, understand, and reproduce information:::intelligent automation
Adjustment based on input:
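onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])
def sentence(x, t):
    # text between the ':::' before and the ':::' after the first occurrence of t
    before, _, after = x.partition(t)
    return before.rpartition(':::')[2] + t + after.partition(':::')[0]
onepager['Sentence'] = [sentence(x, t) for x, t in zip(onepager['Unstructured'], onepager['Term'])]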
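One thing to watch with the term list: 'test' is a substring of 'testing', so a plain substring check reports a segment that only contains 'testing' under both terms. If whole-word matches are what is wanted, a word-boundary regex is one option. This is a minimal sketch, assuming the terms are ordinary words. `whole_word_segments` is a hypothetical helper name and the sample string is invented.

```python
import re

def whole_word_segments(text, term, sep=":::"):
    """Return the segments of text that contain term as a whole word."""
    # re.escape guards against regex metacharacters inside the term
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return [seg.strip() for seg in text.split(sep) if pattern.search(seg)]

sample = "learn or test and improve:::keep testing models:::external data"
print(whole_word_segments(sample, "test"))     # ['learn or test and improve']
print(whole_word_segments(sample, "testing"))  # ['keep testing models']
```

Note that `\b` relies on the term starting and ending with a word character, so terms that begin or end with punctuation would need a different boundary rule.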