I have a script that does the following:

  1. Extracts all text from a PowerPoint (all separated by a ":::")
  2. Compares each term in my search term list to the text and isolates just those lines of text that contain one or more of the terms
  3. Creates a dataframe of each term and the file in which that term appeared
  4. Iterates through each PowerPoint for the given folder

I would like to adjust this so the output also includes the sentence in which each term appears, i.e. the entire text between the ":::" before the term and the ":::" after it.

import os

import pandas as pd
from pptx import Presentation  # provided by the python-pptx package

end = r'C:\Users\xxx\Table Lookup.xlsx'
rfps = r'C:\Users\xxx\Folder1'
ls = os.listdir(rfps)
ppt = [s for s in ls if '.ppt' in s]

files = []
text = []

for p in ppt:
    try:
        prs_text = []
        prs = Presentation(os.path.join(rfps, p))
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    prs_text.append(shape.text)
        prs_text = ':::'.join(prs_text)
        files.append(p)
        text.append(prs_text)
    except Exception as exc:  # a bare except would also swallow KeyboardInterrupt
        print("Failed: " + str(p) + " (" + str(exc) + ")")

agg = pd.DataFrame()
agg['File'] = files
agg['Unstructured'] = text
agg['Unstructured'] = agg['Unstructured'].str.lower()

terms = ['test','testing']

a = [(f, u, t) for f, u in zip(agg['File'], agg['Unstructured']) for t in terms if t in u]
#how do I also include the sentence where this term appears

onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term']) #will need to add a column here
onepager = onepager.drop_duplicates(keep="first")

One-row sample of `agg`:

File | Unstructured
File1.pptx | competitive offerings:::real-time insights and analyses for immediate use:::disruptive “moves”:::deeper strategic insights through analyses generated and assessed over time:::launch new business models:::enter new markets::::::::::::internal data:::external data:::advanced computing capabilities:::insights & applications::::::::::::::::::machine learning
write algorithms that continue to “learn” or test and improve themselves as they ingest data and identify patterns:::natural language processing
allow interactions between computers and human languages using voice and/or text. machines directly interact, analyze, understand, and reproduce information:::intelligent automation

Adjustment based on input:

onepager = pd.DataFrame(a, columns=['File', 'Unstructured','Term'])
for t in terms:
    onepager['Sentence'] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t))+3: x.find(":::",x.find(t))-3])
RCarmody
  • Please include a sample of `agg`. See [MRE pandas](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – not_speshal Jun 28 '21 at 21:54
  • Updated to include one line of the data... notice how it's an extremely long concatenation of text in the Unstructured field – RCarmody Jun 28 '21 at 21:57
  • And what is your expected output? Neither `test` nor `testing` is in your "extremely long" text. – not_speshal Jun 28 '21 at 22:01
  • Adjusted the data to include "test". My real search terms are different and private, which is why I replaced them with generic terms – RCarmody Jun 28 '21 at 22:07

1 Answer

To find the sentence containing the word "test", try:

>>> agg["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find("test"))+3: x.find(":::", x.find("test"))])
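
The index arithmetic behind that kind of slice can be checked on a small string. The following is a minimal sketch, not part of the original answer: `segment_around` is a hypothetical helper name, and the fallbacks for a term in the first or last segment (where `rfind`/`find` return -1) are an added assumption about the desired behaviour.

```python
def segment_around(text, term, sep=":::"):
    """Return the sep-delimited segment holding the first occurrence of term."""
    pos = text.find(term)
    if pos == -1:
        return None  # term not present at all
    # Last separator that ends before the term; -1 means the term is in the first segment.
    start = text.rfind(sep, 0, pos)
    start = 0 if start == -1 else start + len(sep)
    # First separator at or after the term; -1 means the term is in the last segment.
    end = text.find(sep, pos)
    end = len(text) if end == -1 else end
    return text[start:end]


sample = "internal data:::learn or test and improve:::machine learning"
print(segment_around(sample, "test"))  # learn or test and improve
```

Note that the end index returned by `find` already points at the start of the next ":::", so nothing needs to be subtracted from it.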

Looping through your terms:

onepager = pd.DataFrame(a, columns=['File', 'Unstructured','Term'])
for t in terms:
    # one column per term: text from just after the previous ":::" up to the next ":::"
    onepager[t] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t))+3: x.find(":::", x.find(t))])
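
If one row per (file, term, sentence) is preferred over one column per term, splitting on ":::" first avoids the index bookkeeping entirely. This is a rough sketch under that assumption, not the approach above: `matching_segments` is a hypothetical name, the sample text is made up, and every matching segment is reported rather than only the first.

```python
def matching_segments(files, texts, terms, sep=":::"):
    """Collect (file, term, segment) for every segment that contains a term."""
    rows = []
    for name, text in zip(files, texts):
        for segment in text.lower().split(sep):
            segment = segment.strip()
            for term in terms:
                if term in segment:  # plain substring match, as in the question
                    rows.append((name, term, segment))
    return rows


files = ["File1.pptx"]
texts = ["internal data:::learn or test and improve:::testing in new markets"]
rows = matching_segments(files, texts, ["test", "testing"])
for row in rows:
    print(row)
```

The resulting list can be passed to `pd.DataFrame(rows, columns=['File', 'Term', 'Sentence'])`. Because the match is a substring test, "test" also matches inside "testing", so a segment can appear once per matching term.
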
not_speshal
  • I've added this piece into my code and made it iterate through all of my search terms... I am not getting an error, but just seemingly random outputs (also the exact same output for each file, not unique to the term). Is there a specific reason why this would not work? – RCarmody Jun 28 '21 at 23:22
  • You keep assigning the output to the "Sentence" column. I'm not sure what you're trying to do. You need a new column for each term. Try my edited answer. – not_speshal Jun 29 '21 at 12:20