
I've written code to extract acronyms and other key words from a PDF to create a glossary list. I've removed duplicates, sorted the terms alphabetically, and created a pandas DataFrame.

A portion of this dataframe (called gloss_of_terms1) looks like this:

    Acronyms/Abbrev
    0       EMP
    1       EFT
    2       FCF
    3       FY14
    4       FY15
    5       FY16
    6       GDN
    7       GP

I effectively want to write some code to group the items FY14, FY15, FY16 (separated by spaces) into one line and have the resultant dataframe look as follows:

    Acronyms/Abbrev
    0       EMP
    1       EFT
    2       FCF
    3       FY14 FY15 FY16
    4       GDN
    5       GP

I'm having no luck finding the correct tool/code to do this. Please help!

Sayse
dweir247
  • Without an MCVE I will not write a code-based answer, but your best approach is to check the first two letters of each term as you're reading it and place matching terms in a list. Then use pandas to turn the nested list into a DataFrame. – Edeki Okoh Aug 12 '19 at 16:11
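
The approach suggested in this comment could be sketched roughly as follows. This is only an illustration, not the commenter's code: the `terms` list stands in for whatever the PDF extraction produces, and `group_by_prefix` is a hypothetical helper name.

```python
import pandas as pd

# Assumed input: the already de-duplicated, sorted terms from the PDF
terms = ['EMP', 'EFT', 'FCF', 'FY14', 'FY15', 'FY16', 'GDN', 'GP']

def group_by_prefix(items, prefix='FY'):
    """Hypothetical helper: put every item starting with `prefix` into one
    shared bucket (placed where the first such item appears) and keep all
    other items as single-element buckets."""
    grouped = []
    bucket = None
    for item in items:
        if item.startswith(prefix):
            if bucket is None:
                bucket = []
                grouped.append(bucket)
            bucket.append(item)
        else:
            grouped.append([item])
    # Join each bucket with spaces, as asked for in the question
    return [' '.join(g) for g in grouped]

gloss = pd.DataFrame({'Acronyms/Abbrev': group_by_prefix(terms)})
```

Because the check is on the literal prefix rather than on the first two characters in general, other terms that share two leading letters are left on their own rows.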

1 Answer


Someone has achieved something similar (concatenating within a group by) here

But you will also need to create a dummy index to group by, in this case a substring of the first two characters. You could modify this substring if you wanted to group on something different.

    # Pandas library
    import pandas as pd
    # Create the dataset
    df = pd.DataFrame({'Abbrev': ['EMP', 'EFT', 'FCF', 'FY14', 'FY15', 'FY16', 'GDN', 'GP']})
    # Dummy index with the substring (slice) to compare
    df['ix'] = df['Abbrev'].str.slice(0, 2)
    # Group by the dummy index, joining each group's items with spaces
    df = df.groupby(['ix'])['Abbrev'].apply(lambda x: ' '.join(x)).reset_index()
    # Drop the dummy index
    df = df.drop(['ix'], axis=1)
    # Show result
    df

For example, if you would rather group only the entries whose first two letters are FY, replace the 6th line (where ix is created) with:

    # Dummy index: 'FY' for the FY entries, the full string for everything else
    df['ix'] = df['Abbrev'].apply(lambda x: x[0:2] if x[0:2] == 'FY' else x)
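
Putting the pieces together, a minimal end-to-end sketch of the FY-only variant might look like this. The sample data (including CCM and CCO) is made up for illustration, and `sort=False` is my addition so the rows keep their original order instead of being re-sorted by the key.

```python
import pandas as pd

# Illustrative data, including two terms that share a two-letter prefix
df = pd.DataFrame({'Abbrev': ['CCM', 'CCO', 'EMP', 'FY14', 'FY15', 'FY16', 'GP']})

# Grouping key: 'FY' for FY entries, the full string for everything else
is_fy = df['Abbrev'].str.startswith('FY')
key = df['Abbrev'].where(~is_fy, 'FY').rename('ix')

# Join each group with spaces; sort=False keeps first-appearance order
result = (
    df.groupby(key, sort=False)['Abbrev']
      .agg(' '.join)
      .reset_index(drop=True)
      .to_frame()
)
```

Here CCM and CCO stay on separate rows because their keys are their full strings, while the three FY entries collapse into one row.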

Hope it helps, and welcome to Stack Overflow!

Ernesto
  • This is super helpful! Thank you Ernesto! The only problem is that if there are other acronyms that start with the same 2 letters (e.g. CCO and CCM), it'll group these too. Is there any way to make the slice specific to "FY"? Or should I rather create a separate filtered dataframe for all the FY items, run your code on it, then replace the individual FY items with the grouped FY row? (Does this make sense, or am I overcomplicating it?) – dweir247 Aug 13 '19 at 07:43
  • You can replace the function you are using to create the index. I put an example of that in the answer! – Ernesto Aug 13 '19 at 08:31
  • This is exactly what I needed! Thanks Ernesto - much appreciated! – dweir247 Aug 13 '19 at 09:13