How to extract Uppercase word in a string in all rows of a column in a pandas dataframe?

Question

The dataset is attached. In the column named "transcription", I want to extract each uppercase word from the string in every row and make it a feature (column) of the dataframe, with the text that follows the uppercase word as the value for that row under that feature.

The expected output is a new column in the dataframe, named after each uppercase word found in the string, with each row holding the corresponding text as its value. I have tried my best to explain.

Dataset

Link to sample output (shown for the first 2 data points)

The question is very ambiguous. Please read [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) article about how to present what it is you are getting as an output and a sample of what you want the output dataframe to look like. — Ukrainian-serge, Feb 28 '20 at 23:25
@Ukrainian-serge Thanks for pointing that out. I have attached a link to a sample output for the first 2 data points. I hope it is clear this time. — Haseeb Ahmed Khan, Feb 29 '20 at 00:09
By sample I mean something we can copy and paste to build a df locally with `pd.read_clipboard()`. Also an example dataframe of what you want it to look like after it's worked on. — Ukrainian-serge, Feb 29 '20 at 00:11
I tried to display what my dataframe looks like in the first picture and in the second one my expected output. — Haseeb Ahmed Khan, Feb 29 '20 at 01:47
Your data is quite lengthy, so it is understandable that you could only share a link to the input dataset or an image. In this kind of situation, I usually create a small dummy dataframe that represents the original data and then post the question using that dummy dataframe. — Varsha, Feb 29 '20 at 02:24

score 1 · Accepted Answer · edited Dec 14 '20 at 17:09

Try this:

import pandas as pd

def cust_func(data):
    ## Split the transcription on the "," delimiter; the pieces are re-joined later.
    words = data.split(",")

    ## Collect the indices of pieces that are entirely uppercase and end with ":".
    ## strip() tolerates headers surrounded by spaces, e.g. " MEDICATIONS: ".
    column_idx = [i for i, word in enumerate(words)
                  if word.strip().endswith(":") and word.isupper()]

    ## The sentence for each uppercase word is everything between it and the
    ## next uppercase word (or the end of the string for the last one).
    ## Save each uppercase word and its sentence in a dict.
    result = {}
    for n, start in enumerate(column_idx):
        end = column_idx[n + 1] if n + 1 < len(column_idx) else len(words)
        result[words[start].strip()] = ",".join(words[start + 1:end])
    return pd.Series(result, dtype="object")  ## each key becomes a new column

## df is the asker's dataframe with a "transcription" column.
df = pd.concat([df, df.transcription.apply(cust_func)], axis=1)
df

The output looks like this (I couldn't capture all the columns in one screenshot):

The first line of the function is `words = data.split(",")`, but some of the uppercase words in the string are followed by ", " (comma and space). As a result, for some rows it did not do what it was supposed to do. [link](https://drive.google.com/open?id=1jv07-5lr6h0kezRhoNJdGWqGx7gDGnPU) — Haseeb Ahmed Khan, Mar 02 '20 at 17:56
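
To handle the comma-and-space case raised in the comment above, one option is to locate the headers with a regular expression instead of splitting on ",". The following is only a minimal sketch, not the answerer's method: the sample rows are made up for illustration, `HEADER` and `split_sections` are hypothetical names, and it assumes every header is an all-uppercase phrase that ends with ":" and sits at the start of the string or right after a comma.

```python
import re

import pandas as pd

# Made-up sample rows for illustration; not taken from the asker's dataset.
df = pd.DataFrame({
    "transcription": [
        "SUBJECTIVE:, This patient has allergies, worse in spring., MEDICATIONS: , Claritin, Zyrtec.",
        "HISTORY: ,Patient presents with knee pain.",
    ]
})

# Assumed header shape: start of string or a comma, optional spaces, an
# all-uppercase phrase, a colon, then an optional comma with optional spaces.
HEADER = re.compile(r"(?:^|,)\s*([A-Z][A-Z0-9 /&-]*?)\s*:\s*,?\s*")

def split_sections(text):
    """Map each uppercase header in `text` to the text that follows it."""
    matches = list(HEADER.finditer(text))
    result = {}
    for current, following in zip(matches, matches[1:] + [None]):
        # A section runs from the end of its header to the start of the next one.
        end = following.start() if following is not None else len(text)
        result[current.group(1)] = text[current.end():end].strip()
    return result

# One dict per row; headers that are missing from a row become NaN.
sections = pd.DataFrame(
    [split_sections(text) for text in df["transcription"]],
    index=df.index,
)
out = pd.concat([df, sections], axis=1)
print(out)
```

Building the new columns from a list of dicts keeps one row per transcription, and the header names come out without the trailing colon or surrounding spaces. Any uppercase phrase followed by ":" after a comma is treated as a header, so the pattern may need tightening for real data.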
