The dataset is attached. In the column named "transcription", I want to extract each uppercase word from the string in every row and make it a feature (column) of the dataframe, with the text that follows the uppercase word as that row's value under the feature.

The expected output is a new column in the dataframe named after each uppercase word found in a string, with each data point holding its corresponding value under that feature. I have tried my best to explain.

Dataset

Link to sample output (shown for the first 2 data points)

Current situation

Expected output to look like this

Haseeb Ahmed Khan
  • The question is very ambiguous. Please read [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) article about how to present what it is you are getting as an output and a sample of what you want the output dataframe to look like. – Ukrainian-serge Feb 28 '20 at 23:25
  • @Ukrainian-serge Thanks for pointing that out. I have attached a link to sample output for the first 2 data points. I hope it is clearer this time. – Haseeb Ahmed Khan Feb 29 '20 at 00:09
  • By sample I mean something we can copy and paste to build a df locally with `pd.read_clipboard()`. Also an example dataframe of what you want it to look like after it's worked on. – Ukrainian-serge Feb 29 '20 at 00:11
  • I tried to display what my dataframe looks like in the first picture and in the second one my expected output. – Haseeb Ahmed Khan Feb 29 '20 at 01:47
  • Your data is quite lengthy, so it is understandable that you could only share a link to the input dataset or an image. In this kind of situation, I usually create a small dummy dataframe that represents the original data and then post the question using the dummy dataframe. – Varsha Feb 29 '20 at 02:24
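
A small dummy dataframe of the kind suggested in the comments above might look like the sketch below. The two rows and the `expected` frame are hypothetical stand-ins invented for illustration (they are not taken from the attached dataset), and the layout "HEADING:, text" is an assumption about how the real column is formatted.

```python
import pandas as pd

# Hypothetical input: two short rows imitating the "HEADING:, text" layout.
df = pd.DataFrame({
    "transcription": [
        "SUBJECTIVE:, Patient reports a mild headache., PLAN:, Rest and fluids.",
        "HISTORY:, No prior surgery., PLAN:, Follow up in two weeks.",
    ]
})

# Hypothetical desired result: one column per uppercase heading,
# with a missing value where a row does not contain that heading.
expected = pd.DataFrame({
    "SUBJECTIVE": ["Patient reports a mild headache.", None],
    "HISTORY": [None, "No prior surgery."],
    "PLAN": ["Rest and fluids.", "Follow up in two weeks."],
})

print(df)
print(expected)
```

Posting both frames as code lets others rebuild them locally instead of retyping values from a screenshot.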

1 Answer

Try this:

import pandas as pd

def cust_func(data):
    # Rows without a transcription (e.g. NaN) produce no new columns.
    if not isinstance(data, str):
        return pd.Series(dtype=object)

    # Split the transcription on commas; the pieces are re-joined later.
    words = data.split(",")

    # Indices of pieces that are entirely uppercase and end with ":"
    # (surrounding whitespace is ignored).
    column_idx = [i for i, w in enumerate(words)
                  if w.strip().endswith(":") and w.isupper()]

    # For each heading, join the pieces up to the next heading (or up to the
    # end of the string for the last heading) and store heading -> text.
    result = {}
    for pos, start in enumerate(column_idx):
        end = column_idx[pos + 1] if pos + 1 < len(column_idx) else len(words)
        result[words[start].strip()] = ",".join(words[start + 1:end])
    return pd.Series(result, dtype=object)  # each key becomes a new column

df = pd.concat([df, df["transcription"].apply(cust_func)], axis=1)
df

The output looks like this (I couldn't capture all the columns in one screenshot):

[Two screenshots of the resulting dataframe]

Varsha
  • The first line of the code is `words = data.split(",")`, but some of the uppercase words in the string are followed by ", " (comma and space). As a result, it did not do what it was supposed to do for some of them. [link](https://drive.google.com/open?id=1jv07-5lr6h0kezRhoNJdGWqGx7gDGnPU) – Haseeb Ahmed Khan Mar 02 '20 at 17:56
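
One possible way around the delimiter problem raised in this comment is to locate the headings with a regular expression instead of splitting on commas. The sketch below is not the answer's method. It assumes a heading is a run of capital letters (optionally with spaces, "/" or "&") that ends in ":", and the names `HEADING` and `split_headings` as well as the sample rows are hypothetical. It drops the trailing colon from the column names and trims leading or trailing commas and spaces from the values.

```python
import re
import pandas as pd

# Assumed heading shape: capital letters, spaces, "/" or "&", ending in ":".
HEADING = re.compile(r"([A-Z][A-Z /&]*:)")

def split_headings(text):
    """Map each uppercase heading in `text` to the text that follows it."""
    if not isinstance(text, str):
        return pd.Series(dtype=object)  # e.g. NaN rows yield no columns
    # With a capturing group, re.split keeps the headings:
    # [prefix, heading1, body1, heading2, body2, ...]
    parts = HEADING.split(text)
    result = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        # Trim the colon from the name and stray commas/spaces from the value.
        result[heading.rstrip(":").strip()] = body.strip(" ,")
    return pd.Series(result, dtype=object)

# Hypothetical rows mixing "HEADING:, text" and "HEADING: text".
demo = pd.DataFrame({
    "transcription": [
        "SUBJECTIVE:, Has a cough. PLAN: Rest, fluids.",
        "SUBJECTIVE: No complaints.",
    ]
})
out = pd.concat([demo, demo["transcription"].apply(split_headings)], axis=1)
print(out)
```

Because the pattern only looks at the heading itself, it does not matter whether the heading is followed by a comma, a comma and a space, or just a space. Text such as an initial followed by a colon would also match, so the pattern may need tightening for the real data.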