
I have a CSV file in which I have already chunked one of the columns, and I want to put the results of my chunks into separate columns of the CSV file by converting them into a list. The line shown after the error message below keeps raising it:

IndexError: list index out of range

tag_count_df = pd.DataFrame(news['entityrecognition'].map(lambda x: Counter(tag[1] for tag in x)).to_list())

Below is my current code:

import nltk
import pandas as pd
from collections import Counter

news=pd.read_csv("news.csv")

news['tokenize'] = news.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)


news['pos_tags'] = news.apply(lambda row: nltk.pos_tag(row['tokenize']), axis=1)

news['entityrecog']=news.apply(lambda row: nltk.ne_chunk(row['pos_tags']), axis=1)

tag_count_df = pd.DataFrame(news['entityrecognition'].map(lambda x: Counter(tag[1] for tag in x)).to_list())

news=pd.concat([news, tag_count_df], axis=1).fillna(0).drop(['entityrecognition'], axis=1)

news.to_csv("news.csv")

A sample of my news.csv:

ID STORY
1  Washington, a police officer James...

The result I wanted:

ID  STORY                                       PERSON  NE   NP  NN VB  GE
1   Washington, a police officer James...        1      0    0   0   0   1
  • not too familiar with map and counter but should it not be tag[0] since you call only one column? – Kempie Mar 10 '20 at 09:01
  • @kempie the tag is there to count how many PERSON, VBZ, VB tags there are in the sentence – strawberrylatte Mar 10 '20 at 11:23
  • If I used tag[0] it would count the number of words instead, like washington = 1, a = 1, police = 1 – strawberrylatte Mar 10 '20 at 12:54
  • I think you will get a faster answer if you can include a sample of your input news.csv in a pandas ready format. Ps. Do you then still get the Error: IndexError: list index out of range? – Kempie Mar 10 '20 at 13:32
  • @kempie I edited my question, and yes, I still get the error – strawberrylatte Mar 10 '20 at 14:50
  • This is all a bit new to me, but it seems that nltk.ne_chunk returns tree structures. I'm not 100% sure what your expected outcome should look like for all scenarios, and I only tested it on the string "Washington, a police officer James", but it seems that retrieving the labels is what you are after. Try changing the code to: tag_count_df = pd.DataFrame(df['entityrecog'].map(lambda x: Counter(tag.label() for tag in x if type(tag) is nltk.Tree)).to_list()) – Kempie Mar 10 '20 at 17:53
  • Also, I guessed that the column name 'entityrecognition' should actually be 'entityrecog'. – Kempie Mar 10 '20 at 17:58
  • Those multiple `.apply` calls, each looping across the whole dataframe again, will end up as really slow code: https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe – alvas Mar 10 '20 at 22:16
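
To make the point from the comments concrete, here is a minimal sketch of why `tag[1]` can raise `IndexError` on `nltk.ne_chunk` output, and one possible way to count both entity labels and POS tags. The sample tree is built by hand rather than produced by `ne_chunk`, so its labels are only illustrative, and `count_tags` is a hypothetical helper name, not part of NLTK. The small fallback `Tree` class is a stand-in so the sketch still runs when NLTK is not installed.

```python
from collections import Counter

try:
    from nltk.tree import Tree  # the class that nltk.ne_chunk returns
except ImportError:
    # Stand-in (an assumption for this sketch): like nltk's Tree,
    # it is a list of children that also carries a label.
    class Tree(list):
        def __init__(self, label, children=()):
            super().__init__(children)
            self._label = label

        def label(self):
            return self._label


def count_tags(chunked):
    """Count entity labels for subtrees and POS tags for (word, tag) leaves."""
    counts = Counter()
    for node in chunked:
        if isinstance(node, Tree):
            counts[node.label()] += 1  # named-entity chunk, e.g. PERSON
        else:
            counts[node[1]] += 1       # plain (word, pos_tag) tuple
    return counts


# Hand-built tree shaped like ne_chunk output (labels are illustrative).
sample = Tree("S", [
    Tree("GPE", [("Washington", "NNP")]),
    (",", ","),
    ("a", "DT"),
    ("police", "NN"),
    ("officer", "NN"),
    Tree("PERSON", [("James", "NNP")]),
])

# A subtree with a single leaf has no element at index 1,
# so indexing every child with [1] fails on it.
try:
    [node[1] for node in sample]
    raised = False
except IndexError:
    raised = True

counts = count_tags(sample)
print(raised, dict(counts))
```

A list of such counters, one per row, can be passed to `pd.DataFrame(...)` and followed by `.fillna(0)` as in the question's code to get one column per tag.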

0 Answers