
I've been working on data classification as part of a research project. Since there are thousands of different values, I thought it best to use Python to simplify the process rather than going through each record and classifying it manually.

Basically, I have a dataframe in which one column is titled "description" and another is titled "codes". Each row in the "description" column contains a survey response about activities. The descriptions are all different but might contain some keywords. I have a list of some 40 codes to classify each row based on the text. I was thinking of manually creating some columns in the csv file and, in each column, typing a keyword corresponding to each of the codes. Then a loop (or a function with a loop) would go through each row of the dataframe and, if a substring matching any of the keywords is found, update the "codes" column with the code corresponding to that keyword.

My Dilemma

For example:

Suppose the list of codes is "Dance", "Nap", "Run", and "Fight", stored in a column of a separate dataframe. That dataframe, together with the manually entered keyword columns, is shown below (there can be more than two keyword columns, but I used two for illustration).

This dataframe is named "classes".

category  Keyword1  Keyword2
Dance     dance     danc
Nap       sleep     slept
Run       run       quick
Fight     kick      unch

The other dataframe is as follows with the "codes" column initially blank.

This dataframe is named "data".

description       codes
Iwasdancingthen
She Slept
He landed a kick
We are family

The function or loop will search through the "description" column above and check whether any of the keywords appear in a given row. If they do, the corresponding code is applied (as shown in the resulting dataframe below). If not, the row in the "codes" column is left blank. The loop should run as many times as there are keyword columns; it will run twice in this case since there are two keyword columns.

description       codes
Iwasdancingthen   Dance
She Slept         Nap
He landed a kick  Fight
We are family

FYI: The keywords don't actually have to be complete words. I'd like to use partial words too as you see above.

Also, the loop or function I want to make should ignore case (e.g. "Slept" should match the keyword "slept") and should still find keywords when words are run together without spaces (e.g. "Iwasdancingthen").
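To make the matching rule concrete, here is a minimal sketch of the check described above, assuming a plain substring test is enough. The helper name `contains_keyword` is hypothetical and only for illustration; it compares both strings after `str.casefold()`, so case is ignored and a keyword is found even when the words are run together.

```python
def contains_keyword(description, keyword):
    """Return True if keyword occurs anywhere in description, ignoring case.

    Hypothetical helper: a plain substring test, so partial words such as
    "danc" match, and spaces between words are not required.
    """
    return keyword.casefold() in description.casefold()


# Example rows and keywords taken from the tables above.
print(contains_keyword("Iwasdancingthen", "danc"))   # partial word, no spaces
print(contains_keyword("She Slept", "slept"))        # different capitalisation
print(contains_keyword("We are family", "kick"))     # keyword not present
```

Note that with this rule "dance" would not match "Iwasdancingthen" (the text contains "danci", not "dance"), which is why a partial keyword like "danc" is needed.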

I hope you understand what I'm trying to do.

What I tried:

At first, I tried using a dictionary and manipulating it somehow. I followed the advice here:

search keywords in dataframe cell

However, this didn't work too well as I had many "NaN" values pop up and it became too complicated, so I tried a different route using lists. The code I used was based on another user's advice:

How to conditionally update DataFrame column in Pandas

Here's what I did:

# Create lists from the classes dataframe
Keyword1list = classes["Keyword1"].values.tolist()
Category = classes["category"].values.tolist()

I then used the following loop for classification

for i in range(len(Keyword1list)):
    data.loc[data["description"] == Keyword1list[i] , "codes"] = Category[i]

However, the resulting output still gives me "NaN" in the "codes" column for every row. Also, I don't know how to loop over every single keyword column (in this case, loop over the two columns "Keyword1" and "Keyword2").
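One likely reason the loop above matches nothing is that `data["description"] == Keyword1list[i]` tests whether the whole description equals the keyword, not whether it contains it. A possible sketch, assuming substring matching is what is wanted, swaps the comparison for `Series.str.contains` with `case=False` and `regex=False`, and loops over every column whose name starts with "Keyword" (that naming convention is an assumption based on the example). The example rows are taken from the tables above, and the "codes" column is assumed to start as empty strings.

```python
import pandas as pd

# Example data rebuilt from the tables in the question.
classes = pd.DataFrame({
    "category": ["Dance", "Nap", "Run", "Fight"],
    "Keyword1": ["dance", "sleep", "run", "kick"],
    "Keyword2": ["danc", "slept", "quick", "unch"],
})
data = pd.DataFrame({
    "description": ["Iwasdancingthen", "She Slept",
                    "He landed a kick", "We are family"],
    "codes": ["", "", "", ""],
})

# Assumption: every keyword column is named "Keyword<something>".
keyword_columns = [c for c in classes.columns if c.startswith("Keyword")]

for col in keyword_columns:
    for category, keyword in zip(classes["category"], classes[col]):
        # Skip empty cells in the keyword columns.
        if pd.isna(keyword) or keyword == "":
            continue
        # Case-insensitive, literal (non-regex) substring test.
        hit = data["description"].str.contains(keyword, case=False, regex=False)
        # Only fill rows that have not been coded yet.
        data.loc[hit & (data["codes"] == ""), "codes"] = category

print(data)
```

Because the keyword-column loop is on the outside, a match in an earlier keyword column wins over a match in a later one, which is not quite the same thing as ranking by category.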

I'd really appreciate it if anyone could help me with a function or loop that works. Thanks in advance!

Edit: It was pointed out to me that some descriptions might contain multiple keywords. I forgot to mention that the codes in the "classes" dataframe are ordered by rank so that the ones that appear first on the dataframe should take priority; for example, if both "dance" and "nap" are in a description, the code listed higher in the "classes" dataframe (i.e. dance) should be selected and inputted into the "codes" column. I hope there's a way to do that.
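The priority rule in this edit could be sketched as a "first match wins" search over the codes in rank order. This is only one way to do it, in plain Python; the function name `first_matching_code` and the `rules` list are hypothetical, with `rules` assumed to hold (code, keywords) pairs in the same order as the rows of "classes".

```python
def first_matching_code(description, rules):
    """Return the highest-ranked code whose keywords occur in description.

    rules is a list of (code, keywords) pairs ordered by priority;
    matching is a case-insensitive substring test. Returns "" if
    nothing matches.
    """
    text = description.lower()
    for code, keywords in rules:
        if any(keyword.lower() in text for keyword in keywords):
            return code
    return ""


# Same order as the "classes" dataframe, so "Dance" outranks "Nap", etc.
rules = [
    ("Dance", ["dance", "danc"]),
    ("Nap", ["sleep", "slept"]),
    ("Run", ["run", "quick"]),
    ("Fight", ["kick", "unch"]),
]

print(first_matching_code("She danced and then slept", rules))  # both Dance and Nap keywords present
print(first_matching_code("We are family", rules))              # no keyword present
```

With pandas, something like `data["codes"] = data["description"].apply(first_matching_code, rules=rules)` should fill the column in one pass, since `Series.apply` forwards extra keyword arguments to the function.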

  • Part of the problem is that DataFrames aren't a good fit since there can be multiple codes for each description and multiple keywords for each code. – SargeATM Aug 20 '22 at 06:02
  • I have also ranked the codes so that the ones that appear first on the dataframe should take priority; for example, if both "dance" and "nap" are in a description, the code listed higher in the "classes" dataframe should be selected as the code. I would hope there's a way to do that. As for the multiple keywords, that's why I'm trying to use a loop. – Turnipsavocados Aug 20 '22 at 06:07
