
I downloaded the CSV file cleaned_depression_vs_suicide.csv from Kaggle to practice classification with different models. The rest of my code works fine, and I dropped the rows where the title or body of the Reddit post is null by using notnull() (df1 is the DataFrame loaded from the CSV file):

df1 = df1[df1["Body"].notnull()]
df1 = df1[df1["Title"].notnull()]

However, a KeyError is raised on certain rows when the following loop runs:

for i in range(0, nRow - 1):
    result1 = preprocessing(df1["Title"][i])
    result2 = preprocessing(df1["Body"][i])
    df2.loc[i, 'cleaned_text'] = result1 + " " + result2

My preprocessing methods:

Given a string s, it pre-processes s and returns the updated features and a bag-of-words representation. The steps are:

  • split the text
  • changes to lower-case
  • remove punctuation
  • tokenize s using nlp (en_core_web_sm) to create a doc
  • update features with lemmas encountered in s
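The steps above could be sketched roughly as follows. This is not the asker's actual function, only a minimal standard-library approximation: it returns a cleaned string (an assumption, based on how the results are concatenated in the loop), whitespace splitting stands in for the spaCy tokenizer, and the `lemmatize` parameter is a hypothetical hook that defaults to doing nothing.

```python
import string
from typing import Callable


def preprocessing(s: str, lemmatize: Callable[[str], str] = lambda tok: tok) -> str:
    """Lower-case s, strip punctuation, split it, and lemmatize each token."""
    # Lower-case the text and drop ASCII punctuation characters.
    cleaned = s.lower().translate(str.maketrans("", "", string.punctuation))
    # Whitespace splitting is a stand-in for tokenizing with spaCy.
    tokens = cleaned.split()
    # `lemmatize` is a hypothetical hook; the default leaves tokens unchanged.
    return " ".join(lemmatize(tok) for tok in tokens)


print(preprocessing("Hello, World!"))
```

With spaCy, the lemma of each token would typically come from `token.lemma_` on the doc produced by the loaded `en_core_web_sm` pipeline, which could be plugged in where `lemmatize` is used here.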

The error message told me which row raised the KeyError, and I checked that row (screenshot not available).

A normal row should look like this (screenshot not available).

A few rows looked exactly like this before I did anything, and I fixed them by editing the dataset manually. After each fix, the rows from 0 up to that line are processed normally and the model accuracy, etc. is displayed. Why does this happen, and how can I fix it without going through the CSV file row by row?

desertnaut
HappyDuppy
  • Please [do not post images](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question) in your questions. We can’t access that data so it’s not useful for anyone to help you debug. Try and [make a small example that you can post here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) that shows one of your `KeyError` issues. – Cimbali Jun 12 '21 at 16:09
  • Also, the traceback could hold valuable information, if you could show it. – Trevis Jun 12 '21 at 17:15
  • Please do not describe what you have done verbally; post a [mcve], which should be straightforward, since you are using a publicly available dataset (that is, if you include the link to the file, too). – desertnaut Jun 12 '21 at 17:34
  • I agree with the above best practices, but I have a hunch that in this case, you're getting bitten by pandas indexing. Try `df1["Title"].iloc[i]`, etc. if you really want to iterate by numerical index. Or you can iterate over the series data instead: `df1['cleaned_text'] = [preprocessing(t) + ' ' + preprocessing(b) for (t,b) in zip(df1['Title'], df1['Body'])]`. – zgana Jun 12 '21 at 18:22
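
The indexing hunch in the last comment can be reproduced on a tiny made-up frame. The sketch below is not the Kaggle data; the three rows are invented for illustration. It shows that filtering with notnull() keeps the original index labels, so a label that was filtered out raises KeyError, and it shows three possible ways around that.

```python
import pandas as pd

# Invented stand-in for the real CSV: row 1 has a null Title.
df1 = pd.DataFrame({
    "Title": ["first", None, "third"],
    "Body": ["a", "b", "c"],
})
df1 = df1[df1["Body"].notnull()]
df1 = df1[df1["Title"].notnull()]

# Filtering keeps the original labels, so the index now has a gap.
print(list(df1.index))  # [0, 2]

# df1["Title"][i] looks up the *label* i, and label 1 no longer exists.
try:
    df1["Title"][1]
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# Option 1: index by position instead of by label.
by_position = [df1["Title"].iloc[i] for i in range(len(df1))]

# Option 2: renumber the rows 0..n-1 after filtering.
df_reset = df1.reset_index(drop=True)
by_label = [df_reset["Title"][i] for i in range(len(df_reset))]

# Option 3: iterate over the values directly, with no index at all.
combined = [t + " " + b for t, b in zip(df1["Title"], df1["Body"])]
print(combined)  # ['first a', 'third c']
```

This would also fit the observation that manually repairing a row makes the loop run up to the next bad row. Separately, if nRow is the number of rows, `range(0, nRow - 1)` stops one row early, since `range` already excludes its upper bound.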
