Wrong display of csv content approximately once every 10000 rows

Question

I downloaded the CSV file cleaned_depression_vs_suicide.csv from Kaggle to practice classification with different models. The rest of my code works, and I dropped the rows whose Title or Body is null by using notnull() (df1 is the DataFrame loaded from the CSV file):

df1 = df1[df1["Body"].notnull()]
df1 = df1[df1["Title"].notnull()]

However, a KeyError is raised on certain rows when the following loop runs:

for i in range(0, nRow - 1):
    result1 = preprocessing(df1["Title"][i])
    result2 = preprocessing(df1["Body"][i])
    df2.loc[i, 'cleaned_text'] = result1 + " " + result2

My preprocessing function:

Given a string s, pre-process s and return the updated features and bag-of-words representation. The steps are:

split the text
change to lower-case
remove punctuation
tokenize s using nlp (en_core_web_sm) to create a doc
update features with lemmas encountered in s
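
As a rough illustration of the steps listed above, here is a minimal sketch. The original preprocessing() is not shown in the question, so the name preprocessing_sketch, its string return value, and the step order are assumptions. The spaCy steps (tokenizing with en_core_web_sm and collecting lemmas) are deliberately left out so the example runs without a downloaded model.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocessing_sketch(s):
    """Hypothetical stand-in for the asker's preprocessing().

    Covers only lower-casing, punctuation removal and whitespace
    splitting; the spaCy lemmatization step is NOT reproduced here.
    """
    lowered = s.lower()                        # change to lower-case
    no_punct = lowered.translate(_PUNCT_TABLE)  # remove punctuation
    tokens = no_punct.split()                  # split on whitespace
    # Assumption: the result is a string, since the loop in the
    # question concatenates two results with " ".
    return " ".join(tokens)


cleaned = preprocessing_sketch("Hello, World!  I feel... tired.")
```

Because the function only touches the string it is given, it cannot by itself explain a KeyError that is raised while looking up df1["Title"][i].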

The error message told me which row raised the KeyError, and I checked it:

A normal row should look like this:

A few rows looked exactly like the failing one before I did anything to the data, so I edited the dataset manually. Once I change such a row, every row from 0 up to that line is processed normally and the model accuracy, etc. is displayed. What causes this, and how can I fix it without reading through the CSV file row by row?

Please [do not post images](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question) in your questions. We can’t access that data so it’s not useful for anyone to help you debug. Try and [make a small example that you can post here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) that shows one of your `KeyError` issues. — Cimbali, Jun 12 '21 at 16:09
Also, the traceback could hold valuable information, if you could show it. — Trevis, Jun 12 '21 at 17:15
Please do not describe what you have done verbally; post a [mcve], which should be straightforward, since you are using a publicly available dataset (that is, if you include the link to the file, too). — desertnaut, Jun 12 '21 at 17:34
I agree with the above best practices, but I have a hunch that in this case, you're getting bitten by pandas indexing. Try `df1["Title"].iloc[i]`, etc. if you really want to iterate by numerical index. Or you can iterate over the series data instead: `df1['cleaned_text'] = [preprocessing(t) + ' ' + preprocessing(b) for (t,b) in zip(df1['Title'], df1['Body'])]`. — zgana, Jun 12 '21 at 18:22
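
The indexing hunch in the last comment can be reproduced on a tiny frame. The sketch below uses made-up data, not the Kaggle file, and a hypothetical one-line preprocessing() stand-in. It shows that filtering with notnull() keeps the original index labels, so df1["Title"][i] looks up a label that may no longer exist. Whether this is the actual cause in the question cannot be confirmed without the traceback.

```python
import pandas as pd

# Made-up stand-in for the CSV: rows 1 and 2 each have a missing value.
df1 = pd.DataFrame({
    "Title": ["a", None, "c", "d"],
    "Body": ["w", "x", None, "z"],
})

df1 = df1[df1["Body"].notnull()]
df1 = df1[df1["Title"].notnull()]

# The filter dropped rows 1 and 2 but did not renumber the survivors.
labels_left = list(df1.index)

# df1["Title"][1] is a lookup by label, and label 1 is gone.
try:
    df1["Title"][1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Option 1: renumber the rows so labels 0..n-1 exist again.
df1 = df1.reset_index(drop=True)


def preprocessing(s):
    # Hypothetical stand-in; the real function is not shown in the question.
    return s.lower().strip()


# Option 2 (from the comment): iterate over the values, not over labels.
df1["cleaned_text"] = [
    preprocessing(t) + " " + preprocessing(b)
    for t, b in zip(df1["Title"], df1["Body"])
]
```

Either option alone avoids the label lookup; using .iloc[i] inside the original loop is a third way to get positional access.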
