I need to process each row of the data frame with the following steps:

1. Split the clean_question column of the row on the space character (" "), and assign the result to split_question.
2. Remove any words in split_question that are less than 6 characters long.
3. Set match_count to 0.
4. Loop through each word in split_question.
5. If the word occurs in terms_used, add 1 to match_count.
6. Add each word in split_question to terms_used using the add method on sets.
7. If the length of split_question is greater than 0, divide match_count by the length of split_question.
8. Append match_count to question_overlap.

Here is the code I wrote:

for index, series in jeopardy.iterrows():
    match_count = 0
    split_question = series.clean_question.split(' ')
    for i in split_question:
        if len(i) < 6:
            split_question.remove(i)
    for i in split_question:
        if i in terms_used:
            match_count += 1
        terms_used.add(i)
    if len(split_question) > 0 :
        question_overlap.append(match_count/len(split_question))

However, the example solution produces a different mean value from mine. The example is:

for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        split_question = [q for q in split_question if len(q) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)

I have spent a lot of time trying to find the bug, without success. Could someone point out why the two results differ? Thanks!

Note: the mean produced by my code above is:

np.mean(question_overlap) 
0.8031111701203273

But the right answer is:

0.69087373156719623
leo022
  • Could you provide example input and output, please? See https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples and [mcve]. – Jon Clements Sep 01 '18 at 09:26
  • The mean produced by my code above is np.mean(question_overlap) = 0.8031111701203273, but the right answer is 0.69087373156719623. – leo022 Sep 01 '18 at 12:52
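
As a small reproducible illustration, the sketch below runs both loops on a made-up list of three question strings standing in for the clean_question column. The helper names overlap_remove_in_loop and overlap_filter_first and the sample questions are hypothetical and not taken from the jeopardy data. On this toy input the two loops diverge in two visible ways: calling remove() on a list while iterating over it skips the element after each removed one, so some short words survive, and appending only when split_question is non-empty leaves out the zero entries that the example code records. Whether these account for the whole gap between 0.803 and 0.691 cannot be confirmed without the real data.

```python
def overlap_remove_in_loop(questions):
    """Mirror of the first loop: remove() during iteration, append only if non-empty."""
    terms_used = set()
    question_overlap = []
    for question in questions:
        match_count = 0
        split_question = question.split(" ")
        for word in split_question:
            if len(word) < 6:
                # Mutating the list being iterated shifts later items left,
                # so the item right after a removed one is never examined.
                split_question.remove(word)
        for word in split_question:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        if len(split_question) > 0:
            question_overlap.append(match_count / len(split_question))
    return question_overlap


def overlap_filter_first(questions):
    """Mirror of the example loop: filter into a new list, always append."""
    terms_used = set()
    question_overlap = []
    for question in questions:
        split_question = [w for w in question.split(" ") if len(w) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)
    return question_overlap


# Hypothetical sample input, not real jeopardy rows.
questions = [
    "in the history of science",
    "who",
    "the science of the planets",
]

# The skipping effect in isolation: "the" survives the removal loop.
words = questions[0].split(" ")
for w in words:
    if len(w) < 6:
        words.remove(w)
print(words)  # ['the', 'history', 'science']

first = overlap_remove_in_loop(questions)
second = overlap_filter_first(questions)
print(len(first), sum(first) / len(first))     # 2 entries, mean about 0.333
print(len(second), sum(second) / len(second))  # 3 entries, mean about 0.167
```

In this toy run the first loop yields two entries and the second yields three, because the question "who" becomes an empty list and is recorded as 0 only by the second loop. The surviving short word "the" also counts as a match in the third question for the first loop, which raises that question's ratio from 1/2 to 2/3.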
