I have a column, test_df['review_id'], that holds the id of each row in the dataframe. I need to pair each id with values from other arrays. My code looks like this:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def classify_nb_report(X_train_vectorized, y_train, X_test_vectorized, y_test):
    clf = MultinomialNB()

    # TRAIN THE CLASSIFIER WITH AVAILABLE TRAINING DATA
    clf.fit(X_train_vectorized, y_train)

    y_pred_class = clf.predict(X_test_vectorized)

    return y_pred_class

for i in range(0, n_loop):
    train_df, test_df = train_test_split(df, test_size=0.3)
    ....
    nb_y = classify_nb_report(X_train_vectorized, y_train, X_test_vectorized, y_test)

As you can see above, in each iteration I get a new nb_y, which is a numpy array. I also get a different test_df and train_df each time, since train_test_split chooses them randomly. I want to pair each value of nb_y from each iteration with the matching id in test_df['review_id'].

With the following code, I can print each id in test_df next to its value from nb_y.

for f, b in zip(test_df['review_id'], nb_y):
    print(f, b)

Result:

17377 5.0
18505 5.0
24825 1.0
16032 5.0
23721 1.0
18008 5.0

Now, what I want is to take the result above and, in each later iteration, append the new nb_y values to their corresponding ids.

I hope this is not too confusing; I will expand on it if my question is not clear enough. Thanks in advance.

catris25
  • I think you could use a dictionary of lists for your problem. The key would be the id and the list would include all the nb_y values. I am not sure if this is what you want or if what I'm saying is clear. I could write down a detailed answer later if it is needed. – MattSt May 10 '18 at 11:41
  • @MattSt yeah, I was also thinking of using dictionary too. But here in each iteration, there'll always be new values added from the `nb_y` to the corresponding ids. And I am not sure how to modify dictionary in each loop like that. – catris25 May 10 '18 at 12:11
  • You should append to the dictionary if the id is already in dictionary.keys(). Otherwise you should add a list with the first nb_y element (e.g. dictionary[id] = [nb_y]). I could write the code for you in an answer, is this what you want though? It is not clear to me. – MattSt May 10 '18 at 12:39
  • @MattSt I have the ids in `test_df['review_id']` as I mentioned in the original post. Sure, just post the code, and I will see it. – catris25 May 10 '18 at 12:43

2 Answers

0

I am not sure I understand the problem correctly or how the rest of your code works, but I think the following code might do what you need. Let me know whether it works or if something is wrong with the answer.

dictionary = {}
for i in range(0, n_loop):
    train_df, test_df = train_test_split(df, test_size=0.3)
    ....
    nb_y = classify_nb_report(X_train_vectorized, y_train, X_test_vectorized, y_test)
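    # test_df['review_id'] is a Series (unhashable), so pair each id with its own prediction
    for review_id, pred in zip(test_df['review_id'], nb_y):
        if review_id not in dictionary:
            dictionary[review_id] = [pred]
        else:
            dictionary[review_id].append(pred)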
MattSt
  • I tried your code, but I am not really sure how an empty `dictionary` will work. Thanks for the effort though, I really appreciate it. I have come up with my own solution above. – catris25 May 10 '18 at 17:04
  • When you write dictionary[id] = [nb_y] the dictionary element with key id will be initialized. You do not have to initialize the keys from the beginning for the dictionary (at least in python 3.6.4). I had a bug in my if statement though which I just edited. – MattSt May 10 '18 at 17:50
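To illustrate the point about starting from an empty dictionary, here is a minimal sketch. The (review_id, prediction) pairs are made-up values, not output from the real classifier, and dict.setdefault is used in place of the explicit if/else check.

```python
dictionary = {}  # starts empty; keys are created on first use

# Hypothetical (review_id, prediction) pairs collected over two iterations.
pairs = [(17377, 5.0), (18505, 5.0), (17377, 1.0)]

for review_id, pred in pairs:
    # setdefault returns the existing list for this id,
    # or inserts a new empty list and returns that.
    dictionary.setdefault(review_id, []).append(pred)

print(dictionary)
```

An id seen in several iterations ends up with several values in its list, while an id seen once has a one-element list.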
0

After referring to this and this, I finally came up with my own solution. I turned the code above into something like this.

def classify_nb_report(X_train_vectorized, y_train, X_test_vectorized, y_test):
    clf = MultinomialNB()

    # TRAIN THE CLASSIFIER WITH AVAILABLE TRAINING DATA
    clf.fit(X_train_vectorized, y_train)

    y_pred_class = clf.predict(X_test_vectorized)

    return y_pred_class


nb_y_list = []

for i in range(0, n_loop):
    train_df, test_df = train_test_split(df, test_size=0.3)
    ....
    nb_y = classify_nb_report(X_train_vectorized, y_train, X_test_vectorized, y_test)

    nb_y_list.extend([list(x) for x in zip(test_df['review_id'],nb_y)])

from collections import defaultdict

# Group every prediction under its review id
dd = defaultdict(list)
for key, val in nb_y_list:
    dd[key].append(val)

print(dd)

Basically, I first made an empty list called nb_y_list. In each iteration, I zip the ids from test_df['review_id'] with the values from nb_y and extend nb_y_list with the resulting pairs. After all the loops are finished, I have the complete list, which I then convert to a dictionary of lists using defaultdict(list).
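The grouping step can also be tried on its own, without the classifier. The sketch below is only an illustration: the `iterations` list is a hypothetical stand-in for the (test_df['review_id'], nb_y) pair produced by each loop, and it appends straight into the defaultdict rather than building nb_y_list first.

```python
from collections import defaultdict

# Hypothetical stand-ins for (test_df['review_id'], nb_y) from three iterations.
iterations = [
    ([17377, 18505, 24825], [5.0, 5.0, 1.0]),
    ([18505, 16032, 17377], [5.0, 5.0, 1.0]),
    ([24825, 17377, 23721], [1.0, 5.0, 1.0]),
]

predictions_by_id = defaultdict(list)
for review_ids, nb_y in iterations:
    # Pair each id with its prediction and append it under that id.
    for review_id, pred in zip(review_ids, nb_y):
        predictions_by_id[review_id].append(pred)

result = dict(predictions_by_id)  # plain dict for display
print(result)
```

Because each test split is random, the lists can have different lengths: an id only gains a value in the iterations where it lands in the test set.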

catris25