I have a pandas dataframe correct_X_test with a single column, review, that contains reviews. I need to add two new columns that hold parts of each review, as described below:

For one review, review = 'x1 x2 x3 x x x xi x x x xn', I need to store sub_review_1_i = 'x1 x2 x3 x x x xi' and sub_review_i_n = 'xi x x x xn' for each i from 1 to n.

I extract the two strings using this code:

for j in correct_y_test.index:
  input_list=correct_X_test["review"][j].split()
  for i in range(len(input_list)):
    # Build the sequence from x1 to xi
    sub_list_1_i=input_list[:i+1]
    sub_str_1_i = ""
    for ele in sub_list_1_i:
      sub_str_1_i += ele + " "
    # Build the sequence from xi to xn
    sub_list_i_n=input_list[i:]
    sub_str_i_n = ""
    for ele in sub_list_i_n:
      sub_str_i_n += ele + " "

But I don't see how to store this in the dataframe, because each review produces several rows (one per value of i) and 2 columns. Any ideas, please?

  • Please add a proper [MRE](https://stackoverflow.com/help/minimal-reproducible-example) (also look [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)) to your question that replicates your problem. – Timus Feb 14 '23 at 10:26
  • @user19077881 The split method uses whitespace by default, and my code does this correctly: it splits word by word, not character by character. I just need to store the result in the dataframe. Thank you – SLA Feb 14 '23 at 10:30

1 Answer

The way I see it, you have two options:

Option 1: store the sub-reviews as lists

In this option, for every "review", you create two lists: one to store the values of sub_str_1_i and another for the values of sub_str_i_n. Then you add those lists as new columns in the corresponding row. Here's an example:

import pandas as pd

# == Create some dummy data ====================================================
correct_X_test = pd.DataFrame({"review": ["This is a review",
                                          "This is another review",
                                          "This is a third review"]})

# == Solution 1 ================================================================
correct_X_test['1_i'] = None
correct_X_test['i_n'] = None

for j, row in correct_X_test.iterrows():
    input_list = row["review"].split()
    sub_list_1_i, sub_list_i_n = [], []
    for i in range(len(input_list)):

        # Build the sequence from x1 to xi
        sub_str_1_i = " ".join(input_list[:i+1])
        
        # Build the sequence from xi to xn
        sub_str_i_n = " ".join(input_list[i:])

        sub_list_1_i.append(sub_str_1_i)
        sub_list_i_n.append(sub_str_i_n)

    correct_X_test.loc[j, '1_i'] = sub_list_1_i
    correct_X_test.loc[j, 'i_n'] = sub_list_i_n

print(correct_X_test)
# Prints:
#
#                    review                                                1_i  \
# 0        This is a review       [This, This is, This is a, This is a review]   
# 1  This is another review  [This, This is, This is another, This is anoth...   
# 2  This is a third review  [This, This is, This is a, This is a third, Th...   

#                                                  i_n  
# 0  [This is a review, is a review, a review, review]  
# 1  [This is another review, is another review, an...  
# 2  [This is a third review, is a third review, a ...  
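As a possible variation on Option 1, the loop over rows can be replaced by a small helper applied to the review column. This is only a sketch: the helper name split_prefixes_suffixes is made up for illustration, and the dummy data is a shortened version of the example above.

```python
import pandas as pd

def split_prefixes_suffixes(review):
    """Return (prefixes, suffixes) for a whitespace-separated review.

    Hypothetical helper, not part of pandas.
    """
    words = review.split()
    # "x1 .. xi" for every i
    prefixes = [" ".join(words[:i + 1]) for i in range(len(words))]
    # "xi .. xn" for every i
    suffixes = [" ".join(words[i:]) for i in range(len(words))]
    return prefixes, suffixes

correct_X_test = pd.DataFrame({"review": ["This is a review",
                                          "This is another review"]})

# One (prefixes, suffixes) tuple per row
pairs = correct_X_test["review"].apply(split_prefixes_suffixes)

# Unpack the tuples into two object columns holding lists
correct_X_test["1_i"] = pairs.map(lambda p: p[0])
correct_X_test["i_n"] = pairs.map(lambda p: p[1])

print(correct_X_test)
```

The resulting columns hold the same lists as in the loop version, so they can be exploded in the same way.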

Option 2: create new rows for every combination of sub_str_1_i and sub_str_i_n

In this option, each combination of sub_str_1_i and sub_str_i_n is stored as a new row in the dataframe. You can use the method pd.DataFrame.explode to convert the output from Option 1 into new rows (exploding several columns at once requires pandas 1.3 or later):

correct_X_test.explode(['i_n', '1_i'])
# Returns:
#
#                    review                     1_i                     i_n
# 0        This is a review                    This        This is a review
# 0        This is a review                 This is             is a review
# 0        This is a review               This is a                a review
# 0        This is a review        This is a review                  review
# 1  This is another review                    This  This is another review
# 1  This is another review                 This is       is another review
# 1  This is another review         This is another          another review
# 1  This is another review  This is another review                  review
# 2  This is a third review                    This  This is a third review
# 2  This is a third review                 This is       is a third review
# 2  This is a third review               This is a          a third review
# 2  This is a third review         This is a third            third review
# 2  This is a third review  This is a third review                  review
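If the list columns from Option 1 are not needed, the long format could also be built directly by collecting one record per (review, i) pair. The sketch below assumes the same kind of dummy data; the extra column "i" and the names records and long_df are illustrative additions, not part of the original question.

```python
import pandas as pd

correct_X_test = pd.DataFrame({"review": ["This is a review",
                                          "This is another review"]})

records = []
for review in correct_X_test["review"]:
    words = review.split()
    for i in range(len(words)):
        records.append({
            "review": review,
            "i": i + 1,                      # 1-based position of xi (illustrative column)
            "1_i": " ".join(words[:i + 1]),  # x1 .. xi
            "i_n": " ".join(words[i:]),      # xi .. xn
        })

# One row per (review, i) pair, with a fresh 0..N-1 index
long_df = pd.DataFrame(records)
print(long_df)
```

Unlike the explode output, where the original index label is repeated for every sub-review, this frame gets a new consecutive index. Calling reset_index(drop=True) on the exploded frame would give a similar result.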
Ingwersen_erik