I'm trying to implement code that analyzes a dataframe row by row. It looks at the sentence in each row and applies a bag-of-words approach to create new columns to be used as features for regression analysis.
I've managed to generate the features for each row, but I'm having a hard time keeping them aligned with the row of the dataframe that I called apply on.
Take a look at this sample:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import display  # already available in a Jupyter notebook
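# Function I'm trying to create
vectorizer = CountVectorizer()

def per_row(row):
    corpus = [row['bb']]  # one-document corpus: this row's sentence
    index = row.index.values  # note: these are the column labels ('aa', 'bb'); the row label is row.name
    # Refitting on every call means each row gets its own vocabulary
    bag_of_words = vectorizer.fit_transform(corpus)
    feature_names = vectorizer.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
    display(pd.DataFrame(bag_of_words.toarray(), columns=feature_names))
    # print(corpus, type(corpus))
    # return pd.DataFrame(bag_of_words.toarray(), columns=feature_names)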
# Create data frame
a = pd.DataFrame([['a','Fast_Food,Budget_Friendly,Pasta'],
                  ['b','Fast_Food,Asean,Pasta']
                 ],columns=['aa','bb'])
display('initial df',a)
# So far this is the only approach I can think of to add new columns,
# but how can I make it dynamic, in the sense that
# the output df is joined onto the original df (a)?
a.apply(per_row, axis=1)
# This is my desired outcome after the script runs:
# the classifications in each row are turned into
# dummy variables/features for use in regression.
desired_outcome = pd.DataFrame([['a','Fast_Food,Budget_Friendly,Pasta',1,1,1,0],
                                ['b','Fast_Food,Asean,Pasta',0,1,1,1]
                               ],
                               columns=['aa','bb','budget_friendly','fast_food','pasta','asean'])
desired_outcome
I need help fixing the per_row function so that any new features created by the bag-of-words vectorizer are joined back onto the original dataframe.
If there's a package that can perform the desired process, that would be preferred too. Thanks in advance.