So my task is to clean up the data in a giant table, and create a linear regression model with the data. I noticed a problem that one of the columns store a bunch of tags in a string. The following snippet is two elements in that column.
"TV,Internet,Wifi,Kitchen, Pets live on this property ,Cat(s),Elevator, Free street parking ,Heating,Washer,Dryer, Smoke alarm , Carbon monoxide alarm , First aid kit , Fire extinguisher ,Essentials,Hangers, Hair dryer ,Iron, Laptop-friendly workspace , translation missing: en.hosting_amenity_49 , translation missing: en.hosting_amenity_50 , Hot water , Bed linens , Extra pillows and blankets ,Microwave, Coffee maker ,Refrigerator, Dishes and silverware , Cooking basics ,Oven,Stove, Long term stays allowed .",
"TV,Internet,Wifi,Kitchen, Pets live on this property ,Dog(s), Free street parking , Buzzer/wireless intercom ,Heating, Family/kid friendly ,Washer,Dryer, Smoke detector , Carbon monoxide detector , Fire extinguisher , Hair dryer ,Iron, Laptop friendly workspace , translation missing: en.hosting_amenity_50 , Self check-in ,Lockbox, Private living room , Hot water ,Microwave, Coffee maker ,Refrigerator,Dishwasher, Dishes and silverware , Cooking basics ,Oven,Stove, Patio or balcony , Garden or backyard , Baking sheet , Trash can ."
I think it would be very inefficient for the model to learn this huge paragraph, I think the better way to do this is to store each tag as an element in a list or an array, and put the column of list/array into the model. This is my code for it:
library("stringr")
train['sorted_amenities'] <- NA
# add an empty column to the table
last_column = c()
for (j in 1:nrow(train)){
splitted_x <- strsplit(as.character(train[j, 'amenities']), split = ',')
# split the string into array/list of words, separated by comma
trimmed_x = c()
# initialize an empty array/list
for (i in 1:length(splitted_x[[1]])){ # for each row in the column
temp <- gsub("\\.", "", splitted_x[[1]][i])
# get rid of the period sign that might be at the end of the string
trimmed_x[i] <- str_trim(temp)
# get rid of spaces that might be and the start/end of the string, and store into new array.list
}
last_column[j] <- list(trimmed_x)
# add the each list to each element of last_column
}
train$sorted_amenities <- last_column
# put the list of lists into the empty new column of the table
After I did that, the new column in the table looks like this (only showing two elements as example):
c("TV", "Wifi", "Air conditioning", "Kitchen", "Heating",
"Washer", "Dryer", "Smoke detector", "Carbon monoxide detector",
"Fire extinguisher", "Essentials", "Hangers", "Hair dryer",
"Self check-in", "Keypad", "Private entrance"), c("TV",
"Internet", "Wifi", "Kitchen", "Pets live on this property",
"Dog(s)", "Free street parking", "Buzzer/wireless intercom",
"Heating", "Family/kid friendly", "Washer", "Dryer",
"Smoke detector", "Carbon monoxide detector", "Fire extinguisher",
"Hair dryer", "Iron", "Laptop friendly workspace", "translation missing: enhosting_amenity_50",
"Self check-in", "Lockbox", "Private living room", "Hot water",
"Microwave", "Coffee maker", "Refrigerator", "Dishwasher",
"Dishes and silverware", "Cooking basics", "Oven", "Stove",
"Patio or balcony", "Garden or backyard", "Baking sheet",
"Trash can")
So when I try to add that into the model parameter:
model = lm(price ~ host_name + sorted_amenities + host_verifications, data = train)
I get an error saying that the input cannot be lists.
Error in model.frame.default(formula = price ~ neighbourhood_group_cleansed + : invalid type (list) for variable 'sorted_amenities'
However, there is actually a column in the table that looks like a column of string lists, that looks like this:
"['email', 'phone', 'jumio', 'kba', 'selfie', 'government_id', 'identity_manual', 'work_email']",
"['email', 'phone', 'reviews', 'jumio', 'government_id']"
I tried to make my new column the same format as this one, so I try to get the type of one object in this column:
id_914965 <- train[train$id == '279053',]
haha <- id_914965$host_verifications
print(haha)
print(typeof(haha))
And the printed result is an "integer".
[1] ['email', 'phone', 'reviews', 'jumio', 'government_id']
516 Levels: ['email', 'facebook', 'google', 'reviews', 'offline_government_id', 'selfie', 'government_id', 'identity_manual'] ...
[1] "integer"
I don't really get this, how can it be an integer? Am I not getting its type correctly?
I also followed some similar questions on Stack Overflow, like using "unlist" method, but in my case it did not work either.
And how can I put the list in to the param list? Am I dealing with the data correctly?
Thank you all in advance.
MRE:
minimal dataset:
structure(list(id = c(914965L, 262763L, 279053L), name = structure(1:3, .Label = c("<BRAND NEW BUILDING! 2 BR 1. 5 BA in Trendy Bushwick>",
"<Brooklyn Brownstone parlor living with full A/C>", "<Posh one bedroom apartment in Brooklyn>"
), class = "factor"), host_name = structure(c(1L, 3L, 2L), .Label = c("Andrea",
"Foster", "Jeremy"), class = "factor"), host_location = structure(c(2L,
1L, 1L), .Label = c("New York, New York, United States", "US"
), class = "factor"), host_response_time = structure(c(2L, 3L,
1L), .Label = c("within a day", "within a few hours", "within an hour"
), class = "factor"), host_verifications = structure(c(3L, 1L,
2L), .Label = c("['email', 'phone', 'facebook', 'reviews', 'jumio', 'government_id', 'work_email']",
"['email', 'phone', 'reviews', 'jumio', 'government_id']", "['phone']"
), class = "factor"), amenities = structure(c(2L, 1L, 3L), .Label = c("TV,Internet,Wifi,Kitchen, Pets live on this property ,Dog(s), Free street parking , Buzzer/wireless intercom ,Heating, Family/kid friendly ,Washer,Dryer, Smoke detector , Carbon monoxide detector , Fire extinguisher , Hair dryer ,Iron, Laptop friendly workspace , translation missing: en.hosting_amenity_50 , Self check-in ,Lockbox, Private living room , Hot water ,Microwave, Coffee maker ,Refrigerator,Dishwasher, Dishes and silverware , Cooking basics ,Oven,Stove, Patio or balcony , Garden or backyard , Baking sheet , Trash can .",
"TV,Wifi, Air conditioning ,Kitchen,Heating,Washer,Dryer, Smoke detector , Carbon monoxide detector , Fire extinguisher ,Essentials,Hangers, Hair dryer , Self check-in ,Keypad, Private entrance .",
"TV,Wifi, Air conditioning ,Kitchen,Heating,Washer,Dryer, Smoke detector , Carbon monoxide detector , First aid kit , Fire extinguisher ,Essentials,Shampoo, Lock on bedroom door ,Hangers, Hair dryer ,Iron, Laptop friendly workspace , Self check-in , Smart lock , Private living room , Private entrance , Room-darkening shades , Hot water , Bed linens ,Microwave, Coffee maker ,Refrigerator,Dishwasher, Dishes and silverware , Cooking basics ,Oven,Stove, Garden or backyard ."
), class = "factor")), row.names = c(NA, 3L), class = "data.frame")
Minimal code:
library("stringr")
train <- read.csv('path/file.csv', stringsAsFactors = T)
train['sorted_amenities'] <- NA
# add an empty column to the table
last_column = c()
for (j in 1:nrow(train)){
splitted_x <- strsplit(as.character(train[j, 'amenities']), split = ',')
# split the string into array/list of words, separated by comma
trimmed_x = c()
# initialize an empty array/list
for (i in 1:length(splitted_x[[1]])){ # for each row in the column
temp <- gsub("\\.", "", splitted_x[[1]][i])
# get rid of the period sign that might be at the end of the string
trimmed_x[i] <- str_trim(temp)
# get rid of spaces that might be and the start/end of the string, and store into new array.list
}
last_column[j] <- list(trimmed_x)
# add the each list to each element of last_column
}
train$sorted_amenities <- last_column
# put the list of lists into the empty new column of the table
model = lm(price ~ host_name + sorted_amenities + host_verifications, data = train)
Necessary Info:
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 11.6
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.4.0
loaded via a namespace (and not attached):
[1] compiler_4.0.3 magrittr_2.0.1 htmltools_0.5.0 tools_4.0.3 yaml_2.2.1 stringi_1.5.3 rmarkdown_2.11 knitr_1.36
[9] xfun_0.28 digest_0.6.28 rlang_0.4.12 evaluate_0.14