2

So my task is to clean up the data in a giant table, and create a linear regression model with the data. I noticed a problem that one of the columns store a bunch of tags in a string. The following snippet is two elements in that column.

"TV,Internet,Wifi,Kitchen, Pets live on this property ,Cat(s),Elevator, Free street parking ,Heating,Washer,Dryer, Smoke alarm , Carbon monoxide alarm , First aid kit , Fire extinguisher ,Essentials,Hangers, Hair dryer ,Iron, Laptop-friendly workspace , translation missing: en.hosting_amenity_49 , translation missing: en.hosting_amenity_50 , Hot water , Bed linens , Extra pillows and blankets ,Microwave, Coffee maker ,Refrigerator, Dishes and silverware , Cooking basics ,Oven,Stove, Long term stays allowed .", 
"TV,Internet,Wifi,Kitchen, Pets live on this property ,Dog(s), Free street parking , Buzzer/wireless intercom ,Heating, Family/kid friendly ,Washer,Dryer, Smoke detector , Carbon monoxide detector , Fire extinguisher , Hair dryer ,Iron, Laptop friendly workspace , translation missing: en.hosting_amenity_50 , Self check-in ,Lockbox, Private living room , Hot water ,Microwave, Coffee maker ,Refrigerator,Dishwasher, Dishes and silverware , Cooking basics ,Oven,Stove, Patio or balcony , Garden or backyard , Baking sheet , Trash can ."

I think it would be very inefficient for the model to learn this huge paragraph, I think the better way to do this is to store each tag as an element in a list or an array, and put the column of list/array into the model. This is my code for it:

library("stringr")

train['sorted_amenities'] <- NA
# add an empty column to the table

last_column = c()

for (j in 1:nrow(train)){

  splitted_x <- strsplit(as.character(train[j, 'amenities']), split = ',')
  # split the string into array/list of words, separated by comma

  trimmed_x = c()
  # initialize an empty array/list
  
  for (i in 1:length(splitted_x[[1]])){ # for each row in the column
    temp <- gsub("\\.", "", splitted_x[[1]][i])
    # get rid of the period sign that might be at the end of the string
    trimmed_x[i] <- str_trim(temp)
    # get rid of spaces that might be and the start/end of the string, and store into new array.list
  }
  
  last_column[j] <- list(trimmed_x)
  # add the each list to each element of last_column
  
}

train$sorted_amenities <- last_column
# put the list of lists into the empty new column of the table

After I did that, the new column in the table looks like this (only showing two elements as example):

c("TV", "Wifi", "Air conditioning", "Kitchen", "Heating", 
    "Washer", "Dryer", "Smoke detector", "Carbon monoxide detector", 
    "Fire extinguisher", "Essentials", "Hangers", "Hair dryer", 
    "Self check-in", "Keypad", "Private entrance"), c("TV", 
    "Internet", "Wifi", "Kitchen", "Pets live on this property", 
    "Dog(s)", "Free street parking", "Buzzer/wireless intercom", 
    "Heating", "Family/kid friendly", "Washer", "Dryer", 
    "Smoke detector", "Carbon monoxide detector", "Fire extinguisher", 
    "Hair dryer", "Iron", "Laptop friendly workspace", "translation missing: enhosting_amenity_50", 
    "Self check-in", "Lockbox", "Private living room", "Hot water", 
    "Microwave", "Coffee maker", "Refrigerator", "Dishwasher", 
    "Dishes and silverware", "Cooking basics", "Oven", "Stove", 
    "Patio or balcony", "Garden or backyard", "Baking sheet", 
    "Trash can")

So when I try to add that into the model parameter:

model = lm(price ~ host_name + sorted_amenities + host_verifications, data = train)

I get an error saying that the input cannot be lists.

Error in model.frame.default(formula = price ~ neighbourhood_group_cleansed + : invalid type (list) for variable 'sorted_amenities'

However, there is actually a column in the table that looks like a column of string lists, that looks like this:

"['email', 'phone', 'jumio', 'kba', 'selfie', 'government_id', 'identity_manual', 'work_email']", 
"['email', 'phone', 'reviews', 'jumio', 'government_id']"

I tried to make my new column the same format as this one, so I try to get the type of one object in this column:

id_914965 <- train[train$id == '279053',]
haha <- id_914965$host_verifications
print(haha)
print(typeof(haha))

And the printed result is an "integer".

[1] ['email', 'phone', 'reviews', 'jumio', 'government_id']
516 Levels: ['email', 'facebook', 'google', 'reviews', 'offline_government_id', 'selfie', 'government_id', 'identity_manual'] ...
[1] "integer"

I don't really get this, how can it be an integer? Am I not getting its type correctly?

I also followed some similar questions on Stack Overflow, like using "unlist" method, but in my case it did not work either.

And how can I put the list in to the param list? Am I dealing with the data correctly?

Thank you all in advance.


MRE:

minimal dataset:

structure(list(id = c(914965L, 262763L, 279053L), name = structure(1:3, .Label = c("<BRAND NEW BUILDING! 2 BR 1. 5 BA in Trendy Bushwick>", 
"<Brooklyn Brownstone parlor living with full A/C>", "<Posh one bedroom apartment in Brooklyn>"
), class = "factor"), host_name = structure(c(1L, 3L, 2L), .Label = c("Andrea", 
"Foster", "Jeremy"), class = "factor"), host_location = structure(c(2L, 
1L, 1L), .Label = c("New York, New York, United States", "US"
), class = "factor"), host_response_time = structure(c(2L, 3L, 
1L), .Label = c("within a day", "within a few hours", "within an hour"
), class = "factor"), host_verifications = structure(c(3L, 1L, 
2L), .Label = c("['email', 'phone', 'facebook', 'reviews', 'jumio', 'government_id', 'work_email']", 
"['email', 'phone', 'reviews', 'jumio', 'government_id']", "['phone']"
), class = "factor"), amenities = structure(c(2L, 1L, 3L), .Label = c("TV,Internet,Wifi,Kitchen, Pets live on this property ,Dog(s), Free street parking , Buzzer/wireless intercom ,Heating, Family/kid friendly ,Washer,Dryer, Smoke detector , Carbon monoxide detector , Fire extinguisher , Hair dryer ,Iron, Laptop friendly workspace , translation missing: en.hosting_amenity_50 , Self check-in ,Lockbox, Private living room , Hot water ,Microwave, Coffee maker ,Refrigerator,Dishwasher, Dishes and silverware , Cooking basics ,Oven,Stove, Patio or balcony , Garden or backyard , Baking sheet , Trash can .", 
"TV,Wifi, Air conditioning ,Kitchen,Heating,Washer,Dryer, Smoke detector , Carbon monoxide detector , Fire extinguisher ,Essentials,Hangers, Hair dryer , Self check-in ,Keypad, Private entrance .", 
"TV,Wifi, Air conditioning ,Kitchen,Heating,Washer,Dryer, Smoke detector , Carbon monoxide detector , First aid kit , Fire extinguisher ,Essentials,Shampoo, Lock on bedroom door ,Hangers, Hair dryer ,Iron, Laptop friendly workspace , Self check-in , Smart lock , Private living room , Private entrance , Room-darkening shades , Hot water , Bed linens ,Microwave, Coffee maker ,Refrigerator,Dishwasher, Dishes and silverware , Cooking basics ,Oven,Stove, Garden or backyard ."
), class = "factor")), row.names = c(NA, 3L), class = "data.frame")

Minimal code:

library("stringr")
train <- read.csv('path/file.csv', stringsAsFactors = T)

train['sorted_amenities'] <- NA
# add an empty column to the table

last_column = c()

for (j in 1:nrow(train)){

  splitted_x <- strsplit(as.character(train[j, 'amenities']), split = ',')
  # split the string into array/list of words, separated by comma

  trimmed_x = c()
  # initialize an empty array/list
  
  for (i in 1:length(splitted_x[[1]])){ # for each row in the column
    temp <- gsub("\\.", "", splitted_x[[1]][i])
    # get rid of the period sign that might be at the end of the string
    trimmed_x[i] <- str_trim(temp)
    # get rid of spaces that might be and the start/end of the string, and store into new array.list
  }
  
  last_column[j] <- list(trimmed_x)
  # add the each list to each element of last_column
  
}

train$sorted_amenities <- last_column
# put the list of lists into the empty new column of the table

model = lm(price ~ host_name + sorted_amenities + host_verifications, data = train)

Necessary Info:

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS  11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.4.0

loaded via a namespace (and not attached):
 [1] compiler_4.0.3  magrittr_2.0.1  htmltools_0.5.0 tools_4.0.3     yaml_2.2.1      stringi_1.5.3   rmarkdown_2.11  knitr_1.36     
 [9] xfun_0.28       digest_0.6.28   rlang_0.4.12    evaluate_0.14  
GlenXoseph
  • 113
  • 1
  • 11
  • Please edit your post and replace the screenshots with space-saving text. – tpetzoldt Nov 16 '21 at 06:12
  • Can you post sample data? Please edit **the question** with the output of `dput(train)`. Or, if it is too big with the output of `dput(head(train))`. – Rui Barradas Nov 16 '21 at 06:32
  • @RuiBarradas Thank you, I did think about posting that, but the data set is really too big, with 80+ columns, only 1 of them is related to what I'm struggling about, and I posted the sample data of that column in the description already. – GlenXoseph Nov 16 '21 at 06:45
  • 1
    There's actually no need to share your data. Do you mind to read our tutorial of the R tag on [how-to-make-a-great-r-reproducible-example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? Thanks! – jay.sf Nov 16 '21 at 07:30
  • @tpetzoldt Thank you for the advice, the question is updated. – GlenXoseph Nov 19 '21 at 07:34
  • @jay.sf Thank you for your great resources, the question is updated accordingly. – GlenXoseph Nov 19 '21 at 07:35
  • ... but the "updated question" is still very long, so that only few SO users may read it. Nevertheless, I wish you good luck. – tpetzoldt Nov 19 '21 at 14:25

1 Answers1

1

Your data frame looks like something written or converted from python. Might be better of working with that.

Essentially amenities is a list, if I run your code:

class(train$sorted_amenities)
[1] "list"

And host_verifications is a character which is useless to you.

We need to onehot encode sorted_amenities, using strsplit like you did:

#create the list
lst = strsplit(as.character(train$amenities),"[ ]*[,][ ]*")
#collect all your possible values
allitems = unique(unlist(lst))
amenities_counts = t(sapply(lst,function(i)as.numeric(allitems %in% i)))
colnames(amenities_counts) = allitems
head(amenities_counts[,1:4])

     TV Wifi Air conditioning Kitchen
[1,]  1    1                1       1
[2,]  1    1                0       1
[3,]  1    1                1       1

Then we deal with your host_verifications :

lst = strsplit(gsub("[^a-z,_]*","",train$host_verifications),",")
#collect all your possible values
allitems = unique(unlist(lst))
veri_counts = t(sapply(lst,function(i)as.numeric(allitems %in% i)))
colnames(veri_counts) = allitems
head(veri_counts)

     phone email facebook reviews jumio government_id work_email
[1,]     1     0        0       0     0             0          0
[2,]     1     1        1       1     1             1          1
[3,]     1     1        0       1     1             1          0

Now make a final data.frame (I cannot see price in the data you provided):

final = data.frame(host_name = train$host_name,
        amenities_counts,veri_counts)

You can run your regression on those after removing columns with say only one '1' etc..

StupidWolf
  • 45,075
  • 17
  • 40
  • 72
  • That was what I did in the end as well, creating a column of boolean for each amenity. Your methods are far more efficient and elegant than my for loops tho. Thank you for your detailed answer :) – GlenXoseph Nov 30 '21 at 00:03