I am using two large data files, each with more than 2 million records. The sample data frames are:

x <- data.frame("ItemID" = c(1,2,1,1,3,4,2,3,4,1), "SessionID" = c(111,112,111,112,113,114,114,115,115,115), "Avg" = c(1.0,0.45,0.5,0.5,0.46,0.34,0.5,0.6,0.10,0.15),"Category" =c(0,0,0,0,0,0,0,0,0,0))
y <- data.frame("ItemID" = c(1,2,3,4,3,4,5,7),"Category" = c("1","0","S","120","S","120","512","621"))

I successfully filled x$Category using the following command:

x$Category <- y$Category[match(x$ItemID,y$ItemID)]

but

x$Category

gave me

[1] 1   0   1   1   S   120 0   S   120 1  
Levels: 0 1 120 512 621 S

In x there are only four distinct categories, but Levels shows six. Similarly, the frequency table lists 512 and 621 with a frequency of 0. I am using the same data for classification, where it reports six classes instead of four, which negatively affects the F-measure, recall, etc.

table(x$Category)
0   1 120 512 621   S 
2   4   2   0   0   2 

while I want

table(x$Category)
0   1 120  S 
2   4   2  2 

I tried merge (as suggested in this and this, along with a number of other questions), but it gives me an error message. From "Practical limits of R data frame" I gathered that this is a limitation of R.

Dr. Abrar

1 Answer

I would omit the Category column from your x data.frame, since it seems to only be serving as a placeholder until values from the y data.frame are filled in. Then, you can use left_join from dplyr with ItemID as the key variable, followed by droplevels() as suggested by TingITangIBob.

This gets you close, but my table does not exactly match yours:

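library(dplyr)  # provides %>%, select() and left_join()

res <- select(x, -Category) %>%   # drop the placeholder column
  left_join(y, by = "ItemID") %>% # look up Category by ItemID
  droplevels()                    # discard factor levels that no longer occur
table(res$Category)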

0   1 120   S
2   4   4   4

I think this may have to do with the repeated ItemIDs (3 and 4 each appear twice in y)?
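If the join itself is the problem on the full data, a rough sketch that stays with your original match() lookup and only adds droplevels() might look like the following. It uses your sample data only and has not been tried on 2m+ rows. The names y_unique and merged are mine, and stringsAsFactors = TRUE is an assumption I added so that Category is a factor on R 4.0 and later, as it was in your output.

```r
# Sample data from the question. stringsAsFactors = TRUE is set explicitly
# (an assumption) because R >= 4.0 no longer creates factors by default.
x <- data.frame(ItemID = c(1, 2, 1, 1, 3, 4, 2, 3, 4, 1),
                SessionID = c(111, 112, 111, 112, 113, 114, 114, 115, 115, 115),
                Avg = c(1.0, 0.45, 0.5, 0.5, 0.46, 0.34, 0.5, 0.6, 0.10, 0.15))
y <- data.frame(ItemID = c(1, 2, 3, 4, 3, 4, 5, 7),
                Category = c("1", "0", "S", "120", "S", "120", "512", "621"),
                stringsAsFactors = TRUE)

# match() returns the first matching row of y for each ItemID in x, so the
# duplicated ItemIDs in y do not multiply rows. droplevels() then removes
# the levels (512, 621) that never occur in x.
x$Category <- droplevels(y$Category[match(x$ItemID, y$ItemID)])

levels(x$Category)
# [1] "0"   "1"   "120" "S"
table(x$Category)
#   0   1 120   S
#   2   4   2   2

# If a join is still preferred, keeping one row per ItemID in y first
# (assuming duplicate rows always agree on Category) avoids the inflated counts.
y_unique <- y[!duplicated(y$ItemID), ]
merged <- merge(x[, c("ItemID", "SessionID", "Avg")], y_unique,
                by = "ItemID", all.x = TRUE)
table(droplevels(merged$Category))
```

The first half never builds a joined table, so it may be lighter on memory than merge or left_join, though I cannot say whether that is what crashes for you.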

LLeki
    Thank you, but as I mentioned, join and merge are not working with my dataset. It always crashes at some point. – Dr. Abrar Oct 24 '18 at 14:33