0

I have a main data frame containing products info such as ID, description and category (and many other variables).

main.df <- structure(list(product.ID = 1:10, 
description = c("abc...", "bcd...", "def...", "efg...", "fgh...", 
"ghi...", "hij...", "ijk...", "jkl...", "klm..."), 
category = c("a", "b", "c", "d", "e", "a", "b", "c", "d", "e")),
 .Names = c("product.ID", "description", "category"), 
 row.names = c(NA, -10L), class = "data.frame")

Then, I have a second data frame which lists the class of product each specific category belong to:

classes.df <- structure(list(category = c("a", "b", "c", "d", "e"), 
classe = c("aaa", "bbb", "aaa", "ccc", "bbb")), 
.Names = c("category", "classe"), 
row.names = c(NA, -5L), 
class = "data.frame")

The "category" variables is what 'links' the 2 data frames.

I need to add a variable in the main.df to mention the class of product each row belongs to, but I don't know how to.

Considering that my actual main.df is 4.5 millions rows spread over 90,000+ categories and my actual classes.df has the 90,000+ rows corresponding to 120 classes, how can I do that. Thank you.

main.df structure is

Classes ‘data.table’ and 'data.frame':  250000 obs. of  16 variables:
 $ ID         : int  4722 6988 9184 13224 13511 15938 19244 21162 23294 23793 ...
 $ dataset    : Factor w/ 2 levels "BA", "RB",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ prodID     : num  429 429 429 429 429 429 429 429 429 429 ...
 $ ProdName   : chr  "aaa" "aaa" "bbb" "ccc" "eee" ...
 $ manufacID  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ time       : num  1271636264 1062977828 1218368958 1305424000 1284596323 ...
 $ serial     : chr  "BA1" "BA1" "RB1" "RB7" ...
 - attr(*, "sorted")= chr "serial"
 - attr(*, ".internal.selfref")=<externalptr> 

classes.df structure is:

 Classes ‘data.table’ and 'data.frame': 20565 obs. of  5 variables:
   $ ID         : int  652 1204 1252 1379 2334 2335 2336 2337 3186 3187 ...
   $ mName   : chr  "XYZ" "EHD" "DLK" "TSH" ...
   $ country: chr  "Argentina" "USA" "UK" "Argentina" ...
   $ serial : chr  "RB7" "BA1" "RB97" "RB732" ...
   - attr(*, ".internal.selfref")=<externalptr>

(for confidentiality reasons, I had to anonymise the names)

Arun
  • 116,683
  • 26
  • 284
  • 387
Sal
  • 117
  • 12
  • If I understand correctly, you want to use `serial` column as the link variable. But, there is `ID` column which is also a common variable. In the expected result, how many columns will `main.df` have? – akrun Nov 15 '14 at 08:39

1 Answers1

1

Try data.table for bigger datasets

library(data.table)
setkey(setDT(main.df), category)
setDT(classes.df)
main.df[classes.df][order(product.ID),]
 #    product.ID description category classe
 #1:          1      abc...        a    aaa
 #2:          2      bcd...        b    bbb
 #3:          3      def...        c    aaa
 #4:          4      efg...        d    ccc
 #5:          5      fgh...        e    bbb
 #6:          6      ghi...        a    aaa
 #7:          7      hij...        b    bbb
 #8:          8      ijk...        c    aaa
 #9:          9      jkl...        d    ccc
#10:         10      klm...        e    bbb

Or using dplyr

 library(dplyr)
 left_join(main.df, classes.df, by='category')

A base R option would be to use merge (would be slower)

 merge(main.df, classes.df, by='category', all.x=TRUE)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you, Akrun. A few comments: 1) the data.table method works fine on the example but always gives me an error message on the real dataset "Error in `[.data.table`(main.df, classes.df) : x.'serial' is a character column being joined to i.'ID' which is type 'integer'. Character columns must join to factor or character columns."; 2) the base R methods works fine but it cancels the rows that do not belong to any categgory; 3) the ply method is great: fast and adds NAs to rows classes which do not belong to existing categories. – Sal Nov 15 '14 at 08:11
  • @Sal Regarding the comment 1), Could you show the `str(main.df)` and `str(classes.df)` and if possible could you run on a small subset of both datasets to check if the error exists. 2), You may have to use `all=TRUE` within the `merge`. – akrun Nov 15 '14 at 08:18
  • @Sal Please paste it on your post. It is a bit hard to read from the comments. – akrun Nov 15 '14 at 08:32
  • @Sal You have 5 variables in `classes.df`. In the example you had only 2 variables. Which is the variable you want to join ito `main.df`? Have you tried the code on a small subset of each dataset? Also, can you `dput(head(main.df))`, `dput(head(classes.df))` and paste the output in your post. – akrun Nov 15 '14 at 08:34
  • the `serial` variable is the "link" between the 2 datasets. I need to add the corresponding `country` from the classes.df into the main.df – Sal Nov 15 '14 at 08:54
  • @Sal Then subset the `classes.df`. `classes.df1 <- classes.df[,c('country', 'serial')]` and apply the codes. Hope it works. – akrun Nov 15 '14 at 09:10
  • yes, it works. Thank you. dplyr and data.table methods are both very fast on the complete dataset (4.5 million rows) – Sal Nov 16 '14 at 07:19