How to use a dictionary for a large data frame in R?

Question

I read the answers about creating a dictionary in R:

Is there a dictionary functionality in R

And I have a question: how could I use this in a large dataset? Data structure is like this:

dput of a subsample is:

structure(list(...1 = c("category 1", NA, NA, NA, "total", "category 2", 
NA, NA, NA, "total"), Items = c("product 1", "product 2", "product 3", 
"product 4", NA, "product 1", "product 2", "product 3", "product 4", 
NA), price = c(1, 2, 3, 4, 10, 3, 4, 5, 6, 18)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

And I want the result to look like:

categoryx: {product1:1, product2:2, product3:3....}

What can I do if there are 1000 categories and the number of products differs between categories? In the answers at the links above, the values for each key have to be added manually, and I don't know how to apply that to a large dataset.

Or is there another method (other than creating dictionaries) that would let me extract the information for each category easily?

Could someone give ideas about this question? Thanks.

Is it possible to get a result like a dictionary (or list) of dictionaries in Python?

such as dict={category1: {product1:1, product2:2, product3:3....}, category2: {product1:3, product2:4, product3:5....} }

So I could know each category's index and use that index to extract information from dict. Maybe it would look like this data frame:

             item       price
categoryx    product1   2
             product2   3

so that I could do operations on a specific category?

The data structures `dictionary` and `set` don't have exact equivalents in `R`. The closest is a named `list` or a JSON structure. – akrun, Sep 29 '20 at 00:39

akrun · Answer 1 · 2020-09-29T00:34:38.050


The first column name starts with ..., so rename it to 'grp'. Then use fill from tidyr to replace the NA elements with the previous non-NA element, and filter out the rows where 'Items' is NA (the "total" rows). Next, unite the 'Items' and 'price' columns into a single 'ItemsPrice' column, concatenating with ":" as the sep. Finally, grouped by 'grp', summarise 'ItemsPrice' by collapsing it into a single string with str_c.

library(dplyr)
library(tidyr)
library(stringr)
df1 %>% 
   rename(grp = `...1`) %>% 
   fill(grp) %>%
   filter(!is.na(Items)) %>% 
   unite(ItemsPrice, Items, price, sep=":") %>%
   group_by(grp) %>%
   summarise(ItemsPrice = str_c(ItemsPrice, collapse = ", "))

-output

# A tibble: 2 x 2
#  grp        ItemsPrice                                        
#  <chr>      <chr>                                             
#1 category 1 product 1:1, product 2:2, product 3:3, product 4:4
#2 category 2 product 1:3, product 2:4, product 3:5, product 4:6

edited Sep 29 '20 at 00:34

answered Sep 29 '20 at 00:13

akrun


Thanks! If I want to do operations within each category, such as calculating the total price of several products in one category, should I extract the category and then convert it to a data frame? In the output here, all the product information for a category is in one row. – ling Sep 29 '20 at 00:28
@ling As I mentioned in the comments, the expected output is not clear – akrun Sep 29 '20 at 00:29

@ling: You should avoid creating structures like that. Instead learn to use database methods. The best way to proceed would be to fill in the missing entries in the category column, possibly with `zoo::na.locf`, and throw away the totals. They can be calculated later and you can display the results with `ftable` or other reporting functions, but leave the data in a normalized form. – IRTFM Sep 29 '20 at 00:41
@IRTFM Understand. I will try. Thank you! – ling Sep 29 '20 at 00:45
@IRTFM The method from the answer above also works. It removes the total rows, fills every NA in the category column with the matching categoryx, and keeps the original dataset's form. The code is `dat_clean1 <- tidyr::fill(df[!is.na(df[["Items"]]), ], 1)` – ling Sep 29 '20 at 01:27
@ling: I didn't say it didn't "work". I said it wasn't a good idea to go this route. It will result in a data structure that is much more difficult to manipulate. – IRTFM Sep 29 '20 at 17:42
@IRTFM Oh, sorry for the misunderstanding. The code in my comment is just an alternative to `zoo::na.locf`; I added it as a note for anyone who views this question and wants to fill in the missing entries as you suggested. Thank you for your helpful idea. – ling Sep 30 '20 at 19:19
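
The follow-up in the comments above (totals for some or all products within a category) can be handled directly on the long-form data, without collapsing each category into one string. The following is a minimal sketch, assuming dplyr and tidyr are installed and that the first column has already been renamed to grp; the object names long, totals and subset_total are illustrative only.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, with the first column already named grp.
df1 <- data.frame(
  grp = c("category 1", NA, NA, NA, "total",
          "category 2", NA, NA, NA, "total"),
  Items = c("product 1", "product 2", "product 3", "product 4", NA,
            "product 1", "product 2", "product 3", "product 4", NA),
  price = c(1, 2, 3, 4, 10, 3, 4, 5, 6, 18),
  stringsAsFactors = FALSE
)

# Keep one row per (category, product): fill the labels down, drop the totals.
long <- df1 %>%
  fill(grp) %>%
  filter(!is.na(Items))

# Total price and number of products for every category.
totals <- long %>%
  group_by(grp) %>%
  summarise(total = sum(price), n_items = n())

# Total price of selected products within a single category.
subset_total <- long %>%
  filter(grp == "category 1", Items %in% c("product 1", "product 3")) %>%
  summarise(total = sum(price))
```

Because the pipeline is grouped by grp, it does not matter how many categories there are or whether they contain different numbers of products.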

score 2 · Accepted Answer · answered Sep 29 '20 at 00:52


A list of hashmap dictionaries:

dat <-
  structure(
    list(
      ...1 = c("category 1", NA, NA, NA, "total", "category 2",
               NA, NA, NA, "total"),
      Items = c(
        "product 1",
        "product 2",
        "product 3",
        "product 4",
        NA,
        "product 1",
        "product 2",
        "product 3",
        "product 4",
        NA
      ),
      price = c(1, 2, 3, 4, 10, 3, 4, 5, 6, 18)
    ),
    row.names = c(NA,-10L),
    class = c("tbl_df", "tbl", "data.frame")
  )

library(hashmap)

dat_clean <- tidyr::fill(dat[!is.na(dat[["Items"]]), ], 1)

list_of_dicts <- lapply(split(dat_clean, dat_clean[[1]]), function(d){
  hashmap(d[["Items"]], d[["price"]])  
})

list_of_dicts
# $`category 1`
# ## (character) => (numeric)  
# ## [product 1] => [+1.000000]
# ## [product 3] => [+3.000000]
# ## [product 4] => [+4.000000]
# ## [product 2] => [+2.000000]
# 
# $`category 2`
# ## (character) => (numeric)  
# ## [product 1] => [+3.000000]
# ## [product 3] => [+5.000000]
# ## [product 4] => [+6.000000]
# ## [product 2] => [+4.000000]


# get totals:
lapply(list_of_dicts, function(dict){
  sum(dict$values())
})
# $`category 1`
# [1] 10
# 
# $`category 2`
# [1] 18


Stéphane Laurent


And you could get keys of each category using: `keys(list_of_dicts$`category 1`)`, the result would be `[1] "product 1" "product 2" "product 3" "product 4"` – ling Sep 29 '20 at 01:22
get values `values(list_of_dicts$`category 1`)`, and result is `product 1 product 2 product 3 product 4 1 2 3 4 ` – ling Sep 29 '20 at 01:22
Thank you very much. Users with Version 1.2.5019 could use hash instead of hashmap (which has already been removed from CRAN). The code would be: `list_of_dicts <- lapply(split(dat_clean, dat_clean[[1]]), function(d){ hash(d[["Items"]], d[["price"]]) })` and `lapply(list_of_dicts, function(dict){ sum(values(dict)) })` – ling Sep 29 '20 at 01:24
@ling `hashmap` is removed from CRAN?? Not cool. It is very performant. And I don't see any substantial interest of `hash` as compared to an ordinary `list` (with a `list` you can simply use `names` to get the keys). – Stéphane Laurent Sep 29 '20 at 01:28
Yes, what a pity. However, `hash` can create the dictionaries, and `values(dict)` in `hash` corresponds to `dict$values()` in `hashmap` – ling Sep 29 '20 at 01:31

@ling I think an ordinary list is faster than `hash`. And with an ordinary list, you can do `unlist(dict)` to get the values. Even faster, use a list of vectors instead of a list of dictionaries/lists. – Stéphane Laurent Sep 29 '20 at 01:36
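
The "list of vectors" idea from the comment above can be sketched in base R only, with no extra packages. This is one possible reading of that suggestion, not the commenter's own code; the column name category and the object name price_by_category are assumptions made for the example.

```r
# Sample data from the question, rebuilt as a plain data.frame.
dat <- data.frame(
  category = c("category 1", NA, NA, NA, "total",
               "category 2", NA, NA, NA, "total"),
  Items = c("product 1", "product 2", "product 3", "product 4", NA,
            "product 1", "product 2", "product 3", "product 4", NA),
  price = c(1, 2, 3, 4, 10, 3, 4, 5, 6, 18),
  stringsAsFactors = FALSE
)

# Drop the "total" rows (the ones with no item), then carry the last
# non-NA category label forward. This assumes the first row has a label.
dat <- dat[!is.na(dat$Items), ]
filled <- !is.na(dat$category)
dat$category <- dat$category[filled][cumsum(filled)]

# One named numeric vector per category: names are the keys, prices the values.
price_by_category <- lapply(
  split(dat, dat$category),
  function(d) setNames(d$price, d$Items)
)

names(price_by_category[["category 1"]])           # keys of one category
price_by_category[["category 2"]][["product 3"]]   # look up a single price
sapply(price_by_category, sum)                     # total per category
```

The outer list is indexed by category name and each inner vector by product name, which is close to the dict-of-dicts lookup asked about in the question.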

score 2 · Answer 3 · answered Sep 29 '20 at 01:48

You can use zoo::na.locf to fill the category values

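# give the first column a usable name
names(df)[1] <- 'category'
# carry the last non-NA category label forward
df$category <- zoo::na.locf(df$category)
# drop the "total" rows; they can be recomputed when needed
df <- subset(df, category != 'total')
df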

# A tibble: 8 x 3
#  category   Items     price
#  <chr>      <chr>     <dbl>
#1 category 1 product 1     1
#2 category 1 product 2     2
#3 category 1 product 3     3
#4 category 1 product 4     4
#5 category 2 product 1     3
#6 category 2 product 2     4
#7 category 2 product 3     5
#8 category 2 product 4     6

I would keep the data as above in long format since all the libraries and base R allow grouped operations. So you can calculate anything for each category. I don't see any benefit of complicating the structure beyond this.
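
As an illustration of the grouped operations mentioned above, here is a small base R sketch on the cleaned long-format data. The data frame is rebuilt inline so the snippet runs on its own; the names totals and means are illustrative.

```r
# The cleaned long-format data shown above, rebuilt as a plain data.frame.
df <- data.frame(
  category = rep(c("category 1", "category 2"), each = 4),
  Items = rep(paste("product", 1:4), times = 2),
  price = c(1, 2, 3, 4, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# Sum of prices per category, returned as a data frame.
totals <- aggregate(price ~ category, data = df, FUN = sum)

# Mean price per category, returned as a named array.
means <- tapply(df$price, df$category, mean)
```

The same calls work unchanged whether there are two categories or a thousand, and whether or not each category has the same number of products.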

If you want the data to be in separate dataframes we can use split.

list_df <- split(df[-1], df$category)

Now each category is available as its own data frame. For example, to get the data for category 1:

list_df$`category 1`

# A tibble: 4 x 2
#  Items     price
#  <chr>     <dbl>
#1 product 1     1
#2 product 2     2
#3 product 3     3
#4 product 4     4
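
If the split version is preferred, the same function can be applied to every per-category data frame in one call. The following is a sketch only, rebuilding the cleaned data inline so it is runnable by itself; priciest and n_items are hypothetical names chosen for the example.

```r
# The cleaned long-format data from above, rebuilt as a plain data.frame.
df <- data.frame(
  category = rep(c("category 1", "category 2"), each = 4),
  Items = rep(paste("product", 1:4), times = 2),
  price = c(1, 2, 3, 4, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

list_df <- split(df[-1], df$category)

# Most expensive item in each category.
priciest <- vapply(list_df, function(d) d$Items[which.max(d$price)], character(1))

# Number of products in each category (the counts may differ in real data).
n_items <- vapply(list_df, nrow, integer(1))
```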
