R: remove records that don't represent all groups

Question

After manipulating the raw data, we obtained the following data.frame:

        ItemID    GroupID mentions
1         601          3     1
2         601          4     1
3         611          3     1
4         661          3     1
5         801          3     1
6         821          3     1
6         841          1     3
6         841          2     3
6         841          3     3
6         841          4     3

I have 10,000 records like this, and my first goal is to figure out which items are represented in all 4 GroupIDs. First I tried to do this visually by plotting:

ggplot(item.stats, aes(x=ItemID, y=mentions, fill=GroupID)) + 
  geom_bar(stat='identity', position='dodge')

With the large dataset this didn't look sensible. What's the best way to get a good idea of how many items represent all groups, and of their mentions?

In the above example, only these rows should remain after filtering:

        ItemID    GroupID mentions
6         841          1     3
6         841          2     3
6         841          3     3
6         841          4     3

Trying to get a meaningful visualization:

test.with.id <- transform(test,id=as.numeric(factor(ItemID)))
ggplot(test.with.id, aes(x=id, y=mentions, fill=GroupID)) + 
  geom_histogram(stat='identity', position='stack', binwidth = 2)

This may be similar to "How to plot multiple stacked histograms together in R?"

Suppose your data is in `dat1`: `with(dat1, ave(GroupID, ItemID, FUN = function(x) length(unique(x))))` — bouncyball, Oct 31 '17 at 18:09
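The comment above only computes the per-row count of distinct groups. A minimal base R sketch of how that count could be turned into a filter follows; the sample data is copied from the question, while the names `groups_per_item` and `dat1_complete` and the hard-coded threshold of 4 are assumptions made for illustration.

```r
# Sample data copied from the question
dat1 <- data.frame(
  ItemID   = c(601, 601, 611, 661, 801, 821, 841, 841, 841, 841),
  GroupID  = c(3, 4, 3, 3, 3, 3, 1, 2, 3, 4),
  mentions = c(1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
)

# For every row, count the distinct GroupID values seen for that row's ItemID
groups_per_item <- with(
  dat1,
  ave(GroupID, ItemID, FUN = function(x) length(unique(x)))
)

# Keep only rows whose item appears in all 4 groups
# (4 is assumed to be the total number of groups, as stated in the question)
dat1_complete <- dat1[groups_per_item == 4, ]
```

With the sample data, only the four rows for item 841 should remain in `dat1_complete`.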

score 1 · Accepted Answer · answered Oct 31 '17 at 18:28


You can group by ItemID, then keep only the groups in which all 4 group IDs appear in the GroupID column:

library(dplyr)

# Keep only items whose rows include every GroupID from 1 to 4
df %>% group_by(ItemID) %>% filter(all(1:4 %in% GroupID))

# A tibble: 4 x 3
# Groups:   ItemID [1]
#  ItemID GroupID mentions
#   <int>   <int>    <int>
#1    841       1        3
#2    841       2        3
#3    841       3        3
#4    841       4        3
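If the group IDs are not known in advance, the same idea can be sketched without hard-coding `1:4`. This is only an illustrative variant, not the answer's code: it assumes that every group appears somewhere in the data, and the names `n_groups`, `complete_items`, and `n_complete` are made up for the example. It also counts how many items represent all groups, which the question asks about.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  ItemID   = c(601, 601, 611, 661, 801, 821, 841, 841, 841, 841),
  GroupID  = c(3, 4, 3, 3, 3, 3, 1, 2, 3, 4),
  mentions = c(1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
)

# Total number of distinct groups present in the data
# (assumption: every group occurs at least once)
n_groups <- n_distinct(df$GroupID)

# Keep only items whose rows cover every group
complete_items <- df %>%
  group_by(ItemID) %>%
  filter(n_distinct(GroupID) == n_groups) %>%
  ungroup()

# Number of items that represent all groups
n_complete <- n_distinct(complete_items$ItemID)
```

On the sample data, `complete_items` should hold the four rows for item 841 and `n_complete` should be 1.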

Psidom


What's the best way to plot this? A density plot? – add-semi-colons Nov 01 '17 at 17:37
I am not sure what you want to visualize. What do *GroupID* and *mentions* mean here? – Psidom Nov 01 '17 at 17:55
I was trying to do a stacked or dodged histogram. I just edited the question to add the ggplot code. Mentions are Facebook likes. You can think of a group as a bucket. – add-semi-colons Nov 01 '17 at 18:04
Possibly `ggplot(df, aes(mentions, fill=factor(GroupID))) + geom_density(alpha = 0.2)`? The `ItemID` seems too specific to include in the graph. Just a guess. Also see this question about [stacked histogram](https://stackoverflow.com/questions/3541713/how-to-plot-two-histograms-together-in-r). – Psidom Nov 01 '17 at 18:15
Sorry, I just edited the question. I also tried a stacked or dodged histogram, but they give me strange numbers on the y axis. – add-semi-colons Nov 01 '17 at 18:16
A histogram doesn't take a `y` aesthetic; you probably need `geom_bar`. Try `ggplot(df, aes(x=ItemID, y=mentions, fill=factor(GroupID))) + geom_bar(position='stack', stat = 'identity')` – Psidom Nov 01 '17 at 18:19
Stacking gives a y-axis range that isn't actually in mentions. For example, the maximum is 50 but it shows 350. Could the values be getting added together? – add-semi-colons Nov 01 '17 at 18:29
I guess that's because of `position='stack'`: it stacks the *mentions* from different *groups* on top of each other, so the bar height is their sum. If you don't want that behavior, you can set `position='dodge'`. – Psidom Nov 01 '17 at 18:32
