
I have a data frame with 117206 rows and 4 columns: userId, itemId, rating and time. Its structure is given below.

 'data.frame':  117206 obs. of  4 variables:
 $ userId: Factor w/ 19043 levels "1","2","3","4",..: 1 1 2 3 3 3 4 5 5 5 ...
 $ itemId: Factor w/ 11451 levels "2844","4936",..: 7402 9729 3404 2976 7932 10035 11093 6718 8297 8537 ...
 $ rating: int  7 8 10 8 8 7 10 2 7 5 ...
 $ time  : Date, format: "2013-04-03" "2013-04-21" "2013-09-18" ...

The head of the data is

  userId  itemId rating       time
1      1 1074638      7 2013-04-03
2      1 1853728      8 2013-04-21
3      2  113277     10 2013-09-18
4      3  104257      8 2013-03-31
5      3 1259521      8 2013-03-24
6      3 1991245      7 2013-03-24

The tail of the data is

       userId  itemId rating       time
117201  19041 2171867      3 2013-09-16
117202  19041 2357129      5 2013-09-21
117203  19041 2381931      4 2013-09-08
117204  19042  816711      8 2013-06-23
117205  19043 1559547      2 2013-07-08
117206  19043 2415464      2 2013-07-14

I am trying to draw a histogram with ggplot, but it does not work. There are two problems:

  1. The counts on the y-axis are not correct
  2. The x-axis labels are not displayed at all

I am using the following code to draw the histogram. The same code produced a correct plot for a different data set of a similar kind with 100K rows.

First I created the x-axis labels:

labels_mtweet = seq(1,length(unique(m_tweet$itemId)),by=600)

So I have labels from 1 up to at most 11451, in steps of 600.

ggplot(m_tweet)+geom_histogram(aes(x=itemId))+
  scale_x_discrete(breaks=labels_mtweet, labels=as.character(labels_mtweet))+
  labs(x="Movie Id", y = "Number of ratings per movie", 
       title = "Distribution of ratings per movie - MovieTweetings")

Above is the code I am using to draw the histogram. When I make a simple plot using table, the values are displayed correctly.

plot(table(m_tweet$itemId),xlab=("Movie Id"),ylab=("Frequency of Movie Rating"),
    main=("Distribution of Ratings per movie - MovieLens"),type="l")

But when I try to do the same with ggplot, the bars do not have the correct height and the x-axis labels are not displayed at all.

I would like to paste the ggplot output here, but for policy reasons I can't. Can anyone spot where things are going wrong? I think I am missing something that is causing the problem.

Any help will be greatly appreciated. I have not provided the output of dput as it is very long.

Thanks.

syebill
  • If you want a histogram, why are you using `geom_bar` rather than `geom_histogram`? – joran Jan 23 '15 at 17:20
  • I have corrected that; it was a mistake on my side. – syebill Jan 23 '15 at 17:32
  • A [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would be helpful. I can't seem to replicate your error. Please provide sample input that does. – MrFlick Jan 23 '15 at 18:21
  • For convenience I have added the file to a GitHub repo. The link is "https://github.com/smbilal/datasharing/blob/master/ratings.dat". This is the original file, and I hope it will help in replicating the problem using the code provided in the question. – syebill Jan 23 '15 at 19:11
  • Running my code, I realized that I actually cannot plot everything properly. Is it an instance of this problem? https://groups.google.com/forum/#!topic/ggplot2/XImz-gJOVlk – CMichael Jan 23 '15 at 20:01

1 Answer


As per my comment, your code (or my variant below) could work in principle, but it does not because there are more than 128 discrete categories:

ggplot(m_tweet)+geom_histogram(aes(x=as.factor(itemId)))+
  scale_x_discrete(breaks=labels_mtweet, labels=as.character(labels_mtweet))+
  labs(x="Movie Id", y = "Number of ratings per movie", 
       title = "Distribution of ratings per movie - MovieTweetings")

Given the limitation on the number of x values for a discrete scale, we cannot get this to work. You may want to consider summarizing your data, e.g.:

library(plyr)
# Count the number of ratings for each movie
summarizedData <- ddply(m_tweet, "itemId", summarise, N = length(rating))
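
If plyr is not available, the same per-movie counts can probably be obtained in base R. The following is only a minimal sketch on a small made-up data frame: the name toy and its values are hypothetical and not taken from ratings.dat. It also converts the factor itemId to a number, which a continuous x axis would need.

```r
# Hypothetical stand-in for m_tweet: itemId is a factor, as in the question
toy <- data.frame(
  userId = factor(c(1, 1, 2, 3, 3)),
  itemId = factor(c("1074638", "1853728", "113277", "1074638", "1074638")),
  rating = c(7L, 8L, 10L, 8L, 8L)
)

# Count ratings per movie; table() returns one entry per factor level
counts <- table(toy$itemId)

# Build a data frame with the same columns as the ddply() result
summarizedToy <- data.frame(
  itemId = as.numeric(names(counts)),  # level labels -> numeric movie ids
  N      = as.vector(counts)
)

# Order by movie id so that a line plot runs from left to right
summarizedToy <- summarizedToy[order(summarizedToy$itemId), ]
print(summarizedToy)
```

Note that as.numeric() is applied to the level labels, not to the factor itself; calling as.numeric() directly on a factor returns the internal level codes rather than the movie ids.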

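Then you can avoid geom_histogram altogether and plot the counts with geom_line over a continuous x axis. This assumes itemId is numeric; if it is a factor, as in the str output of the question, it would first need converting with as.numeric(as.character(itemId)):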

ggplot(summarizedData)+geom_line(aes(x=(itemId),y=N))+
  labs(x="Movie Id", y = "Number of ratings per movie", 
       title = "Distribution of ratings per movie - MovieTweetings")
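
A side note on the missing x-axis labels, offered as something to check rather than a confirmed diagnosis: a discrete scale looks up its breaks among the factor's level labels, and breaks that are not among them are dropped. The breaks in the question are positions (1, 601, ...), whereas the levels are movie ids such as "2844". The sketch below illustrates the mismatch on a made-up factor (the values are hypothetical) using base R only.

```r
# Hypothetical stand-in for m_tweet$itemId (a factor whose levels are movie ids)
itemId <- factor(c("2844", "4936", "4936", "1074638", "1853728", "113277"))

# Breaks built the way the question builds them: positions 1, 3, 5, ...
position_breaks <- seq(1, nlevels(itemId), by = 2)

# None of these positions is itself a level label, so none would match
print(as.character(position_breaks) %in% levels(itemId))

# Picking every 2nd level label instead gives breaks that exist on the scale
level_breaks <- levels(itemId)[position_breaks]
print(level_breaks)
```

With the real data, the analogous expression would be levels(m_tweet$itemId)[labels_mtweet] passed as breaks; this has not been run against ratings.dat.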


CMichael
  • I used the same code for 1682 discrete x categories and it worked, but I do not know why it does not work with 11451 discrete categories. It worked when I used summarize. – syebill Jan 24 '15 at 00:30