0

I am using dplyr to try and count a certain number of rows another column has

table

so the output of this would return something like:

facebook 2  
hello.com 1  
news.com 1    
yahoo    1

It is similar to a

select hostname, count(request)   
from data  
group by hostname

in mysql.

I have looked at the dplyr package, but am not sure how to specify which variable I want to group by

edit: The data is from NASA's http requests from this link site

Chris
  • 1,101
  • 2
  • 9
  • 14
  • You can do `group_by(hostname)`, see the examples in [the documentation](https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/group_by). – Florian Feb 22 '18 at 16:09
  • 1
    Please do not provide your data through an image. We cannot cut and paste that into R and no one wants to type it all in again. – G5W Feb 22 '18 at 16:13
  • You should be able to get the answer that you are looking for with `table(df$hostname)` - assuming that your data.frame is named df – G5W Feb 22 '18 at 16:14
  • I would go with a `data.table` approach if you're more used to an SQL type of query. Here's an answer I posted that ilustrates this point for a different question : https://stackoverflow.com/questions/48170427/r-split-observation-values-by-and-aggregate-to-time-intervals/48174944?noredirect=1#comment84020570_48174944 – Gautam Feb 22 '18 at 16:15

1 Answers1

1

Here's the data.table approach. I used the data you provided to illustrate the concept - going forward please provide data so others can reproduce the problem (as pointed out in the comments).

DATA

library(data.table)
dt <- data.table(hostname = c("hello.com", "news.com", "facebook", "yahoo", "facebook"),
                 request = c("GET /blah/blah", "GET /hello", "GET /no", "GET /yes", "GET /hello"))

CODE

> dt
    hostname        request
1: hello.com GET /blah/blah
2:  news.com     GET /hello
3:  facebook        GET /no
4:     yahoo       GET /yes
5:  facebook     GET /hello

> dt[, .N, by = hostname]
    hostname N
1: hello.com 1
2:  news.com 1
3:  facebook 2
4:     yahoo 1

Here .N is data.table parameter that gives you the count. You can rename it to something else ("count" in the below example):

> dt[, .(count = .N), by = hostname]
    hostname count
1: hello.com     1
2:  news.com     1
3:  facebook     2
4:     yahoo     1

If you expect to have multiple possibilities for each entry e.g. facebook or facebook.com or facebook.co.uk, you would need to us regular expressions. A good approach in that case would be to sort by name and then use grep to find the common pattern and aggregate by those.

Gautam
  • 2,597
  • 1
  • 28
  • 51