
I have CSV data of a log for 24 hours that looks like this:

svr01,07:17:14,'u1@user.de','8.3.1.35'
svr03,07:17:21,'u2@sr.de','82.15.1.35'
svr02,07:17:30,'u3@fr.de','2.15.1.35'
svr04,07:17:40,'u2@for.de','2.1.1.35'

I read the data with tbl <- read.csv("logs.csv")

How can I plot this data in a histogram to see the number of hits per hour? Ideally, I would get 4 bars per hour, one each for svr01, svr02, svr03, and svr04.

Thank you for helping me here!

    It would help if you provide a reproducible example... – Paul Hiemstra Dec 22 '11 at 10:42
  • The idea is that you have a directory with logfiles coming from 4 different servers, e.g. server01.log, server02.log, server03.log and sever04.log. Next, you grep for "login successful" over these 4 files. You get something in the form of: server01: login successful with parameters ( :login => "u1@user.de", :created_at => "07:17:13", ... ) You reformat this with awk and get one file, e.g. logs.csv, with content as shown above. – poseid Dec 23 '11 at 08:43
  • Thanks for the feedback. However, I meant reproducible in the sense of reproducible R code that reproduces the situation that is related to your specific R question. – Paul Hiemstra Dec 23 '11 at 09:07
  • ok.. I see... I had several smaller problems that caused my rather general question. First, I tried to use an example from using the Zoo library: library(zoo) --> Result: The following object(s) are masked from ‘package:base’: as.Date, as.Date.numeric Another experiment I did was first doing some kind of simple time-scale plot, with time on X, and logins on Y. I did: scale <- tbl[2], email <- tbl[3] and plot(scale, email). Result: 'x' and 'y' lengths differ. I guess this would be 2 new questions for SO. – poseid Dec 23 '11 at 10:16

2 Answers


I am not sure I understood you exactly right, so I will split my answer into parts. The first part shows how to convert your timestamps into a vector you can use for plotting.

a) Converting your data into hours:

  # df being the data frame; 'timestamp' holds strings like "07:17:14"
  df$timestamp <- strptime(df$timestamp, format="%H:%M:%S")
  df$hours <- as.numeric(format(df$timestamp, format="%H"))
  hist(df$hours)
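As a self-contained sketch, the conversion can be tried directly on the sample rows from the question. The column names `server`, `timestamp`, `user`, and `ip` are my assumption, since the log has no header row:

```r
# Read the headerless sample log; col.names are assumed, not from the file.
# quote="'" strips the single quotes around the email and IP fields.
df <- read.csv(text = "svr01,07:17:14,'u1@user.de','8.3.1.35'
svr03,07:17:21,'u2@sr.de','82.15.1.35'
svr02,07:17:30,'u3@fr.de','2.15.1.35'
svr04,07:17:40,'u2@for.de','2.1.1.35'",
               header = FALSE,
               col.names = c("server", "timestamp", "user", "ip"),
               quote = "'",
               stringsAsFactors = FALSE)

# Extract the hour as a number; all four sample hits fall in hour 7
df$hours <- as.numeric(format(strptime(df$timestamp, format = "%H:%M:%S"), "%H"))
```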

This gives you a histogram of hits over all servers. If you want to split the histograms this is one way but of course there are numerous others:

b) Making a histogram with ggplot2

 #install.packages("ggplot2")
  require(ggplot2)
  ggplot(data=df) + geom_histogram(aes(x=hours), binwidth=1) + facet_wrap(~ server)
  # or use a fill colour instead of facets
  ggplot(data=df) + geom_histogram(aes(x=hours, fill=server), binwidth=1)

c) You could also use another package:

 require(plotrix)
 l <- split(df$hours, f=df$server)
 multhist(l)
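Before plotting, it can help to sanity-check the counts numerically. A base-R cross-tabulation gives the same hits-per-server-per-hour numbers that the histograms visualise; this sketch uses simulated data, since the real `df` comes from the log file:

```r
# Simulated data standing in for the parsed log: 200 hits over 4 servers
set.seed(1)
df <- data.frame(server = sample(paste0("svr0", 1:4), 200, replace = TRUE),
                 hours  = sample(0:23, 200, replace = TRUE))

# Rows are servers, columns are hours; each cell is the hit count
tab <- table(df$server, df$hours)
rowSums(tab)  # total hits per server, summing to 200 overall
```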

Example plots are shown below. The third makes comparison easier, but I think ggplot2 simply looks better.

EDIT

Here is how these solutions look:

first solution: [base hist() of hits per hour, all servers combined]

second solution: [ggplot2 histogram, faceted by server]

third solution: [plotrix multhist() grouped histogram]

  • I added some example data in my post, maybe you could test your code with that. – Paul Hiemstra Dec 22 '11 at 11:06
  • @PaulHiemstra thanks - in the meantime I tested it as well. But I like how you generate random times - I did it more awkwardly :D – Seb Dec 22 '11 at 11:09
  • 1
    If you could upload your resulting picture that would make the amount of ggplot awesomeness even bigger :). And it presents the OP with more options. – Paul Hiemstra Dec 22 '11 at 11:11
  • In your second histogram, do the frequencies add up, or are they super positioned? I like the facet_wrap version much better. – Paul Hiemstra Dec 22 '11 at 11:14
  • I like the facet_wrap way too. The frequencies add up in the second image. Shouldn't they? – Seb Dec 22 '11 at 11:16
  • I was just curious. I think the fact that they add up makes it hard to interpret the development of the hit count of server b. This is because the length of the blue vertical bar depends not only on the number of hits of server b, but also on the hits of server a. That's why I like the facet_wrap version. If you want to see the total number of hits (which is harder to see from the facet_wrap version) I would make just one histogram and leave out the server ids. – Paul Hiemstra Dec 22 '11 at 11:20
  • Yes, you're right, this indeed makes it hard to interpret - its main use I think is as a way of analysing the composition of total hits - and for that purpose this is probably suboptimal. I added a `plotrix` example to have a comparison. – Seb Dec 22 '11 at 11:28

An example dataset:

dat = data.frame(server = paste("svr", round(runif(1000, 1, 10)), sep = ""),
                 time = Sys.time() + sort(round(runif(1000, 1, 36000))))

The trick I use is to create a new variable which only specifies in which hour the hit was recorded:

dat$hr = strftime(dat$time, "%H")

Now we can use some plyr magic:

library(plyr)
hits_hour = count(dat, vars = c("server", "hr"))
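If plyr is not available, the same per-server, per-hour counts can be built in base R with `table()` and `as.data.frame()`. This is a sketch that rebuilds the example `dat` from above (note that, unlike `plyr::count`, it also keeps zero-count combinations):

```r
# Rebuild the random example data from the answer
set.seed(42)
dat <- data.frame(server = paste("svr", round(runif(1000, 1, 10)), sep = ""),
                  time = Sys.time() + sort(round(runif(1000, 1, 36000))))
dat$hr <- strftime(dat$time, "%H")

# Base-R equivalent of plyr::count(dat, c("server", "hr")):
# table() cross-tabulates, as.data.frame() flattens back to long form
hits_hour <- as.data.frame(table(server = dat$server, hr = dat$hr))
names(hits_hour)[3] <- "freq"   # match the column name plyr::count produces
```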

And create the plot:

ggplot(data = hits_hour) + geom_bar(aes(x = hr, y = freq, fill = server), stat="identity", position = "dodge")

Which looks like:

[grouped bar chart: hits per hour, bars dodged by server]

I don't really like this plot, I'd be more in favor of:

ggplot(data = hits_hour) + geom_line(aes(x = as.numeric(hr), y = freq)) + facet_wrap(~ server, nrow = 1)

Which looks like:

[line plot of hits per hour, one facet per server in a single row]

Putting all the facets in one row allows easy comparison of the number of hits between the servers. This will look even better when using real data instead of my random data.
