I have just started using R for statistical analysis and I am still learning. I am having trouble creating loops in R, and I was wondering if anyone can help me with the following case. To me it seems impossible, but for some of you it is probably a piece of cake. I have a data set of different firms across different years. For each firm I have quarterly earnings data, and I need to calculate the median of earnings for each firm for each year. The data set looks like the following:

Date      Firm    Earnings
1Q 2009   A       1000    
2Q 2009   A       1500   
3Q 2009   A       500
4Q 2009   A       2000
1Q 2010   A       1200
2Q 2010   A       1800
3Q 2010   A       2100
4Q 2010   A       2500
1Q 2009   B       1750 
2Q 2009   B       2400
3Q 2009   B       3000
4Q 2009   B       2050
.
.

The result I need looks like the following:

Year     Firm      Median 
2009      A         1250
2010      A         1950
2009      B         2225
2010      B         ....

I hope you can help me with this issue. Thank you in advance :)

Henry
hbtf.1046
  • Is the data literally in those kind of random lines, or is it a formatting problem with your post? – Gopala Apr 11 '16 at 21:14
  • @Gopala- it was a formatting problem with the post – Henry Apr 11 '16 at 21:15
  • @Henry - thank you Henry, I am still new with stackoverflow.com. I have been struggling to re-format my post :) – hbtf.1046 Apr 11 '16 at 21:19
  • @hbtf.1046 the { } icon above the edit box is useful for code and tables – Henry Apr 11 '16 at 21:20
  • I would not suggest using loops here - you may not have the language to ask the right question yet, but you are looking to "calculate by group in r" - that would lead you to this question: http://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame. I will mark as duplicate, but hope this helps! – Chris Apr 11 '16 at 21:34

2 Answers

Did you mean "Mean" instead of Median? If that's the case, you can use a nifty function called aggregate() (if you do want the median, pass median in place of mean). Assuming your second column is called "Year," you could try this:

newdata <- aggregate(mydata$Earnings, list(Year=mydata$Year, Firm=mydata$Firm), mean)
  • Thank you Gerry for your help, but when I apply the code nothing changes. I get the same data set back. – hbtf.1046 Apr 11 '16 at 22:33
  • Did you have the right column names? It worked for me... d<- read.csv("Book1.csv") names(d) = c("Quarter","Year","Firm","Earnings") aggregate(d$Earnings, list(Year=d$Year, Firm=d$Firm), mean) – Gerry Song Apr 11 '16 at 23:25
  • I think you were confused by the column names. I only have 3 columns: the Date column contains both the quarter and the year, so I need to separate out the quarter before I apply your code. By the way, your code works fine if I have 4 columns. Thank you again for your help, I appreciate it. – hbtf.1046 Apr 12 '16 at 08:23
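The column split described in the last comment could be sketched roughly as follows. This is only an illustration under assumptions: the data frame name mydata follows the answer above, the rows are the sample rows from the question, and the Year column and the regular expression are additions, not part of the original answer. It also uses median rather than mean, since that is what the question asks for.

```r
# Sample rows copied from the question; `mydata` is the name used in the answer
mydata <- data.frame(
  Date = c("1Q 2009", "2Q 2009", "3Q 2009", "4Q 2009",
           "1Q 2010", "2Q 2010", "3Q 2010", "4Q 2010",
           "1Q 2009", "2Q 2009", "3Q 2009", "4Q 2009"),
  Firm = rep(c("A", "B"), times = c(8, 4)),
  Earnings = c(1000, 1500, 500, 2000, 1200, 1800, 2100, 2500,
               1750, 2400, 3000, 2050),
  stringsAsFactors = FALSE
)

# Assumption: every Date looks like "<quarter>Q <year>".
# Drop the leading quarter label so only the year remains.
mydata$Year <- as.integer(sub("^[1-4]Q *", "", mydata$Date))

# Median earnings for each firm in each year
newdata <- aggregate(Earnings ~ Year + Firm, data = mydata, FUN = median)
names(newdata)[names(newdata) == "Earnings"] <- "Median"
print(newdata)
```

The formula interface of aggregate() keeps the grouping columns named Year and Firm in the result, so only the value column needs renaming.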

You can use the helpful plyr package:

install.packages("plyr")
library(plyr)

### Assuming your data is stored in a data frame called "x" ###

### Strip the quarter prefix ("1Q " to "4Q ") from the Date variable, leaving only the year ###
x$Date <- sub("^[1-4]Q *", "", x$Date)

### Collapse by Date and by Firm ###
y <- ddply(x, c("Date", "Firm"), summarise,
       Median = median(Earnings, na.rm = TRUE))
cody_stinson
  • I received this message when I tried to install the package: package ‘dplyr’ is not available (for R version 3.1.1) – hbtf.1046 Apr 11 '16 at 22:03
  • I would recommend updating R! You can do so from "Help --> Check for Updates" – cody_stinson Apr 11 '16 at 22:13
  • dplyr is a very useful package and well worth taking a closer look at. Especially as a new user, it can save you a lot of time in manipulating your dataset. – cody_stinson Apr 11 '16 at 22:15
  • I found the package; it is called "plyr". The code you gave me worked just fine. Thank you for your help. I have just one more question, if possible: if I have daily dates rather than quarterly dates, e.g. 1/2/2009, 2/2/2009, 3/2/2009 ..., how can I strip the day and month from the date variable and keep the year? – hbtf.1046 Apr 11 '16 at 22:30
  • Yes, you're right -- I meant to write "plyr" instead of "dplyr". This is a helpful link to format dates: http://www.statmethods.net/input/dates.html – cody_stinson Apr 11 '16 at 22:47
  • If they're not already classified as dates, you can use x$Date <- as.Date(x$Date, "%m/%d/%Y") – cody_stinson Apr 11 '16 at 22:48
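For the follow-up about daily dates, one way to keep only the year is to parse the strings with as.Date() and then format the result with "%Y". The sketch below is hedged: the three sample dates are made up for illustration, the month/day/year order is taken from the as.Date() call in the comment above (use "%d/%m/%Y" instead if the day comes first), and the Year column is a hypothetical name.

```r
# Hypothetical daily data; the data frame name `x` follows the answer
x <- data.frame(
  Date = c("1/2/2009", "2/2/2009", "3/2/2010"),
  stringsAsFactors = FALSE
)

# Parse the strings as dates (assumed to be month/day/year)
x$Date <- as.Date(x$Date, format = "%m/%d/%Y")

# Keep only the year, as an integer, in a new column
x$Year <- as.integer(format(x$Date, "%Y"))
print(x)
```

The Year column can then be used as the grouping variable in ddply() or aggregate() in place of the stripped Date column.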