
I have split my dataframe according to a range of sub-intervals of one column of continuous data:

Data1 <- read.csv(file.choose(), header = TRUE)

# Order (ascending) by group size
Group.order <- order(Data1$GroupN)

# Assign label to data frame ordered by group
Data1.group.order <- Data1[Group.order, ]

# Set a range of sub-intervals we wish to split the ordered data into
range <- seq(0, 300, by=75)

# Use the split function to split the ordered data, using the cut function which will           
# cut the numeric vector GroupN by the value 'range'
Split.Data1 <- split(Data1.group.order, cut(Data1.group.order$GroupN, range))

With the data split, I now need to find the mean of one column in every subset of the data frame, but despite a lot of effort I'm struggling.

I have been able to find the means of multiple columns across the whole split data frame using the lapply function, but not the mean of one column on its own.
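For illustration, here is a minimal sketch of taking the mean of a single column in each subset. It uses made-up data in place of the CSV, and the column name `x` is hypothetical, since the question does not name the column of interest:

```r
# Made-up data standing in for Data1; 'x' is a hypothetical column name
set.seed(1)
Data1 <- data.frame(GroupN = runif(200, 1, 300), x = rnorm(200))

# Same sub-intervals as in the question
breaks <- seq(0, 300, by = 75)
Split.Data1 <- split(Data1, cut(Data1$GroupN, breaks))

# Apply mean() to one column of each data frame in the list
x.means <- sapply(Split.Data1, function(d) mean(d$x))
x.means
```

`sapply` returns a named vector with one entry per interval; `lapply` would give the same values as a list.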

Any help would be appreciated.

EDIT: I am an R newbie, so what I really want to do is look at the distribution of variable x for each subset of the data frame, i.e. x-axis = 0-75, 75-150, 150-225, 225-300, y-axis = variable x. My plan was to split the data, find the mean of variable x for each subset, then plot variable x against the intervals I used to subset the data frame. However, I'm sure there's a better way of doing this!
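One possible way to get that plot without splitting first is to add the interval as a factor column and let `boxplot` group by it. This is only a sketch on made-up data; the column names `x` and `interval` are assumptions, not taken from the real file:

```r
# Made-up data; 'x' and 'interval' are hypothetical column names
set.seed(1)
Data1 <- data.frame(GroupN = runif(200, 1, 300), x = rnorm(200))

# Label each row with the sub-interval its GroupN value falls into
Data1$interval <- cut(Data1$GroupN, seq(0, 300, by = 75))

# One box per interval: x-axis = interval, y-axis = variable x
bp <- boxplot(x ~ interval, data = Data1,
              xlab = "GroupN interval", ylab = "x")
```

A boxplot shows the spread within each interval rather than only the mean, which may be closer to "looking at a distribution".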

user3237820
  • [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Julius Vainora Jan 26 '14 at 16:14
  • Perhaps something like `lapply(split(DF, f), function(x) mean(x$column_of_interest))` is helpful – alexis_laz Jan 26 '14 at 16:24
  • Why split it in the first place? Perhaps the plyr, dplyr or data.table packages are a better call for this. – marbel Jan 26 '14 at 16:31
  • If you want help, you really need to post your data (or a representative subset), and show the code you've tried. – jlhoward Jan 26 '14 at 21:19

1 Answer


How about something like this with plyr:

library(plyr)  # load plyr for ddply()

dat <- data.frame(x = sample(1:300, 300), y = runif(300) * 10)   # create random data
head(dat)

#    x        y
#1 193 2.580328
#2 119 4.519489
#3  51 5.340437
#4 114 9.249253
#5 236 4.756849
#6 108 5.926478

ddply(dat,                                                 # use dat
      .(grp=cut(dat$x,seq(0,300,75),seq(0,300,75)[-1])),   # group by interval of x (labels = upper bounds)
      summarise,                                           # tell ddply to summarise
      mean=mean(y),                                        # calc mean
      sum=sum(y))                                          # calc sum

#  grp     mean      sum
#1  75 4.620653 346.5490
#2 150 5.337813 400.3360
#3 225 4.238518 317.8889
#4 300 4.996709 374.7532
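If loading plyr is not an option, roughly the same table can be built in base R with `tapply`. This is a sketch on freshly generated random data (the seed is arbitrary), so the numbers will differ from the output above:

```r
# Random data in the same shape as above; the seed is an arbitrary choice
set.seed(42)
dat <- data.frame(x = sample(1:300, 300), y = runif(300) * 10)

# Interval factor labelled by the upper bound of each bin
breaks <- seq(0, 300, 75)
dat$grp <- cut(dat$x, breaks, labels = breaks[-1])

# Mean and sum of y within each interval
res <- data.frame(grp  = levels(dat$grp),
                  mean = as.vector(tapply(dat$y, dat$grp, mean)),
                  sum  = as.vector(tapply(dat$y, dat$grp, sum)))
res
```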
Troy