
I have a dataset containing several variables, and I want to test each variable separately for differences between groups (Kruskal-Wallis test).

My data (df) looks like this: carbon and nitrogen content for different agricultural managements (see name). I have 16 groups; to keep it simple, only 8 are shown here:

extract of the data

name    N_cont  C_cont  agriculture
C_ero   1,064   8,380   1
C_ero   0,961   8,086   1
C_ero   0,977   8,331   1
Ds_ero  1,767   17,443  2
Ds_ero  1,802   18,264  2
Ds_ero  2,083   20,112  2
Ms_ero  1,547   14,380  3
Ms_ero  1,566   15,313  3
Ms_ero  1,505   14,760  3
Md_ero  1,512   14,303  4
Md_ero  1,656   15,331  4
Md_ero  1,500   13,788  4
C_upsl  1,121   10,581  5
C_upsl  1,159   10,460  5
C_upsl  1,223   10,171  5
Ds_upsl 1,962   20,656  6
Ds_upsl 1,784   16,780  6
Ds_upsl 1,720   17,482  6
Ms_upsl 1,578   16,228  7
Ms_upsl 1,634   15,331  7
Ms_upsl 1,394   13,419  7
Md_upsl 1,286   11,824  8
Md_upsl 1,241   11,452  8
Md_upsl 1,317   11,932  8

I have already converted agriculture to a factor:

df$agriculture<-factor(df$agriculture)

I can run statistical tests comparing all of the groups, e.g. kruskal.test(df$C_cont, df$agriculture)
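A minimal sketch of running that test once per variable, across all groups. It assumes the decimal commas in the extract have been read as numbers (for example with dec = "," when importing), so the values are typed with decimal points here; the names vars and tests are only illustrative.

```r
# Example data rebuilt from the extract above (decimal commas written as points).
df <- data.frame(
  name = rep(c("C_ero", "Ds_ero", "Ms_ero", "Md_ero",
               "C_upsl", "Ds_upsl", "Ms_upsl", "Md_upsl"), each = 3),
  N_cont = c(1.064, 0.961, 0.977, 1.767, 1.802, 2.083,
             1.547, 1.566, 1.505, 1.512, 1.656, 1.500,
             1.121, 1.159, 1.223, 1.962, 1.784, 1.720,
             1.578, 1.634, 1.394, 1.286, 1.241, 1.317),
  C_cont = c(8.380, 8.086, 8.331, 17.443, 18.264, 20.112,
             14.380, 15.313, 14.760, 14.303, 15.331, 13.788,
             10.581, 10.460, 10.171, 20.656, 16.780, 17.482,
             16.228, 15.331, 13.419, 11.824, 11.452, 11.932),
  agriculture = factor(rep(1:8, each = 3))
)

# Run one Kruskal-Wallis test per measured variable, grouped by agriculture.
vars <- c("N_cont", "C_cont")  # illustrative: the columns to test
tests <- lapply(setNames(vars, vars), function(v) {
  kruskal.test(df[[v]], df$agriculture)
})

# Collect the p-values into a named vector.
sapply(tests, function(t) t$p.value)
```

Each element of tests is a regular htest object, so tests$C_cont prints the full result for the carbon content.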

But now I would like to run the tests only for specific groups out of the 8, e.g. those whose name column contains C (conventional) or DS (direct seeding), or ero (eroding site) or upsl (upper slope).

I tried grep and split, but neither worked, because the dimensions of x and y have to be the same.

Do you have any clue?

S.R.
  • Hello. Please have a look at this [link](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Sotos Feb 24 '16 at 08:31

2 Answers


You can subset with grepl. Assuming you want rows whose name contains either DS, upsl or C:

df[grepl("(DS)|(upsl)|(C)", df$name), ]

#     name  N_cont C_cont   agriculture
#1    C_ero  1,064  8,380           1
#2    C_ero  0,961  8,086           1
#3    C_ero  0,977  8,331           1
#13  C_upsl  1,121 10,581           5
#14  C_upsl  1,159 10,460           5
#15  C_upsl  1,223 10,171           5
#16 Ds_upsl  1,962 20,656           6
#17 Ds_upsl  1,784 16,780           6
#18 Ds_upsl  1,720 17,482           6
#19 Ms_upsl  1,578 16,228           7
#20 Ms_upsl  1,634 15,331           7
#21 Ms_upsl  1,394 13,419           7
#22 Md_upsl  1,286 11,824           8
#23 Md_upsl  1,241 11,452           8
#24 Md_upsl  1,317 11,932           8

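Note that grepl is case-sensitive, so "DS" does not match the Ds_ero rows, which is why they are missing above; use "Ds" or ignore.case = TRUE if they should be included. If you do not want to hard-code the name values, you can also try: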

x <- c("C", "DS", "upsl")
df[grepl(paste0(x, collapse = "|"), df$name), ]

which would also yield the same result.
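A minimal sketch of going one step further: building one subset per pattern as a named list and running the test inside each subset. Subsetting whole rows of df keeps the measurements and the grouping factor the same length, which is the likely cause of the "same dimensions" problem when only one of the two vectors is filtered. The names patterns, subsets and results are illustrative, the "^Ds_" casing is an assumption taken from the data shown in the question, and the values are typed with decimal points instead of decimal commas.

```r
# Example data rebuilt from the question (decimal commas written as points).
df <- data.frame(
  name = rep(c("C_ero", "Ds_ero", "Ms_ero", "Md_ero",
               "C_upsl", "Ds_upsl", "Ms_upsl", "Md_upsl"), each = 3),
  N_cont = c(1.064, 0.961, 0.977, 1.767, 1.802, 2.083,
             1.547, 1.566, 1.505, 1.512, 1.656, 1.500,
             1.121, 1.159, 1.223, 1.962, 1.784, 1.720,
             1.578, 1.634, 1.394, 1.286, 1.241, 1.317),
  C_cont = c(8.380, 8.086, 8.331, 17.443, 18.264, 20.112,
             14.380, 15.313, 14.760, 14.303, 15.331, 13.788,
             10.581, 10.460, 10.171, 20.656, 16.780, 17.482,
             16.228, 15.331, 13.419, 11.824, 11.452, 11.932),
  agriculture = factor(rep(1:8, each = 3))
)

# One regular expression per management type, anchored at the start of name.
patterns <- c(C = "^C_", Ds = "^Ds_")  # illustrative patterns

# Named list of row subsets; droplevels() removes the unused factor levels.
subsets <- lapply(patterns, function(p) {
  droplevels(df[grepl(p, df$name), ])
})

# Kruskal-Wallis test of C_cont between the groups left in each subset.
results <- lapply(subsets, function(d) {
  kruskal.test(C_cont ~ agriculture, data = d)
})
```

Here subsets$C holds the C_ero and C_upsl rows and results$C the corresponding test; replacing C_cont with N_cont in the formula tests the nitrogen content instead.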

Ronak Shah
  • Thank you! That is exactly what I'm looking for! Another little question: is it possible to get the results separately in one step, so that I get a result like this (as a list)?
    [1] "C"
    name N_cont C_cont agriculture
    C_ero 1,064 8,380 1
    C_ero 0,961 8,086 1
    C_ero 0,977 8,331 1
    C_upsl 1,121 10,581 5
    C_upsl 1,159 10,460 5
    C_upsl 1,223 10,171 5
    [2] "DS"
    Ds_ero 1,767 17,443 2
    Ds_ero 1,802 18,264 2
    Ds_ero 2,083 20,112 2
    Ds_upsl 1,962 20,656 6
    Ds_upsl 1,784 16,780 6
    Ds_upsl 1,720 17,482 6
    – S.R. Feb 24 '16 at 10:13
  • Not sure what you want exactly. Could you update the question? – Ronak Shah Feb 24 '16 at 12:13

Load the data.table package.

library(data.table)

Create a subset of the group you want to do your stats on: if your dataframe is df, then

DT<-data.table(df)
DT[like(name,"C_")]

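.. OR use the sqldf package. In SQL LIKE, _ matches any single character and % any sequence of characters, so 'C_' on its own only matches two-character names; 'C_%' matches names that start with C:

library(sqldf)
sqldf("select * from df where name like 'C_%'")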
CuriousBeing