1

I am thinking how to lookup time data by University name (first row: A,...,F), Field name (first column: Acute,...,En) and/or graduation time (time) in the following file DS.csv. I am thinking dplyr approach but could not expand numerical ID lookup (thread answer How to overload function parameters in R?) to the lookup by three variables. Challenges

  1. How to lookup by the first row? Maybe, something similar to $1 == "A".
  2. How to Expand university lookup to two columns? Pseudocode $1 == "A" is about the second and third column, ..., $1 == "F" about two last columns.
  3. Do lookup by 3 lookup criterias: first row (no header), first column with header Field and for the header time. Pseudocode

    times <- getTimes($1 == "A", Field == "Ane", by = "desc(time)")
    

Data DS.csv has the data. The first column denotes experiment. The data below is in crosstab format such that

,A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3

and in the straight table format such that

Field,time,T,Experiment       
Acut,0,0,A
An,9,120,A
En,15.6,2,A
Fo,9.2,2,A
Acute,8.3,1,B       
An,7.7,26,B
En,12.9,1,B
Fo,0,0,B  
Acute,7.5,1,C       
An,7.9,43,C
En,0,0,C  
Fo,5.4,1,C
Acute,8.6,2,D       
An,7.8,77,D
En,0,0,D  
Fo,0,0,D  
Acute,0,0,E         
An,7.9,60,E
En,14.3,1,E
Fo,0,0,E  
Acute,8.3,4,F       
An,8.2,326,F
En,14.6,4,F
Fo,7.9,3,F

Pseudocode

library('dplyr')
ow <- options("warn")
DF <- read.csv("/home/masi/CSV/DS.csv", header = T)

# Lookup by first row, Lookup by Field, lookup by Field's first column?
times <- getTimes($1 == "A", Field == "Ane", by = "desc(time)")

Expected output: 9
Expected output generalised: a, b, c, ...

## Data where values marked by small letters a,b,c, ... are wanted 
#       uni1    uni2 ... 
#       time T  time T ...
#Field1 a       c
#Field2 b       ...
#...    ...

R: 3.3.3 (2017-03-06)
OS: Debian 8.7
Hardware: Asus Zenbook UX303UA

Jaap
  • 81,064
  • 34
  • 182
  • 193
Léo Léopold Hertz 준영
  • 134,464
  • 179
  • 445
  • 697
  • @hhh I've rolled back your edit. You should work with the data OP has posted and not insert your own representation of the data. – Jaap Apr 29 '17 at 12:03
  • But do you know how you can transform your raw data to the edited data? If not, you shouldn't include it in your question imo. It is then to the people answering to get to the desired format. – Jaap Apr 29 '17 at 13:35

3 Answers3

3

Taking your initial raw data as starting point:

# read the data & skip 1st & 2nd line which contain only header information
DF <- read.csv(text=",A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3", header=FALSE, stringsAsFactors=FALSE, skip=2)

# read the first two lines which contain the header information
headers <- read.csv(text=",A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3", header=FALSE, stringsAsFactors=FALSE, nrow=2)

# extract the university names for the 'headers' data.frame
universities <- unlist(headers[1,])
universities <- universities[universities != '']

# create column names from the 'headers' data.frame
vec <- headers[2,][headers[2,] == 'T']
headers[2,][headers[2,] == 'T'] <- paste0(vec, seq_along(vec))
names(DF) <- paste0(headers[2,],headers[1,])

You dataframe now looks as follows:

> DF
   Field timeA  T1 timeB T2 timeC T3 timeD T4 timeE T5 timeF  T6
1: Acute   0.0   0   8.3  1   7.5  1   8.6  2   0.0  0   8.3   4
2:   Ane   9.0 120   7.7 26   7.9 43   7.8 77   7.9 60   8.2 326
3:    En  15.6   2  12.9  1   0.0  0   0.0  0  14.3  1  14.6   4
4:    Fo   9.2   2   0.0  0   5.4  1   0.0  0   0.0  0   7.9   3

As it is better to transform you data into long format:

library(data.table)
DT <- melt(setDT(DF), id = 1, 
           measure.vars = patterns('^time','^T'),
           variable.name = 'university', 
           value.name = c('time','t')
           )[, university := universities[university]][]

Now your data looks like:

> DT
    Field university time   t
 1: Acute          A  0.0   0
 2:   Ane          A  9.0 120
 3:    En          A 15.6   2
 4:    Fo          A  9.2   2
 5: Acute          B  8.3   1
 6:   Ane          B  7.7  26
 7:    En          B 12.9   1
 8:    Fo          B  0.0   0
 9: Acute          C  7.5   1
10:   Ane          C  7.9  43
11:    En          C  0.0   0
12:    Fo          C  5.4   1
13: Acute          D  8.6   2
14:   Ane          D  7.8  77
15:    En          D  0.0   0
16:    Fo          D  0.0   0
17: Acute          E  0.0   0
18:   Ane          E  7.9  60
19:    En          E 14.3   1
20:    Fo          E  0.0   0
21: Acute          F  8.3   4
22:   Ane          F  8.2 326
23:    En          F 14.6   4
24:    Fo          F  7.9   3

Now you can select the required info:

 DT[university == 'A' & Field == 'Ane']

which gives:

   Field university time   t
1:   Ane          A    9 120

Several dplyr examples to filter the data:

library(dplyr)
DT %>% 
  filter(Field=="En" & t > 1)

gives:

  Field university time t
1    En          A 15.6 2
2    En          F 14.6 4

Or:

DT %>%
  arrange(desc(time)) %>%
  filter(time < 14 & t > 3)

gives:

  Field university time   t
1   Ane          A  9.0 120
2 Acute          F  8.3   4
3   Ane          F  8.2 326
4   Ane          C  7.9  43
5   Ane          E  7.9  60
6   Ane          D  7.8  77
7   Ane          B  7.7  26
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • 1
    @LéoLéopoldHertz준영 I've added an example on how to filter with `dplyr`. I used the `data.table`-package for reshaping to long format because it is more flexible than the `tidyr` package, especially because of the possibility to create multiple value-column at once. With regard to the code in your comment: you should use an `&` instead of a comma. – Jaap Apr 29 '17 at 15:23
  • 1
    @LéoLéopoldHertz준영 Yes, I undestand: it is invalid R-syntax. Better: `DF[order(-time)][time < 14 & t > 3]` (if it is a data.table) or `DF[order(-DF$time)][DF$time < 14 & DF$t > 3, ]` if it is a data.frame (notice the extra comma at the end). – Jaap Apr 29 '17 at 19:24
  • 1
    @LéoLéopoldHertz준영 It depends on what you are used to, but with dplyr it is: `DF %>% arrange(desc(time)) %>% filter(time < 14 & t > 3)` – Jaap Apr 29 '17 at 19:48
  • 1
    @LéoLéopoldHertz준영 That is the `data.table` of filtering. See also my [previous comment](http://stackoverflow.com/questions/43694637/how-to-lookup-by-row-and-column-column-names#comment74444218_43695774). The data.frame method is: `DF[DF$Experiment=="A", ]` – Jaap Apr 30 '17 at 11:40
  • 1
    @LéoLéopoldHertz준영 For some more info: (1) [an introduction into `data.table`](https://github.com/Rdatatable/data.table/wiki/Getting-started) and (2) [an intro on `dplyr`](http://dplyr.tidyverse.org/) – Jaap Apr 30 '17 at 12:05
  • Thank you! I am trying to combine/merge outputs here for more dynamic case here http://stackoverflow.com/q/43706199/54964 – Léo Léopold Hertz 준영 Apr 30 '17 at 12:12
2

Change your crosstab

,A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3

into a straight data format

Field,time,T,Experiment
Acut,0,0,A
An,9,120,A
En,15.6,2,A
Fo,9.2,2,A
Acute,8.3,1,B
An,7.7,26,B
En,12.9,1,B
Fo,0,0,B
Acute,7.5,1,C
An,7.9,43,C
En,0,0,C
Fo,5.4,1,C
Acute,8.6,2,D
An,7.8,77,D
En,0,0,D
Fo,0,0,D
Acute,0,0,E
An,7.9,60,E
En,14.3,1,E
Fo,0,0,E
Acute,8.3,4,F
An,8.2,326,F
En,14.6,4,F
Fo,7.9,3,F

where I used Vim.csv plugin and visual-block mode.

Multiple ways to do the selection

This is very easy to do in multiple ways after tidying the data into easy-to-format straight table (not crosstab), I would prefer SQL. I demonstare SQLDDF-package below that is very inefficient with large data but this is small so it will work.

Also instead of the very inefficient builtin functions, such as read.csv, I would refer the very efficient fread in data.table package for reading files.

SQLDF

enter image description here

> library(data.table);
> a<-fread("~/DS_straight_table.csv");
> sqldf("select time from a where Experiment='A' and Field='An'")
  time
1    9

Other without sqldf

> library(data.table);
> a<-fread("~/DS_straight_table.csv");
> a[Experiment=='A' & Field=='An'] 
Field time   T Experiment
1:    An    9 120          A
hhh
  • 50,788
  • 62
  • 179
  • 282
1

Using the "Tall" (straight table) format and library dplyr. Your data only has one value per Field, Experiment.

library(dplyr)    

## this is the more general result
df %>% 
  group_by(Field, Experiment) %>%
  top_n(1, wt = -time)


## example function
getTimes<- function(data, field, experiment) {
  data %>% 
    filter(Field == field, Experiment == experiment) %>%
    top_n(1, wt = -time)
}


getTimes(df, 'An', 'A')

#   Field time   T Experiment
# 1    An    9 120          A
Andrew Lavers
  • 4,328
  • 1
  • 12
  • 19
  • 1
    .. as given by @hhh – Andrew Lavers Apr 29 '17 at 12:43
  • I am thinking to apply descending ordering by `time`, like `DF[time < 14 & t > 3 & by = "desc(time)"]`. I cannot get it work with `data.table` package so I think `dplyr` should work. What do you think? – Léo Léopold Hertz 준영 Apr 29 '17 at 19:17
  • 1
    in the above `top_n `picks the first row in on order of `-time` -- the minus sign "-" makes it descending so this picks the latest time. If you just want it ordered then you can use `arrange(time) ` or `arrange (-time)`. More correct is to use `arrange(desc(time))` because desc works for character as well as numeric. – Andrew Lavers Apr 29 '17 at 19:44
  • @epi99 Yes. Can you integrate the ordering in the paremter of the function `getTimes`? It should be possible. Etc `getTimes(df, Field='An', T='A', desc(time))`. - - You can also have strict passing of parameters, like my example. I think it is better than say that the second parameter is for this and third for this. The function parameters can also be conditionals like `time < 5`. – Léo Léopold Hertz 준영 Apr 30 '17 at 11:02