3

I've searched a number of places (stackoverflow, r-blogger, etc), but haven't quite found a good option for doing this in R. Hopefully someone has some ideas.

I have a set of environmental sampling data. The data includes a variety of fields (visit date, region, location, sample medium, sample component, result, etc.).

Here's a subset of the pertinent fields. This is where I start...

visit_date   region    location     media      component     result
1990-08-20   LAKE      555723       water       Mg            *Nondetect
1999-07-01   HILL      432422       water       Ca            3.2
2010-09-12   LAKE      555723       water       pH            6.8
2010-09-12   LAKE      555723       water       Mg            2.1
2010-09-12   HILL      432423       water       pH            7.2
2010-09-12   HILL      432423       water       N             0.8
2010-09-12   HILL      432423       water       NH4          112

What I hope to reach is a table/dataframe like this:

visit_date   region    location     media      component     result        pH
1990-08-20   LAKE      555723       water       Mg            *Nondetect  *Not recorded
1999-07-01   HILL      432422       water       Ca            3.2         *Not recorded
2010-09-12   LAKE      555723       water       pH            6.8         6.8
2010-09-12   LAKE      555723       water       Mg            2.1         6.8
2010-09-12   HILL      432423       water       pH            7.2         7.2
2010-09-12   HILL      432423       water       N             0.8         7.2
2010-09-12   HILL      432423       water       NH4          112          7.2

I attempted to use the method here -- R finding rows of a data frame where certain columns match those of another -- but unfortunately didn't get to the result I wanted. Instead the pH column was either my pre-populated value -999 or NA and not the pH value for that particular visit date if it was collected. Since the result data set is around 500k records, I'm using unique(tResult$pH) to determine the values of the pH column.

Here's that attempt. res is the original result data.frame and component would be the pH result subset (the pH sample results from the main results table).

keys <- c("region", "location", "visit_date", "media")

tResults <- data.table(res, key=keys)
tComponent <- data.table(component, key=keys)

tResults[tComponent, pH>0]

I've attempted using match, merge, and within on the original data frame without success. Since then I've generated a subset for the components (pH in this example) where I copied over the results column to a new "pH" column, thinking I could match the keys and update a new "pH" column in the main result set.

Since not all result values are numeric (with values like *Not recorded) I attempted to use numerics like -888 or other values which could substitute so I could force at least the result and pH columns to be numeric. Aside from the dates which are POSIXct values, the remaining columns are character columns. Original dataframe was created using StringsAsFactors=FALSE.

Once I can do this, I'll be able to generate similar columns for other components that can be used to populate and calculate other values for a given sample. At least that's my goal.

So I'm stumped on this one. In my mind it should be easy but I'm certainly NOT seeing it!

Your help and ideas are certainly welcome and appreciated!

Community
  • 1
  • 1
NWdev
  • 483
  • 1
  • 6
  • 19
  • Why Mg has a value of 6.8? Could you please elaborate that? – Metrics Mar 12 '15 at 01:21
  • These are made up values for illustration, so the 6.8 has no basis in reality. A value of 300 mg/l is a more realistic for Mg. – NWdev Mar 12 '15 at 04:56

1 Answers1

4
#df1 is your first data set and is dataframe
df1$phtem<-with(df1,ifelse(component=="pH",result,NA))

library(data.table)
library(zoo) # locf function

setDT(df1)[,pH:=na.locf(phtem,na.rm = FALSE)]
    visit_date region location media component     result phtem  pH
1: 1990-08-20   LAKE   555723 water        Mg *Nondetect    NA  NA
2: 1999-07-01   HILL   432422 water        Ca        3.2    NA  NA
3: 2010-09-12   LAKE   555723 water        pH        6.8   6.8 6.8
4: 2010-09-12   LAKE   555723 water        Mg        2.1    NA 6.8
5: 2010-09-12   HILL   432423 water        pH        7.2   7.2 7.2
6: 2010-09-12   HILL   432423 water         N        0.8    NA 7.2
7: 2010-09-12   HILL   432423 water       NH4        112    NA 7.2

# you can delete phtem if you don't need.

Edit:

library(data.table)
setDT(df1)[,pH:=result[component=="pH"],by="region,location,visit_date,media"]
df1

   visit_date region location media component     result  pH
1: 1990-08-20   LAKE   555723 water        Mg *Nondetect  NA
2: 1999-07-01   HILL   432422 water        Ca        3.2  NA
3: 2010-09-12   LAKE   555723 water        pH        6.8 6.8
4: 2010-09-12   LAKE   555723 water        Mg        2.1 6.8
5: 2010-09-12   HILL   432423 water        pH        7.2 7.2
6: 2010-09-12   HILL   432423 water         N        0.8 7.2
7: 2010-09-12   HILL   432423 water       NH4        112 7.2
eddi
  • 49,088
  • 6
  • 104
  • 155
Metrics
  • 15,172
  • 7
  • 54
  • 83
  • Thank you for this. I'm working through the result to see if it's what I need it to be. So far, so good. – NWdev Mar 12 '15 at 04:57
  • I hadn't looked at `zoo` as a solution before. The thing I want to ensure is that a given pH value that's being populated from the main results is one that is a match for the visit date, region, location, and medium (essentially that it's a pH sample result that is from the same water sample). I'm just wondering how `zoo's na.locf` works. From my reading so far it looks for the prior non-NA value. Is it also looking at the other columns for match or specifically the dates? – NWdev Mar 12 '15 at 05:06
  • In that case, you don't need to use `na.locf` and thus the zoo `package`. The `data.table` is sufficient. Please see the updated solution. – Metrics Mar 12 '15 at 11:03