
I am working with a large dataset that has many different test parameters and their results. This is for water quality, so each row of data is for a concentration of phosphorus, nitrogen, dissolved oxygen, etc.

For this analysis, I have one very large master dataset with 30 columns and over 1 million rows, which is subset into several smaller datasets that are filtered by groups of parameters (NH3 parameters, phosphorus parameters, sulfate ions, etc.).

Here are the columns that are present in the master dataset:

> colnames(dat)
 [1] "TEST_NUMBER"    "WY"             "STATION_ID"     "Date"           "DateSerial"    
 [6] "TimeSerial"     "CY"             "Test_Name"      "Category"       "HalfMDL"       
[11] "UNITS"          "MDL"            "Area"           "Class"          "Diversion"     
[16] "TPWeek"         "WYPeriod"       "TPCriterion"    "FFlow_AcFt"     "DBSource"      
[21] "COLLECT_METHOD" "TPCategory"     "Period"         "Current5Year"   "Phase"         
[26] "month"          "HydroSeason"    "LowerCriteria"  "UpperCriteria"  "Date2"         

After the data are subset, they are transformed using the data.frame() and cast() functions. For some reason, this works for some subsets, but not others.

The error that occurs is below:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
     arguments imply differing number of rows: 256958, 0

Obviously this means that something happens to the dataset that results in zero rows of data, but I am not sure what. Each of the subsetted datasets has rows from the master dataset; they aren’t empty.

Below is the code that subsets the master data frame, called dat, into each of the 6 smaller datasets and casts them. The first 4 work, but the last two produce an error like the one above:

# DO Subset (Works)

DO.dat <- subset(dat, dat$TEST_NUMBER == 7 | dat$TEST_NUMBER == 8)
DO.dat.xtab <- data.frame(cast(
  DO.dat[DO.dat$Diversion == 0, ],
  STATION_ID + Date + TimeSerial + WY + Area + Class + Period + Current5Year + Phase ~ Test_Name,
  value = "HalfMDL", mean
))

# General Parameters Subset (Works)

GenParam.TestNum <- c(9, 10, 12, 36, 67)
GenParam.dat <- data.table(subset(dat, dat$TEST_NUMBER %in% GenParam.TestNum & Diversion == 0))

# NH3 Subset (Works)

NH3.TestNum <- c(7, 10, 20)
NH3.dat <- data.table(subset(dat, TEST_NUMBER %in% NH3.TestNum & Diversion == 0))
NH3.dat.xtab <- data.frame(cast(
  NH3.dat,
  STATION_ID + DateSerial + WY + Area + Class + Period + Current5Year + Phase ~ Test_Name,
  value = "HalfMDL", mean
))

# Phosphorus Subset (Works)

P.dat <- subset(dat, dat$TEST_NUMBER %in% c(23, 25) & dat$Diversion == 0)
P.dat.xtab <- data.frame(cast(
  P.dat,
  STATION_ID + Date + WY + CY + Diversion + Area + Class + Period + Phase +
    Current5Year + TPCategory + TPCriterion + TPWeek + WYPeriod + HydroSeason +
    FFlow_AcFt ~ Test_Name,
  value = "HalfMDL", mean
))

# Nitrogen Subset (Error)

N.dat <- subset(dat, dat$TEST_NUMBER %in% c(18, 21, 89, 100, 20, 80) & dat$Diversion == 0)
N.dat.xtab <- data.frame(cast(
  N.dat,
  STATION_ID + Date + WY + CY + Diversion + Area + Class + Period + Phase +
    Current5Year + TPCategory + TPCriterion + TPWeek + WYPeriod + HydroSeason +
    FFlow_AcFt ~ Test_Name,
  value = "HalfMDL", mean
))

# Sulfate Subset (Error)

ion.dat <- subset(dat, TEST_NUMBER %in% c(28, 29, 30, 31, 32, 33, 67, 78) & Diversion == 0)
ion.dat.xtab <- data.frame(cast(
  ion.dat,
  STATION_ID + Date + WY + CY + Diversion + Area + Class + Period + Phase +
    Current5Year + TPCategory + TPCriterion + TPWeek + WYPeriod + HydroSeason +
    FFlow_AcFt ~ Test_Name,
  value = "HalfMDL", mean
))

All of the columns on the left-hand side of the cast() formulas are present in the resulting x.dat subset data frames, as is the Test_Name column on the right-hand side.
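One way to double-check that every formula variable really is a column is to compare the formula's variable names with the data frame's names. The sketch below is only an illustration on a made-up data frame: `toy` and `missing_formula_cols` are hypothetical names that do not appear in my real code, and it uses base R only. It also counts missing values per column, although whether NAs have anything to do with the error is just a guess.

```r
# Hypothetical helper: list the variables in a cast-style formula that are
# not columns of the data frame. Uses base R only.
missing_formula_cols <- function(df, formula) {
  setdiff(all.vars(formula), names(df))
}

# Toy stand-in for one of the subsets (made-up values, not real data)
toy <- data.frame(
  STATION_ID = c("A", "A", "B"),
  Date       = as.Date(c("2020-01-01", "2020-01-02", "2020-01-01")),
  Test_Name  = c("TN", "TKN", "TN"),
  HalfMDL    = c(0.5, NA, 0.7)
)

# An empty result means every variable in the formula is a column of toy
missing_formula_cols(toy, STATION_ID + Date ~ Test_Name)

# A column that is absent from toy is reported by name
missing_formula_cols(toy, STATION_ID + Date + FFlow_AcFt ~ Test_Name)

# Number of missing values in each column
colSums(is.na(toy))
```

Running the same helper on N.dat and ion.dat with the formulas above would show whether any variable is missing from those two subsets.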

If there is any reason why cast() could result in 0 rows of data other than a column referenced in the formula being absent, I would love any ideas.

I have the dat.Rdata file, but it’s large and I’m not sure how I could make a similar smaller example for Stack Overflow.
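One possible approach would be to keep only the rows and columns involved in a failing call, draw a small random sample, and print it with dput(). The sketch below shows the idea on a made-up data frame: `toy_dat`, its values, the seed, and the sample size of 20 are all assumptions for illustration, and a random sample is not guaranteed to still reproduce the error.

```r
set.seed(42)  # arbitrary seed so the sample is repeatable

# Made-up stand-in for the master data frame (not real data)
toy_dat <- data.frame(
  TEST_NUMBER = rep(c(18, 21, 23), each = 100),
  STATION_ID  = sample(c("S1", "S2", "S3"), 300, replace = TRUE),
  Diversion   = sample(0:1, 300, replace = TRUE),
  Test_Name   = rep(c("NOX", "TKN", "TP"), each = 100),
  HalfMDL     = runif(300)
)

# Keep only the rows and columns involved in the failing call
keep_cols <- c("TEST_NUMBER", "STATION_ID", "Diversion", "Test_Name", "HalfMDL")
n_sub <- subset(toy_dat, TEST_NUMBER %in% c(18, 21) & Diversion == 0)[, keep_cols]

# Draw at most 20 rows at random
small <- n_sub[sample(nrow(n_sub), min(20, nrow(n_sub))), ]

# dput() prints R code that recreates the object, ready to paste into a post
dput(small)
```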

    (1) When using `subset(dat, ...)`, there is no need to use `dat$` on all of the columns, so your first expression can be `subset(dat, TEST_NUMBER==7 | TEST_NUMBER==8)` (slightly more readable, perhaps). (2) Please say where you get `cast`, it is not exported from `reshape2` (though it is an _internal_ function). (3) Sample data would make this much clearer. Please see https://stackoverflow.com/q/5963269 , [mcve], and https://stackoverflow.com/tags/r/info – r2evans Jun 20 '23 at 23:35

0 Answers