This is my first StackOverflow post, so I hope that it isn't too difficult to understand.
I have a large dataset (~14,000) rows of bird observations. These data were collected by standing in one place (point) and counting birds that you see within 3 minutes. Within each point-count a new bird observation becomes a new row, so that there are many repeated dates, times, sites, and point (specific location within a site). Again, each point count is 3 minutes long. So if you see a yellow warbler (coded as YEWA) during minute 1, then it will be associated with MINUTE=1 for that specific point count (date, site, point, and time). ID=observer intials and Number=number of birds spotted (not necessarily important here).
However, if NO BIRDS are seen, then a "NOBI" goes into the dataset for that specific minute. Thus, if there are NOBI for an entire 3 minute point count, their will be three rows with the same date, site, point, and time, and "NOBI" in the "BIRD" column for each of the three rows.
So I have TWO main problems. The first is that some observers entered "NOBI" once for all three minutes, instead of three times (once per minute). Anywhere where "MINUTE" has been left blank (becoming NA), AND "BIRD"="NOBI", I need to add three rows of data, all with the same values for all columns except "MINUTE", which should be 1, 2, and 3 for the respective rows.
If it looks like this:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
1 BS 5/9/2018 CW2 U125 7:51 NA NOBI NA
2 BS 5/9/2018 CW1 D250 8:12 1 YEWA 2
3 BS 5/9/2018 CW1 D250 8:12 2 NOBI NA
4 BS 5/9/2018 CW1 D250 8:12 3 LABU 1
It should look like this instead:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
1 BS 5/9/2018 CW2 U125 7:51 1 NOBI NA
2 BS 5/9/2018 CW2 U125 7:51 2 NOBI NA
3 BS 5/9/2018 CW2 U125 7:51 3 NOBI NA
4 BS 5/9/2018 CW1 D250 8:12 1 YEWA 2
5 BS 5/9/2018 CW1 D250 8:12 2 NOBI NA
6 BS 5/9/2018 CW1 D250 8:12 3 LABU 1
note: If you are wanting to enter some of this data into your R console, I included some at the end of this post using dput, which should be easier to enter than copy-and-pasting the above
I have made failed attempts at reproducing if statements with multiple conditions (based on: R multiple conditions in if statement & Ifelse in R with multiple categorical conditions) I tried writing this many ways, including with piping from dplyr, but see below for one example of some code, notes, and error messages.
>if(PC$BIRD == "NOBI" & PC$MINUTE==NA){PC$Fix<-TRUE}
Error in if (PC$BIRD == "NOBI" & PC$MINUTE == NA) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In if (PC$BIRD == "NOBI" & PC$MINUTE == NA) { :
the condition has length > 1 and only the first element will be used
## Then I need to do something like this:
>if(PC$Fix<-TRUE){duplicate(row where Fix==TRUE, times=2)} #I know this isn't
### even close, but I want the row to be replicated two more times so
### that there are 3 total rows witht he same values
### Fix indicates that a fix is needed in this example
# Then somehow I need to assign a 1 to PC$MINUTE for the first row (original row),
# a 2 to the next row (with other values from other columns being the same), and a 3
# to the last duplicated row (still other values from other columns being the same)
The second problem, which seems more difficult to me is to search the dataset in order or perhaps by DATE,SITE,POINT, and TIME in some way. The minute values should always go from 1... to 2... to 3, and then back to 1 for the next set of date, time, site, and point. That is, each point count should have all values 1:3. However, one count may have multiple sightings in MINUTE=1 so that there are 5 or 6 (or 20) MINUTE=1 before MINUTE=2. BUT, again, some observers in this dataset simply left a row out when there was NO BIRDS (NOBI), instead of writing a row with BIRD="NOBI" for each MINUTE. That is if the dataset goes:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
...
4 BS 5/9/2018 CW2 U125 7:54 1 AMRO 1
5 BS 5/9/2018 CW2 U125 7:54 1 SPTO 1
6 BS 5/9/2018 CW2 U125 7:57 1 AMRO 1
7 BS 5/9/2018 CW2 U125 7:57 1 SPTO 1
8 BS 5/9/2018 CW2 U125 7:57 1 AMCR 3
9 BS 5/9/2018 CW2 U125 7:57 2 SPTO 1
10 BS 5/9/2018 CW2 U125 7:57 2 HOWR 1
11 BS 5/9/2018 CW2 U125 7:57 3 UNBI 1
We can see that the 7:57 point count time is complete (there are MINUTE values of 1:3). However, the 7:54 point count time stops at MINUTE=1. Meaning, I need to enter two more rows underneath that have all of the same DATE,SITE,POINT,TIME information, but with MINUTE=2 and BIRD="NOBI" for the first added row and MINUTE=3 and BIRD="NOBI" for the second added row. So it should look like this:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
...
4 BS 5/9/2018 CW2 U125 7:54 1 AMRO 1
5 BS 5/9/2018 CW2 U125 7:54 1 SPTO 1
6 BS 5/9/2018 CW2 U125 7:54 2 NOBI NA
7 BS 5/9/2018 CW2 U125 7:54 3 NOBI NA
8 BS 5/9/2018 CW2 U125 7:57 1 AMRO 1
9 BS 5/9/2018 CW2 U125 7:57 1 SPTO 1
10 BS 5/9/2018 CW2 U125 7:57 1 AMCR 3
11 BS 5/9/2018 CW2 U125 7:57 2 SPTO 1
12 BS 5/9/2018 CW2 U125 7:57 2 HOWR 1
13 BS 5/9/2018 CW2 U125 7:57 3 UNBI 1
Lastly, I understand that this is a long and complicated question, and I may not have articulated it very well. Please let me know if there is any clarification needed, and I would be happy to hear any advice, even if it doesn't fully answer my problems. Thank you in advance!
Everything below this line is only useful for you if you want to enter a sample of my data into R
To enter my data into R console, copy and paste everything from "structure" function to end of code to enter it as dataframe in R console with code: dataframe<-structure(list...
See Example of using dput() for help.
PC<-read.csv("PC.csv") ### ORIGINAL FILE
dput(PC)
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "BS", class = "factor"),
DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "5/9/2018", class = "factor"),
SITE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CW2", class = "factor"),
POINT = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M", "U125"), class = "factor"),
TIME = structure(c(8L, 8L, 8L, 9L, 9L, 10L, 10L, 10L, 10L,
10L, 10L, 11L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L,
4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 7L), .Label = c("6:48", "6:51",
"6:54", "6:57", "7:12", "7:15", "7:18", "7:51", "7:54", "7:57",
"8:00"), class = "factor"), MINUTE = c(1L, 2L, 3L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L,
1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 3L, 3L, NA, NA), BIRD = structure(c(6L,
6L, 6L, 2L, 7L, 2L, 7L, 1L, 7L, 5L, 8L, 8L, 6L, 6L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 6L, 8L, 3L, 7L, 9L, 5L, 4L, 2L, 6L,
6L), .Label = c("AMCR", "AMRO", "BRSP", "DUFL", "HOWR", "NOBI",
"SPTO", "UNBI", "VESP"), class = "factor"), NUMBER = c(NA,
NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA,
NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA,
NA)), class = "data.frame", row.names = c(NA, -32L))
PCc<-read.csv("PC_Corrected.csv") #### WHAT I NEED MY DATABASE TO LOOK LIKE
dput(PCc)
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "BS", class = "factor"), DATE = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "5/9/2018", class = "factor"),
SITE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "CW2", class = "factor"), POINT = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M",
"U125"), class = "factor"), TIME = structure(c(8L, 8L, 8L,
9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L,
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L,
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("6:48",
"6:51", "6:54", "6:57", "7:12", "7:15", "7:18", "7:51", "7:54",
"7:57", "8:00"), class = "factor"), MINUTE = c(1L, 2L, 3L,
1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 1L, 1L,
2L, 3L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), BIRD = structure(c(6L,
6L, 6L, 2L, 7L, 6L, 6L, 2L, 7L, 1L, 7L, 5L, 8L, 8L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 6L, 6L, 7L, 7L, 6L, 8L, 3L,
7L, 9L, 5L, 4L, 2L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("AMCR",
"AMRO", "BRSP", "DUFL", "HOWR", "NOBI", "SPTO", "UNBI", "VESP"
), class = "factor"), NUMBER = c(NA, NA, NA, 1L, 1L, NA,
NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA,
NA, 1L, 1L, NA, NA, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-42L))