0

I have a dataframe with information for individual birds (identified by a band number), called INFO_Bands. Because birds can have multiple band numbers over their lifetime, each bird also gets a unique birdID. I have a dataframe with the BirdIDs and their corresponding band numbers (sometimes a birdID has one band number, sometimes several), called BirdID_BandNumber. I'd like to add a column with the correct bird ID to my dataframe of information that only has a bird's band number (INFO_Bands).

The dataframe I want to add the birdID to (INFO_Bands) looks something like this. Some birds have multiple entries that I would like to keep, others only have one (for this reason, I don't think "merging" my files will work).

Example of INFO_Bands

Band Number - Score -  Sex -  Island - Sample Date

1234-12345     1        M      MSI      12 Jun

1234-12345     1        M      MR       8 Aug

9876-98765     3        F      MSI      5 May

4567-7890      1        F      MSI      14 Jun

Example of BirdID_BandNumber

Band Number - BirdID

1234-12345     MSI45678

1234-12345     MSI45678

9876-98765     MR1234

4567-7890      MR1234

Desired outcome:

Band Number - Score -  Sex -  Island - Sample Date - BirdID

1234-12345     1        M      MSI      12 Jun        MSI45678

1234-12345     1        M      MR       8 Aug         MSI45678

9876-98765     3        F      MSI      5 May         MR1234

4567-7890      1        F      MSI      14 Jun        MR1234

Any help is appreciated!

Quinn
  • 59
  • 6
  • I find your explanation is hard to follow, would be much easier if you could post a dummy version of the two dataframes (and code to make them) and the desired output (doing this manually as your dummy only needs to have a handful of cases) – rg255 Mar 04 '20 at 20:05
  • Hi, I imported large excel datasheets into R, the data frame with all of the information for my target birds has 1915 obs. of 28 variables. The data frame that has the BirdIDs and their corresponding band number(s) has 44190 obs. of 2 variables. Due to the size of these, I can't provide dummy versions. Unless I'm misunderstanding your question. Thanks! – Quinn Mar 04 '20 at 20:14
  • Just make an example with 2-3 birds and a couple of variables, just enough to replicate your problem because its unclear why merging *won't* be suitable – rg255 Mar 04 '20 at 20:16
  • I edited my post, I hope that clarifies your question. When I tried merging, I ended up with several duplicates I didn't want (which confused the duplicate entries I *do* want, ie. the same band number but on different dates). – Quinn Mar 04 '20 at 20:27
  • Please provide a way to reproduce your problem! – rg255 Mar 04 '20 at 20:39

1 Answers1

0

1.Create example Data (I changed it a bit to make it easy for me to load the data, a minimal reproducible example would have been useful)

INFO_Bands <- read.table(text='BandNumber Score Sex Island SampleDate
1234-12345     1        M      MSI      12Jun
1234-12345     1        M      MR       8Aug
9876-98765     3        F      MSI      5May
4567-7890      1        F      MSI      14Jun', header=TRUE, stringsAsFactors=FALSE)

BirdID_BandNumber <- read.table(text='BandNumber BirdID
1234-12345     MSI45678
1234-12345     MSI45678
9876-98765     MR1234
4567-7890      MR1234', header=TRUE, stringsAsFactors=FALSE)

Now if we do what you want (get the BirdID belonging to a Band Number) we get:

merge(INFO_Bands, BirdID_BandNumber, by="BandNumber")

  BandNumber Score Sex Island SampleDate   BirdID
1 1234-12345     1   M    MSI      12Jun MSI45678
2 1234-12345     1   M    MSI      12Jun MSI45678
3 1234-12345     1   M     MR       8Aug MSI45678
4 1234-12345     1   M     MR       8Aug MSI45678
5  4567-7890     1   F    MSI      14Jun   MR1234
6 9876-98765     3   F    MSI       5May   MR1234

The extra rows are here because it's not clear which of the two entries for BandNumber "1234-12345" in BirdID_BandNumber should be used... so R uses both! (Coincidentally the values for BirdID are the same here).

Solution, assuming we want to use the first entry if multiple entries are foud:

merge(INFO_Bands[!duplicated(INFO_Bands$BandNumber),], BirdID_BandNumber, by="BandNumber")

to remove duplicate BandNumber, or

merge(unique(INFO_Bands), BirdID_BandNumber, by="BandNumber")

to remove duplicate rows:

or:

library(dplyr)
INFO_Bands %>% 
  left_join(BirdID_BandNumber %>% group_by(BandNumber) %>% summarise(BirdID = first(BirdID)))

to select the first entry (or implement other logic).

All return:

BandNumber Score Sex Island SampleDate   BirdID
1 1234-12345     1   M    MSI      12Jun MSI45678
2 1234-12345     1   M     MR       8Aug MSI45678
3 9876-98765     3   F    MSI       5May   MR1234
4  4567-7890     1   F    MSI      14Jun   MR1234
dario
  • 6,415
  • 2
  • 12
  • 26
  • Thank you for your help, I didn't know about the "unique" operation within the "merge" function! I think that will help a lot. Also, thanks for pointing me to the help page for reproducing my data within the forum, I'm new to using this site. – Quinn Mar 05 '20 at 21:26