
I'm trying to write code that compares two CSV files: one is a world list of all bird species and their ranges, and the other lists all the birds in the Himalayas. I need to look up each species from the Himalayan file in the IOC world list and check whether the bird is actually in range (meaning the Range column contains "India", "himalayas" or "s e Asia"). I want a function that takes both data sets, finds where the names match, checks whether the range contains those words, and returns the species where it does NOT, so I can check those birds specifically. Here is what I have so far (I'm using RStudio):

myfunc <- function() { 

if ((bird_data$Scientific.name == ioc$Scientific.name) &      (ioc$Scientific.name!=("Himalayas" | "se Asia" | "India")) {
print(eval(bird_data$Common.Name[bird_data$Scientific.name == ioc$Scientific.name) & (ioc$Scientific.name!=("Himalayas" | "se Asia" | "India")]))
  }
}
save("myfunc", file = "myfunc.Rdata")  
source("myfunc.Rdata")

I think my mistake is that the function has no inputs. So I'm trying a new approach with:

compare = function(data1, data2) {
....
}

But for the above, I don't know how to let the function recognize the appropriate subsets of data (like I can't say data1$Scientific.name).

sschale
  • You can say `data1$Scientific.name`. – ytk Mar 13 '16 at 04:12
  • Also, `ioc$Scientific.name!=("Himalayas" | "se Asia" | "India")` won't work. You have to compare each of them separately. Or you could do something like this: `!(ioc$Scientific.name %in% c('Himalayas', 'se Asia', 'India'))`. – ytk Mar 13 '16 at 04:20
  • Welcome to SO! A few notes: 1. There's no need to save and source; just run the function definition code and it'll show up in RStudio's "Environment" pane. 2. Unless you're going to run this code several times, it's probably not necessary to write it as a function; just save it to variables or print it to the console. 3. If you want a full answer, you need to edit with enough data to produce a [minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). – alistaire Mar 13 '16 at 05:05

1 Answer


It's difficult to answer this question without a minimal reproducible example: without any knowledge of the two dataframes you are comparing, it is hard to formulate a solution. See the link in alistaire's comment above for how to provide one.

I suggest you change your question title to make it more informative. "Creating a function in R" suggests you want to know the syntax required for a function in R. I would recommend "Subsetting a dataframe with grep and then filtering results in R", which is what I think you are actually trying to do.

Assuming you obtained your IOC world list data from the International Ornithological Committee website, I am unsure whether the approach you describe in your function would work, as the data in the column Breeding Range-Subregion(s) is very messy. For example:

w Himalayas to s Siberia and w Mongolia
Himalayas to c China
e Afghanistan to nw India and w Nepal
e Afghanistan to w Tibetan plateau and n India
Africa south of the Sahara, s and se Asia

None of these values is identical to "India", "himalayas" or "SE Asia", so none will be returned by your function, which looks for an exact match. You would need to use grep to find the substrings within your data.
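As a rough illustration of the difference, here is a small sketch that compares an exact test with a substring search. The `ranges` vector is made up for the example (two of its values are copied from the list above), and `exact`, `pattern` and `partial` are just illustrative names:

```r
# Illustrative range strings (assumed sample, not the real IOC file)
ranges <- c("w Himalayas to s Siberia and w Mongolia",
            "e Afghanistan to nw India and w Nepal",
            "Amazonia to n Argentina")

terms <- c("India", "Himalayas", "se Asia")

# Exact comparison: TRUE only if the whole string equals one of the terms
exact <- ranges %in% terms

# Substring search: TRUE wherever any term occurs inside the string
pattern <- paste(terms, collapse = "|")
partial <- grepl(pattern, ranges, ignore.case = TRUE)

print(data.frame(ranges, exact, partial))
```

The exact test finds nothing here because no range string consists of a search term alone, while `grepl` flags the two strings that contain one.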

Let's create a toy data set.

bird_data <- data.frame(
        Scientific.name=c(
          "Chicken Little",
          "Woodstock",
          "Woody Woodpecker",
          "Donald Duck",
          "Daffy Duck",
          "Big Bird",
          "Tweety Pie",
          "Foghorn Leghorn",
          "The Road Runner",
          "Angry Birds"))

ioc_data <- data.frame(
  Scientific.name=c(
          "Chicken Little",
          "Woodstock",
          "Woody Woodpecker",
          "Donald Duck",
          "Daffy Duck",
          "Big Bird",
          "Tweety Pie",
          "The Road Runner",
          "Angry Birds"),
  subrange=c(
    "Australia, New Zealand",
    "w Himalayas to s Siberia and w Mongolia",
    "Himalayas to c China",
    "e Afghanistan to nw India and w Nepal",
    "e Afghanistan to w Tibetan plateau and n India",
    "Africa south of the Sahara, s and se Asia",
    "Amazonia to n Argentina",
    "n Eurasia",
    "n North America"))

I would break what you are attempting to do into two steps.

Step 1

Use grep to subset the ioc_data dataframe based upon whether your search terms are found in the subrange column:

searchTerms <- c("India", "himalayas", "SE Asia")

#Then we use grep to return the indexes of matching rows:

matchingIndexes <- grep(paste(searchTerms, collapse="|"), 
                        ioc_data$subrange,
                        ignore.case=TRUE) #Important so search such as "SE Asia" will match "se asia"

#We can then use our matching indexes to subset our ioc_data dataframe producing
#a subset of data corresponding to our range of interest:

ioc_data_subset <- ioc_data[matchingIndexes,]

Step 2

If I understand your question correctly, you now want to extract the rows from bird_data that ARE NOT present in ioc_data_subset (i.e. which rows in bird_data are for birds that ARE NOT recorded as inhabiting the subranges "India", "SE Asia" or "Himalayas" in the IOC data).

I would use Hadley Wickham's dplyr package for this - a good cheat sheet can be found here. After installing dplyr:

library(dplyr)

#Create a merged dataframe containing all the data in one place.
merged_data <- dplyr::left_join(bird_data,
                ioc_data,
                by = "Scientific.name")

#Use an anti_join to select any rows in merged_data that are NOT present in
#ioc_data_subset

results <- dplyr::anti_join(merged_data,
                ioc_data_subset,
                by = "Scientific.name")

The left_join is required first because otherwise we would not have the subrange column in our final data frame. Note that any species in bird_data that is not in ioc_data gets NA in the subrange column, indicating that no data was found.

 results
  Scientific.name                subrange
1     Angry Birds         n North America
2 The Road Runner               n Eurasia
3 Foghorn Leghorn                    <NA>
4      Tweety Pie Amazonia to n Argentina
5  Chicken Little  Australia, New Zealand
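Since the question asked for a function that takes both data sets as inputs, the same idea could be wrapped up roughly like this. This is only a sketch in base R (`merge` plus `grepl` rather than dplyr). The name `find_out_of_range` and its arguments are hypothetical, and the default column names are assumptions that would need to match your real CSV headers:

```r
# Hypothetical helper: the name, arguments and defaults are illustrative only
find_out_of_range <- function(birds, ioc, terms,
                              name_col = "Scientific.name",
                              range_col = "subrange") {
  pattern <- paste(terms, collapse = "|")
  # Keep every row of `birds`; species missing from `ioc` get NA in range_col
  merged <- merge(birds, ioc[, c(name_col, range_col)],
                  by = name_col, all.x = TRUE)
  # grepl() returns FALSE for NA, so unmatched species are reported too
  in_range <- grepl(pattern, merged[[range_col]], ignore.case = TRUE)
  merged[!in_range, , drop = FALSE]
}

# Tiny made-up inputs to try it out
birds <- data.frame(
  Scientific.name = c("Woodstock", "Tweety Pie", "Foghorn Leghorn"))
ioc <- data.frame(
  Scientific.name = c("Woodstock", "Tweety Pie"),
  subrange = c("w Himalayas to s Siberia and w Mongolia",
               "Amazonia to n Argentina"))

out <- find_out_of_range(birds, ioc, c("India", "himalayas", "SE Asia"))
print(out)
```

Inside the function the arguments are ordinary data frames, so `birds$Scientific.name` or `birds[[name_col]]` work just as they do at the console. Using `[[name_col]]` simply lets the column name be passed in as a string.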
Graeme