0

This is an unusual and difficult question which has perplexed me for a number of days and I hope I explain it correctly. I have two databases i.e. data-frames in R, the first is approx 90,000 rows and is a record of every race-horse in the UK. It contains numerous fields, and most importantly the NAME of each horse and its SIRE; one record per horse First database, sample and fields. The second database contains over one-million rows and is a history of every race a horse has taken part in over the last ten years i.e. races it has run or as I call it 'appearances', it contains NAME, DATE, TRACK etc..; one record per appearance.Second database, sample and fields

What I am attempting to do is to write a few lines of code - not a loop - that will provide me with a total number of every appearance made by the siblings of a particular horse i.e. one grand total. The first step is easy - finding the siblings i.e. horses with a common sire - and you can see it below (N.B FindSire is my own function which does what it says and finds the sire of a horse by referencing the same dataframe. I have simplified the code somewhat for clarity)

 TestHorse <- "Save The Bees"

 Siblings <- which(FindSire(TestHorse) == Horses$Sire)

 Sibsname <- Horses[sibs,1]

The produces Sibsname which is a 636 names long (snippet below), although the average horse will only have 50 or so siblings. I could construct a loop and search the second 'appearances' data-frame and individually match the sibling names and then total the appearances of all the siblings combined. However, I would like to know if I could avoid a loop - and the time associated with it - and write a few lines of code to achieve the same end i.e. search all 636 horses in the appearances database and calculate the times each appears in the database and a total of all these appearances, or to put it another way, how many races have the siblings of "save the bees" taken part in. Thanks in advance.

 [1] "abdication "         "aberdonian "         "acclamatory "        "accolation "  ..... to [636] 
LostAtSea
  • 37
  • 1
  • 6
  • 1
    Please see this link to learn how to provide a reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – www Aug 20 '17 at 19:52

1 Answers1

1

Using dplyr, calling your "first database" horses and your "second database" races:

library(dplyr)
test_horse = "Save The Bees"

select(horses, Name, Sire) %>%
  filter(Sire == Sire[Name == tolower(test_horse)]) %>%
  inner_join(races, c("Name" = "SELECTION_NAME")) %>%
  summarize(horse = test_horse, sibling_group_races = n()) 

I am making the assumption that you want the number of appearances of the sibling group to include the appearances of the test horse - to omit them instead add , Name != tolower(test_horse) to the filter() command.

As you haven't shared data reproducibly, I cannot test the code. If you have additional problems I will not be able to help you solve them unless you share data reproducibly. ycw's comment has a helpful link for doing that - I would encourage you to edit your question to include either (a) code to simulate a small sample of data, or (b) use dput() on an small sample of your data to share a few rows in a copy/pasteable format.

The code above will do for querying one horse at a time - if you intend to use it frequently it would be much simpler to just create a table where each row represents a sibling group and contains the number of races. Then you could just reference the table instead of calculating on the fly every time. That would look like this:

sibling_appearances = 
  left_join(horses, races, by = c("Name" = "SELECTION_NAME")) %>%
  group_by(Sire) %>%
  summarize(offspring_appearances = n())
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thank you, great answer. Sorry, I did not have sufficient data for you to check the result but I am new to posting questions - but will do so for the next question I post. The 'index' was what I was actually looking for as you correctly surmised. I have one index already for another portion of the code but it was generated by a series of loops which was too slow. – LostAtSea Aug 21 '17 at 20:16