I am trying to use the R merge function to combine two data.frames, but keep getting the following error:

Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column

I am not sure what this error means or how to resolve it.

My code thus far is the following:


library(tidyverse)  # provides read_csv(), %>% and str_split()

movies <- read_csv("movies.csv")

firsts = vector(length = nrow(movies))
for (i in 1:nrow(movies)) {
  firsts[i] = movies$director[i] %>% str_split(" ", n = 2) %>% unlist %>% .[1]
}

movies$firsts = firsts

movies <- movies[-c(137, 147, 211, 312, 428, 439, 481, 555, 602, 830, 850, 1045, 1080, 1082, 1085, 1096, 1255, 1258, 1286, 1293, 1318, 1382, 1441, 1456, 1494, 1509, 1703, 1719, 1735, 1944, 1968, 1974, 1977, 2098, 2197, 2409, 2516, 2546, 2722, 2751, 2988, 3191,
3227, 3270, 3283, 3285, 3286, 3292, 3413, 3423, 3470, 3480, 3511, 3676, 3698, 3826, 3915, 3923, 3954, 4165, 4381, 4385, 4390, 4397, 4573, 4711, 4729, 4774, 4813, 4967, 4974, 5018, 5056, 5258, 5331, 5405, 5450, 5469, 5481, 4573, 5708, 5715, 5786, 5886, 5888, 5933, 5934, 6052, 6091, 6201, 6234, 6236, 6511, 6544, 6551, 6562, 6803, 4052, 4121, 4326),]
movies <- movies[-c(4521,5846),]

g <- gender_df(movies, name_col = "firsts", year_col = "year", method = c("ssa"))

merge(movies, g, by = c("firsts", "name"), all = FALSE)

Cettt
Lee
  • What is the output of `base::intersect(names(movies),names(g))`?! – NelsonGon May 03 '19 at 14:14
  • The output I think is, "name". Sorry, I am new to R. – Lee May 03 '19 at 14:16
  • What is the function `gender_df`? [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. That includes a sample of data and the *minimal* code needed for the issue – camille May 03 '19 at 17:23

1 Answer

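I think you are passing an invalid value to the by argument. Every column named in by must exist in both data frames, and the error comes from fix.by(by.y, y), which means that at least one of "firsts" and "name" is not a column of g. The documentation says: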

By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y. The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each. For the precise meaning of ‘match’, see match.

In your case, name the key column separately for each data frame:

merge(x = movies, y = g, by.x = "firsts", by.y = "name", all = FALSE)
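As a minimal sketch of the difference, here are two small made-up data frames. The values are invented for illustration and are not taken from movies.csv; only the column names firsts and name mirror the question, and the sketch assumes that movies has its own name column while g does not have a firsts column.

```r
# Toy stand-ins for the question's data frames (values are invented).
movies <- data.frame(
  name   = c("Film A", "Film B", "Film C"),
  firsts = c("Jane", "John", "Alex"),
  stringsAsFactors = FALSE
)
g <- data.frame(
  name   = c("Jane", "John"),
  gender = c("female", "male"),
  stringsAsFactors = FALSE
)

# Columns the two data frames share; merge() uses these by default.
shared <- intersect(names(movies), names(g))

# by = c("firsts", "name") requires both columns in both data frames,
# so it fails here because g has no column called firsts.
failed <- inherits(
  try(merge(movies, g, by = c("firsts", "name")), silent = TRUE),
  "try-error"
)

# by.x / by.y name the key column separately for each data frame.
merged <- merge(movies, g, by.x = "firsts", by.y = "name", all = FALSE)
```

With all = FALSE only the rows whose key appears on both sides are kept, so the director without a match in g is dropped from merged.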
camille
Elie Ker Arno
  • Thank you! It no longer returns an error, but my resulting dataframe contains many more observations than the two dataframes I am merging. Do you happen to know why this is occurring? – Lee May 03 '19 at 14:37
  • My guess is that this comes from how `merge()` joins the two data frames: every row of X is paired with every row of Y that has the same key, so repeated keys multiply the rows. Is your new number of observations a multiple of the size of one of your data frames? I searched for your `gender_df` function and couldn't find any package mentioning it. Could you tell me more about it and/or write its specification? I just had a thought about the merge() function and the syntax of this function. – Elie Ker Arno May 03 '19 at 15:12
  • It works now, thank you! I had forgotten that I made a syntax mistake in some prior code using gender_df. Thank you for your help :) – Lee May 03 '19 at 16:59
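
The row growth raised in the comments is typical of a merge on a key that is repeated in both data frames: each matching row on one side is paired with each matching row on the other. Below is a minimal sketch with invented values. The assumption that the lookup table holds the same name more than once is hypothetical, since the thread does not show g.

```r
# Invented example: the key "Jane" occurs twice on each side.
movies <- data.frame(
  title  = c("Film A", "Film B", "Film C"),
  firsts = c("Jane", "Jane", "John"),
  stringsAsFactors = FALSE
)
g <- data.frame(
  name   = c("Jane", "Jane", "John"),
  gender = c("female", "female", "male"),
  stringsAsFactors = FALSE
)

# 2 x 2 pairings for "Jane" plus 1 x 1 for "John" gives 5 rows, not 3.
inflated <- merge(movies, g, by.x = "firsts", by.y = "name")

# Count how often each key value occurs in the lookup table.
key_counts <- table(g$name)

# Keeping one row per key in the lookup table restores one row per movie.
g_unique <- g[!duplicated(g$name), ]
deduped  <- merge(movies, g_unique, by.x = "firsts", by.y = "name")
```

Whether dropping duplicates is appropriate depends on why the keys repeat. If the repeated rows differ in another column, such as a year, that column probably belongs in the merge key as well.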