
I have two data frames that I want to join on a character column they share. The remaining columns are numeric; one of them holds relatively large numbers (7 or 8 digits), while the others hold 2- to 4-digit numbers.

When I left-join on this character column, one entry in the large-number column becomes NA. Mousing over the column shows that its range is set to 0-40000000, which leaves one entry (40230510) just outside this range.

I tried to solve this by dividing every value in the large-number column by 100 and joining again, but the same entry still becomes NA. The range is now shown as 0-400000.

The largest number in the (undivided) column is 40230510 and the smallest is 8488068. How does R set this column range, and is there a way to avoid getting NA for this entry?

Guello
  • Could you provide a [minimal reproducible example](https://stackoverflow.com/a/5963610/13513328)? – Waldi Apr 14 '21 at 20:43
  • Hard to imagine what's causing it without seeing the data. Maybe the large numbers are stored as characters, but one has an extra space, and dividing by 100 is causing an NA there? It would help if you could filter your data to that row and share it (i.e., if it's in row 5000, you could run `dput(NAME_OF_YOUR_TABLE[5000,])` and share the output). – Jon Spring Apr 14 '21 at 21:08
  • This reproduces my dataset:

    ```
    library(dplyr)  # provides left_join()

    first.col  <- c('Africa', 'America', 'Asia', 'Europe', 'Oceania')
    second.col <- c(1337, 2097, 1723, 760, 295)
    third.col  <- c(54, 35, 48, 48, 14)
    fourth.col <- c(24.75926, 59.91429, 35.89583, 15.83333, 21.07143)
    fifth.col  <- c(24517854, 40230510, 31690657, 22580025, 8488068)

    df1 <- data.frame(first.col, second.col, third.col, fourth.col)
    df2 <- data.frame(first.col, fifth.col)

    # Keep every row of df1 and attach fifth.col by region name
    df3 <- left_join(df1, df2, by = 'first.col')
    ```

    Here, the range is correct though. I don't know what could have caused the range to be different on the real data. – Guello Apr 14 '21 at 21:11
  • `structure(list(region = "America", total.share = 2097, region.country = 35, relative.share = 59.9142857142857, total.area = NA_real_), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))` – Guello Apr 14 '21 at 21:17
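
One thing worth ruling out, in line with the comments above: an NA after a left join can mean that the key did not match (for instance because of a stray space in the character column) or that the value was already NA before the join. Below is a minimal sketch of that check. The data are made up, with a trailing space in `"America "` as a deliberate assumption rather than the asker's actual data; the column names are borrowed from the `dput` output; and base R's `merge(..., all.x = TRUE)` stands in for `left_join` so the snippet runs without extra packages.

```r
# Hypothetical data: the key in the second table carries a trailing space.
df1 <- data.frame(region = c("Africa", "America", "Asia"),
                  total.share = c(1337, 2097, 1723),
                  stringsAsFactors = FALSE)
df2 <- data.frame(region = c("Africa", "America ", "Asia"),
                  total.area = c(24517854, 40230510, 31690657),
                  stringsAsFactors = FALSE)

# Left join in base R: keep every row of df1.
joined <- merge(df1, df2, by = "region", all.x = TRUE)

# Keys that exist in df1 but have no exact match in df2.
unmatched <- setdiff(df1$region, df2$region)

# Was the value already missing before the join?
na_in_source <- anyNA(df2$total.area)

# Strip leading/trailing whitespace from the key and join again.
df2$region <- trimws(df2$region)
fixed <- merge(df1, df2, by = "region", all.x = TRUE)
```

If `unmatched` comes back empty and `na_in_source` is `FALSE` on the real data, the cause probably lies elsewhere and the join keys can be ruled out.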
