R studio how to solve a naming substitution error

Question

I am using R and trying to substitute the numbers in the 'owners' columns to its average, and somehow it is keep showing numbers I did not input.

This is how the original chart looks like

And this is how the result looks like

when my code is

name <- 
data.frame(orig = c("50000000-100000000", "1000000-2000000", "5000000-10000000", "2000000-5000000", 
                          "500000-1000000", "200000-500000", "100000-200000", "50000-100000", "20000-50000", 
                          "0-20000", "10000000-20000000", "100000000-200000000", "20000000-50000000"),
                 average = c("75000000", "1500000", "7500000", "3500000", "750000", "350000", "150000", 
                             "75000", "35000", "10000", "15000000", "150000000", "35000000"))
    for (i in 1:NROW(name)){steam2$owners <- sub(name[i,1], name[i,2], steam2$owners)}

The "10000000-20000000" is supposed to be "15000000" but it is showing "100000010000000" for some strange unknown reasons. Can someone help me to fix this issue?

I am using this way to show the picture as I cannot embed images yet.

D.J · Accepted Answer · 2020-10-29T07:44:47.640

your first problem is, that those are not numbers but strings. hard to get the mean of a character string. second you might want to read up on regular expressions (regex)

you can check with str(name)

'data.frame':   13 obs. of  2 variables:
 $ orig   : chr  "50000000-100000000" "1000000-2000000" "5000000-10000000" "2000000-5000000" ...
 $ average: chr  "75000000" "1500000" "7500000" "3500000" ...

then:

orig <-as.numeric(sub(".+-", "", na.omit(name$orig))) #.+ will take everything before/after the "-" depending where it is

orig2<-as.numeric(sub("-.+", "", na.omit(name$orig)))

name$orig2<-orig2
name$orig_mean<-as.integer((orig+orig2)/2)

                  orig   average orig2 orig_mean
1   50000000-100000000  75000000 5e+07  75000000
2      1000000-2000000   1500000 1e+06   1500000
3     5000000-10000000   7500000 5e+06   7500000
4      2000000-5000000   3500000 2e+06   3500000
5       500000-1000000    750000 5e+05    750000
6        200000-500000    350000 2e+05    350000
7        100000-200000    150000 1e+05    150000
8         50000-100000     75000 5e+04     75000
9          20000-50000     35000 2e+04     35000
10             0-20000     10000 0e+00     10000
11   10000000-20000000  15000000 1e+07  15000000
12 100000000-200000000 150000000 1e+08 150000000
13   20000000-50000000  35000000 2e+07  35000000

this might help in the future: How to extract everything after a specific string?

I have used your code and it worked more smoothly speedwise, however the parts I had error with are now all converted into _NA_, which I didn't expect. What should I do about that? — Lyn, Oct 29 '20 at 06:29
i did not get any errors - ca you show where/on which parts you got the NA-Errors? — D.J, Oct 29 '20 at 06:43
every rows that were originally '10000000-20000000' are now NA errors now — Lyn, Oct 29 '20 at 12:21
please update your current code into your question. in my code there are no errors so i can't replicate — D.J, Oct 29 '20 at 13:08
Never mind. I think there was some error within R studio, not the code. It ran smoothly when I restarted the computer. Thank you. — Lyn, Oct 29 '20 at 14:41

score 0 · Answer 2 · answered Oct 29 '20 at 06:34

As mentioned by @D.J you have character values. You can split the string on "-", create a dataframe with two columns and then take rowwise mean.

df <- do.call(rbind.data.frame, strsplit(name$orig, '-'))
names(df) <- paste0('col', seq_along(df))
df <- type.convert(df, as.is = TRUE)
df$avg <- rowMeans(df)
df

#        col1      col2       avg
#1   50000000 100000000  75000000
#2    1000000   2000000   1500000
#3    5000000  10000000   7500000
#4    2000000   5000000   3500000
#5     500000   1000000    750000
#6     200000    500000    350000
#7     100000    200000    150000
#8      50000    100000     75000
#9      20000     50000     35000
#10         0     20000     10000
#11  10000000  20000000  15000000
#12 100000000 200000000 150000000
#13  20000000  50000000  35000000

If you are ok using packages you can also do :

library(dplyr)
name %>%
  select(-average) %>%
  tidyr::separate(orig, c('col1', 'col2'), sep = '-', convert = TRUE) %>% 
  mutate(average  = rowMeans(., na.rm = TRUE))

R studio how to solve a naming substitution error

2 Answers2