0

I am using R and trying to substitute the numbers in the 'owners' columns to its average, and somehow it is keep showing numbers I did not input.

This is how the original chart looks like

And this is how the result looks like

when my code is

name <- 
data.frame(orig = c("50000000-100000000", "1000000-2000000", "5000000-10000000", "2000000-5000000", 
                          "500000-1000000", "200000-500000", "100000-200000", "50000-100000", "20000-50000", 
                          "0-20000", "10000000-20000000", "100000000-200000000", "20000000-50000000"),
                 average = c("75000000", "1500000", "7500000", "3500000", "750000", "350000", "150000", 
                             "75000", "35000", "10000", "15000000", "150000000", "35000000"))
    for (i in 1:NROW(name)){steam2$owners <- sub(name[i,1], name[i,2], steam2$owners)}

The "10000000-20000000" is supposed to be "15000000" but it is showing "100000010000000" for some strange unknown reasons. Can someone help me to fix this issue?

I am using this way to show the picture as I cannot embed images yet.

ekoam
  • 8,744
  • 1
  • 9
  • 22
Lyn
  • 103
  • 2

2 Answers2

0

your first problem is, that those are not numbers but strings. hard to get the mean of a character string. second you might want to read up on regular expressions (regex)

you can check with str(name)

'data.frame':   13 obs. of  2 variables:
 $ orig   : chr  "50000000-100000000" "1000000-2000000" "5000000-10000000" "2000000-5000000" ...
 $ average: chr  "75000000" "1500000" "7500000" "3500000" ...

then:

orig <-as.numeric(sub(".+-", "", na.omit(name$orig))) #.+ will take everything before/after the "-" depending where it is

orig2<-as.numeric(sub("-.+", "", na.omit(name$orig)))

name$orig2<-orig2
name$orig_mean<-as.integer((orig+orig2)/2)

                  orig   average orig2 orig_mean
1   50000000-100000000  75000000 5e+07  75000000
2      1000000-2000000   1500000 1e+06   1500000
3     5000000-10000000   7500000 5e+06   7500000
4      2000000-5000000   3500000 2e+06   3500000
5       500000-1000000    750000 5e+05    750000
6        200000-500000    350000 2e+05    350000
7        100000-200000    150000 1e+05    150000
8         50000-100000     75000 5e+04     75000
9          20000-50000     35000 2e+04     35000
10             0-20000     10000 0e+00     10000
11   10000000-20000000  15000000 1e+07  15000000
12 100000000-200000000 150000000 1e+08 150000000
13   20000000-50000000  35000000 2e+07  35000000

this might help in the future: How to extract everything after a specific string?

D.J
  • 1,180
  • 1
  • 8
  • 17
  • I have used your code and it worked more smoothly speedwise, however the parts I had error with are now all converted into _NA_, which I didn't expect. What should I do about that? – Lyn Oct 29 '20 at 06:29
  • i did not get any errors - ca you show where/on which parts you got the NA-Errors? – D.J Oct 29 '20 at 06:43
  • every rows that were originally '10000000-20000000' are now NA errors now – Lyn Oct 29 '20 at 12:21
  • please update your current code into your question. in my code there are no errors so i can't replicate – D.J Oct 29 '20 at 13:08
  • 1
    Never mind. I think there was some error within R studio, not the code. It ran smoothly when I restarted the computer. Thank you. – Lyn Oct 29 '20 at 14:41
0

As mentioned by @D.J you have character values. You can split the string on "-", create a dataframe with two columns and then take rowwise mean.

df <- do.call(rbind.data.frame, strsplit(name$orig, '-'))
names(df) <- paste0('col', seq_along(df))
df <- type.convert(df, as.is = TRUE)
df$avg <- rowMeans(df)
df

#        col1      col2       avg
#1   50000000 100000000  75000000
#2    1000000   2000000   1500000
#3    5000000  10000000   7500000
#4    2000000   5000000   3500000
#5     500000   1000000    750000
#6     200000    500000    350000
#7     100000    200000    150000
#8      50000    100000     75000
#9      20000     50000     35000
#10         0     20000     10000
#11  10000000  20000000  15000000
#12 100000000 200000000 150000000
#13  20000000  50000000  35000000

If you are ok using packages you can also do :

library(dplyr)
name %>%
  select(-average) %>%
  tidyr::separate(orig, c('col1', 'col2'), sep = '-', convert = TRUE) %>% 
  mutate(average  = rowMeans(., na.rm = TRUE))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213