
I have some simple data that I imported from the web that I was using to learn about the fread() function. It imported fine, and I have a small, clean dataset on the populations of continents:

> continent_populations
   Rank     Continent Population_2010 Growth_Rate_Percent World_Pop_Percent
1:    1          Asia   4.581.757.408               1.04%            59.69%
2:    2        Africa   1.216.130.000               2.57%            16.36%
3:    3        Europe     738.849.000               0.08%             9.94%
4:    4 North America     579.024.000               0.96%             7.79%
5:    5 South America     422.535.000               1.04%             5.68%
6:    6       Oceania      38.304.000               1.47%             0.54%
7:    7    Antarctica           1.106                   0            <0.01%

All of these variables are chars, but I want to convert the Population_2010, Growth_Rate_Percent, and World_Pop_Percent variables to numerics. I started simply by using transform():

transform(continent_populations, Population_2010 = as.numeric(Population_2010))

However, I get the warning that NA values have been introduced; all of the values are now NA. I read in this previous thread that, for my Population_2010 variable at least, having comma separators rather than periods might cause an error, so I swapped them for periods:

continent_populations$Population_2010 <- gsub(",", ".", continent_populations$Population_2010)

However, as.numeric() still converts all the values to NA. For the other two variables, I assume that the percent signs will need to be removed. First and foremost, I'm just confused as to why the Population_2010 variable won't convert. I also tried the suggested as.numeric(as.character(var)) workaround, but this didn't work (and seemed pointless anyway, since it is already character type).

I want to know how to properly convert between types (not just here, but for use in proper datasets), so I need to know what is going wrong here. Thanks for any help.

Rowan

1 Answer


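Try this solution. The key is to be careful with gsub() and to replace exactly the symbols you intend: the thousands separators in Population_2010 are dots, not commas, and a dot has a special meaning in a regular expression. You can also use trimws() to remove any leading or trailing whitespace from the values. Here is the code: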

#Code
#First remove the literal dots (thousands separators) from population
df$Population_2010 <- trimws(gsub('.', '', df$Population_2010, fixed = TRUE))
#Second remove the percent symbol
df$Growth_Rate_Percent <- trimws(gsub('%', '', df$Growth_Rate_Percent, fixed = TRUE))
#Finally remove the percent and < symbols (a regex alternation, so no fixed = TRUE here)
df$World_Pop_Percent <- trimws(gsub('%|<', '', df$World_Pop_Percent))
#Transform to numeric
df$Population_2010 <- as.numeric(df$Population_2010)
df$Growth_Rate_Percent <- as.numeric(df$Growth_Rate_Percent)
df$World_Pop_Percent <- as.numeric(df$World_Pop_Percent)
str(df)

Output:

str(df)
'data.frame':   7 obs. of  4 variables:
 $ Continent          : chr  "Asia" "Africa" "Europe" "NorthAmerica" ...
 $ Population_2010    : num  4.58e+09 1.22e+09 7.39e+08 5.79e+08 4.23e+08 ...
 $ Growth_Rate_Percent: num  1.04 2.57 0.08 0.96 1.04 1.47 0
 $ World_Pop_Percent  : num  59.69 16.36 9.94 7.79 5.68 ...
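
As for why the original attempts returned NA, here is a minimal sketch in base R. The vector `x` is a hypothetical stand-in for a few Population_2010 values as they appear in the question, and the variable names are only for illustration. It assumes the strings contain nothing other than digits and dots.

```r
# Hypothetical stand-in for a few Population_2010 values as imported
x <- c("4.581.757.408", "738.849.000", "1.106")

# The strings contain no commas, so swapping "," for "." changes nothing
unchanged <- identical(gsub(",", ".", x), x)

# A string with more than one "." is not a valid number,
# so as.numeric() returns NA (normally with a warning)
na_result <- suppressWarnings(as.numeric(x[1:2]))

# A string with a single "." does parse, but as a decimal, not as 1106
decimal_result <- as.numeric(x[3])

# Removing the literal dots first gives the intended whole numbers
clean <- as.numeric(gsub(".", "", x, fixed = TRUE))
```

So the separators have to be removed, not swapped, before as.numeric() can read the values as whole numbers.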

Some data used:

#Data (character columns, as shown in the question)
df <- structure(list(Continent = c("Asia", "Africa", "Europe", "NorthAmerica", 
"SouthAmerica", "Oceania", "Antarctica"), Population_2010 = c("4.581.757.408", 
"1.216.130.000", "738.849.000", "579.024.000", "422.535.000", "38.304.000", 
"1.106"), Growth_Rate_Percent = c("1.04%", "2.57%", "0.08%", "0.96%", 
"1.04%", "1.47%", "0"), World_Pop_Percent = c("59.69%", "16.36%", "9.94%", 
"7.79%", "5.68%", "0.54%", "<0.01%")), row.names = c("1", "2", "3", "4", 
"5", "6", "7"), class = "data.frame")
Duck
  • Thank you! That worked absolutely perfectly. I've always kept regex at arm's length; maybe it's about time I learned it properly. If you don't mind, could you tell me what your `fixed` argument is doing inside the `gsub()` function? – Rowan Sep 16 '20 at 11:56
  • @Rowan Yes: `fixed = TRUE` tells `gsub()` to treat the pattern as a literal string rather than as a regular expression. If you omit it, a pattern such as `.` matches every character, so all of the text can disappear. Symbols like dots tend to have this issue, which is why `fixed` helps to avoid trouble :) You can experiment with a few examples to see what happens. – Duck Sep 16 '20 at 12:21
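
To illustrate the point about `fixed`, here is a small base R sketch. The strings `x` and `p` are hypothetical sample values in the same format as the question's data.

```r
x <- "1.216.130.000"

# Without fixed = TRUE the pattern is a regular expression, where "."
# matches any single character, so every character is removed
regex_result <- gsub(".", "", x)                  # ""

# With fixed = TRUE the pattern is taken literally: only the dots are removed
fixed_result <- gsub(".", "", x, fixed = TRUE)    # "1216130000"

# Escaping the dot inside a regular expression has the same effect
escaped_result <- gsub("\\.", "", x)              # "1216130000"

# The "%|<" pattern relies on regex alternation, so it must not use fixed = TRUE
p <- "<0.01%"
pct_result <- gsub("%|<", "", p)                  # "0.01"
```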