
I've pulled a table from Wikipedia, but I'm getting a bunch of junk along with the population numbers I'm looking for. For instance, I get "!B9840748934017Â 8,244,910" when the number I'm after is just 8,244,910. I've cleaned up the character vector with a regex, using sub('![[:alnum:]]*[[:space:]]', '', x).

This works fine, leaving me with the character vector "8,244,910". When I try to convert it to numeric using as.numeric, however, it gets coerced to NA, and I'm unable to get an integer, no matter what conversions I try. Any thoughts?

Rajshri
zweiler

2 Answers

Try the following:

as.numeric(gsub('![[:alnum:]]*[[:space:]]|[[:punct:]]', '', x))

The problem is that the output of your first attempt still contains commas. as.numeric cannot parse them, so they need to be removed before the conversion.
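To make the cause concrete, here is a minimal sketch. The variable names are made up for illustration, and the input is assumed to be the already-cleaned string from the question:

```r
# Hypothetical cleaned value, as left behind by the question's sub() call
x_clean <- "8,244,910"

# The commas make the string unparseable, so the result is NA
# (suppressWarnings hides the "NAs introduced by coercion" warning)
with_commas <- suppressWarnings(as.numeric(x_clean))

# Dropping the commas first gives a string that as.numeric can parse
without_commas <- as.numeric(gsub(",", "", x_clean, fixed = TRUE))

print(with_commas)     # NA
print(without_commas)  # 8244910
```

Using fixed = TRUE treats the comma as a literal character rather than a regex, which is all that is needed here.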

A5C1D2H2I1M1N2O1R2T1

Ananda's solution gets the job done, but there are two things to watch out for:

  • [:punct:] also matches the dot character (.), which is a valid part of a number;
  • a bulky regex is hard to read, so it's easier to break things up:

    # remove the leading junk token
    num_temp = sub('![[:alnum:]]*[[:space:]]', '', x)

    # remove all commas, then convert to numeric
    num = as.numeric(gsub(",", "", num_temp))
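The first point is easy to check with a value that has a decimal part. This is only a sketch with a made-up input, not a value from the Wikipedia table:

```r
# Hypothetical input with a decimal part
y <- "1,234.5"

# Stripping all punctuation also removes the decimal point
all_punct <- as.numeric(gsub("[[:punct:]]", "", y))

# Stripping only the commas keeps the decimal point intact
commas_only <- as.numeric(gsub(",", "", y, fixed = TRUE))

print(all_punct)    # 12345
print(commas_only)  # 1234.5
```

For whole-number populations the two approaches give the same result, so the difference only matters if the column can hold decimals (or a minus sign, which [:punct:] would strip as well).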

topchef