
I have this data table:

Year    GDP
1998–99 <U+20B9>1,668,739
1999–00 <U+20B9>1,858,205
2000–01 <U+20B9>2,000,743
2001–02 <U+20B9>2,175,260
2002–03 <U+20B9>2,343,864
2003–04 <U+20B9>2,625,819
2004–05 <U+20B9>2,971,464
2005–06 <U+20B9>3,390,503
2006–07 <U+20B9>3,953,276
2007–08 <U+20B9>4,582,086
2008–09 <U+20B9>5,303,567
2009–10 <U+20B9>6,108,903
2010–11 <U+20B9>7,248,860
2011–12 <U+20B9>8,391,691
2012–13 <U+20B9>9,388,876

What I want to do is remove "<U+20B9>" from all of the rows. How can I do it?

I tried grepl and grep, but neither worked for me:

df[!grepl("<U+20B9>", df$GDP),]

df[ grep("REVERSE", df$Name, invert = TRUE) , ]

These do not work for me...

What I want is something like this:

Year    GDP
1998–99 1,668,739
1999–00 1,858,205
2000–01 2,000,743
2001–02 2,175,260
2002–03 2,343,864
2003–04 2,625,819
2004–05 2,971,464
2005–06 3,390,503
2006–07 3,953,276
2007–08 4,582,086
2008–09 5,303,567
2009–10 6,108,903
2010–11 7,248,860
2011–12 8,391,691
2012–13 9,388,876

I also tried the solution from "How to identify/delete non-UTF-8 characters in R", but it did not work for me either:

x <- "<U+20B9>"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8",sub='')

This returns "<U+20B9>" unchanged.
Madhu Sareen
  • Possible duplicate of [How to identify/delete non-UTF-8 characters in R](http://stackoverflow.com/questions/17291287/how-to-identify-delete-non-utf-8-characters-in-r) – r2evans Mar 24 '17 at 20:14
  • I found a solution: `df$GDP <- substring(df$GDP, 2)` – Madhu Sareen Mar 24 '17 at 20:44

2 Answers

A data.table attempt with some example data:

library(data.table)

data <- setDT(data.frame(
 Year=c('1998–99', 
     '1999–00', 
     '2000–01', 
     '2001–02', 
     '2002–03', 
     '2003–04', 
     '2004–05', 
     '2005–06', 
     '2006–07', 
     '2007–08'),
 GDP=c('<U+20B9>1,668,739',
    '<U+20B9>1,858,205',
    '<U+20B9>2,000,743',
    '<U+20B9>2,175,260',
    '<U+20B9>2,343,864',
    '<U+20B9>2,625,819',
    '<U+20B9>2,971,464',
    '<U+20B9>3,390,503',
    '<U+20B9>3,953,276',
    '<U+20B9>4,582,086')))

data[, GDP := sub("^\\s*<U\\+\\w+>\\s*", "", GDP)]

The regular expression pattern can be read as follows:

  1. U\\+ matches the literal sequence U+

  2. \\w+ matches one or more word characters (letters, digits or underscore)

  3. these two parts are wrapped in < and >, and the \\s* on either side matches any surrounding whitespace so that it is removed as well; the leading ^ anchors the match at the start of the string
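As a rough sketch of what the pattern does and does not match, the snippet below applies it to two small hypothetical vectors (`literal` and `actual` are names made up for illustration). It assumes the column holds either the eight-character text "<U+20B9>" or a real rupee sign written with the escape "\u20b9".

```r
# Literal text "<U+20B9>" (plain ASCII characters), as in the example data
literal <- c("<U+20B9>1,668,739", "<U+20B9>1,858,205")
cleaned_literal <- sub("^\\s*<U\\+\\w+>\\s*", "", literal)

# An actual rupee sign (a single character), written with a Unicode escape
actual <- c("\u20b91,668,739", "\u20b91,858,205")

# The pattern looks for the text "<U+...>", so these strings are left alone
unchanged <- sub("^\\s*<U\\+\\w+>\\s*", "", actual)

# Matching the character itself as a fixed string removes it
cleaned_actual <- sub("\u20b9", "", actual, fixed = TRUE)
```

Which of the two cases applies depends on how the data was read in, so it may be worth checking `nchar()` on one value first.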

jg_r
  • I also found a solution: `df$GDP <- substring(df$GDP, 2)` – Madhu Sareen Mar 24 '17 at 20:53
  • This doesn't work. Your sample data uses the literal string `<U+20B9>`, which is how R *represents* (but does not *store*) a Unicode character. (Example: type in `"\u20b9"`.) As such, `sub`ing for the literal `<U+20B9>` does not work. – r2evans Mar 24 '17 at 21:04

The shortest answer to the above is:

df$GDP <- substring(df$GDP, 2)
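A minimal sketch of this on a two-row hypothetical data frame, assuming every GDP value starts with exactly one rupee sign stored as a single character ("\u20b9"). `substring(x, 2)` drops the first character whatever it is, so it would also eat a digit if the symbol were missing. The `GDP_num` column is an illustrative extra, not part of the original question.

```r
# Hypothetical two-row version of the table from the question
df <- data.frame(
  Year = c("1998-99", "1999-00"),
  GDP  = c("\u20b91,668,739", "\u20b91,858,205"),
  stringsAsFactors = FALSE
)

# Keep everything from the second character onwards
df$GDP <- substring(df$GDP, 2)

# Optional: strip the thousands separators and convert to numeric
df$GDP_num <- as.numeric(gsub(",", "", df$GDP, fixed = TRUE))
```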
Madhu Sareen