
I am working with a csv file which was originally formatted in Excel. I want to convert the Rate column to numeric and remove the "$" sign.

I read in the file with:

NImp <- read.csv("National_TV_Spots 6_30_14 to 8_31_14.csv", sep=",", header=TRUE, stringsAsFactors=FALSE, strip.white=TRUE, na.strings=c("Not Monitored"))

The data frame looks like this:

HH.IMP..000.       ISCI                                          Creative          Program  Rate
1           NA     IT3896 Rising Costs30 (Opportunity Scholar - No Nursing)      NUVO CINEMA $0.00
2           NA     IT3896 Rising Costs30 (Opportunity Scholar - No Nursing)      NUVO CINEMA $0.00
3          141    IT14429 Rising Costs30 (Opportunity Scholar - No Nursing)            BONUS $0.00
4          476 ITES15443H     Matthew Traina (B. EECT/A. CEET) :60 (no loc) Law & Order: SVU $0.00
5           NA     IT3896 Rising Costs30 (Opportunity Scholar - No Nursing)      NUVO CINEMA $0.00

When I do the conversion, I get a warning message and all values get coerced to NA:

NImp$Rate <- as.numeric(gsub("$", "", NImp$Rate))
Warning message:
NAs introduced by coercion
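For reference, a quick check on one of the sample values shows that this call does not actually strip the dollar sign (the unescaped $ is treated as a regular expression matching the end of the string), so as.numeric still sees the original text:

gsub("$", "", "$0.00")
[1] "$0.00"
as.numeric("$0.00")
[1] NA
Warning message:
NAs introduced by coercion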

I also tried NImp$Rate <- as.numeric(sub("\\$", "", NImp$Rate)), but again got the same warning message. However, this time not all values became NAs, only specific ones. I opened the csv in Excel to check and realized that Excel keeps some columns too narrow, so those cells display as "####". It is exactly these cells that are being coerced to NA by R.

I also tried opening the file in Notepad and reading that file into R, but I get the same results. The values are displayed correctly both in Notepad and when I read the file into R, but when I convert to numeric, everything that shows as "####" in Excel becomes NA.
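For reference, here is a small diagnostic sketch (assuming the data frame NImp as read in above) that lists the raw strings which refuse to convert:

# values that were not NA in the raw data but become NA after the attempted conversion
converted <- suppressWarnings(as.numeric(sub("\\$", "", NImp$Rate)))
unique(NImp$Rate[is.na(converted) & !is.na(NImp$Rate)])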

What should I do?

Adding the output of str(NImp):

'data.frame':   9859 obs. of  19 variables:
$ Spot.ID         : int  13072903 13072904 13072898 13072793 13072905 13072899 13072397 13072476 13072398 13072681 ...
$ Date            : chr  "6/30/2014" "6/30/2014" "6/30/2014" "6/30/2014" ...
$ Hour            : int  0 0 0 0 0 0 1 1 1 2 ...
$ Time            : chr  "12:08 AM" "12:20 AM" "12:29 AM" "12:30 AM" ...
$ Local.Date      : chr  "6/30/2014" "6/30/2014" "6/30/2014" "6/30/2014" ...
$ Broadcast.Week  : int  1 1 1 1 1 1 1 1 1 1 ...
$ Local.Hour      : int  0 0 0 0 0 0 1 1 1 2 ...
$ Local.Time      : chr  "12:08 AM" "12:20 AM" "12:29 AM" "12:30 AM" ...
$ Market          : chr  "NATIONAL CABLE" "NATIONAL CABLE" "NATIONAL CABLE" "NATIONAL CABLE" ...
$ Vendor          : chr  "NUVO" "NUVO" "AFAM" "USA" ...
$ Station         : chr  "NUVO" "NUVO" "AFAM" "USA" ...
$ M18.34.IMP..000.: int  NA NA 3 88 NA 3 NA 53 NA 37 ...
$ W18.34.IMP..000.: int  NA NA 86 66 NA 86 NA 70 NA 60 ...
$ A18.34.IMP..000.: int  NA NA 89 154 NA 89 NA 123 NA 97 ...
$ HH.IMP..000.    : int  NA NA 141 476 NA 141 NA 461 NA 434 ...
$ ISCI            : chr  "IT3896" "IT3896" "IT14429" "ITES15443H" ...
$ Creative        : chr  "Rising Costs30 (Opportunity Scholar - No Nursing)" "Rising Costs30 (Opportunity Scholar - No Nursing)" "Rising Costs30 (Opportunity Scholar - No Nursing)" "Matthew Traina (B. EECT/A. CEET) :60 (no loc)" ...
$ Program         : chr  "NUVO CINEMA" "NUVO CINEMA" "BONUS" "Law & Order: SVU" ...
$ Rate            : chr  "$0.00" "$0.00" "$0.00" "$0.00" ...
vagabond
  • I can't reproduce your problem. `as.numeric(sub("\\$", "", df$Rate))` works for me – Rich Scriven Oct 17 '14 at 16:10
  • I'm not sure what causes the problem, but 1) "\\$" is the right regular expression to use ($ just matches the end of a string) and 2) the way Excel displays the csv file is irrelevant, so don't worry about the #### – konvas Oct 17 '14 at 16:10
  • The "####" cell is actually "$1200.00". The csv files are formatted in Excel. If I save it from Notepad, the rows and columns go for a toss; something to do with a comma value, I guess. Excel reduces the column width of "$1200.00" and it becomes "####", which R reads as `NA`. It's confounding. – vagabond Oct 17 '14 at 16:16
  • The ### is just the way Excel displays it; if you increase the width of the column it will show the actual value, so it's irrelevant. Check your export options when you export to csv from Excel. What do you mean by "the rows and columns go for a toss"? – konvas Oct 17 '14 at 16:17
  • No, it appears fine in Notepad and in R. But the moment I change to numeric and remove the $ in R, I still get NAs where the csv showed #### in Excel, even if I read in the csv from Notepad. – vagabond Oct 17 '14 at 16:23
  • OK, can you post the output of `str(NImp)` after reading it in R and before you do any operations on it? – konvas Oct 17 '14 at 16:27
  • Done! I just tried saving it as CSV using Google Drive, but same problem: any cell which Excel forces to "####" by reducing the column width becomes NA. – vagabond Oct 17 '14 at 16:42
  • I can reproduce this when the value in the csv file looks like `$1,200.00`. If you remove the dollar sign and try to make this numeric you will get `NA`. You need to remove the commas, as well, before you can convert to numeric. You can do this with `as.numeric(gsub("[$,]", "", ...))` (found code in a comment to [this answer](http://stackoverflow.com/a/7338251/2461552)). – aosmith Oct 17 '14 at 17:25
  • @aosmith That worked! Post it as an answer. – vagabond Oct 17 '14 at 18:02

1 Answer


When a column is formatted as "Currency" in Excel, values in the thousands or greater contain a comma as well as the dollar-sign prefix, so a value might look like $1,200.00. The problem was that you were removing the dollar signs but not the commas, so the conversion to numeric still produced NA for those values.

as.numeric(c("0", "0", "1,200"))
[1]  0  0 NA
Warning message:
NAs introduced by coercion 

You can remove the dollar signs and commas in one step using gsub with the character class [$,], which matches either character. I found an example of how to do this in a comment to this answer: http://stackoverflow.com/a/7338251/2461552

as.numeric(gsub("[$,]", "", c("$0", "$0", "$1,200")))
[1]    0    0 1200

So the code that should work for your dataset is

as.numeric(gsub("[$,]", "", NImp$Rate))
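Applied to the full column and sanity-checked (a quick sketch; any NAs left after this should be cells that really had no usable number, e.g. the na.strings values from the import, rather than currency formatting):

NImp$Rate <- as.numeric(gsub("[$,]", "", NImp$Rate))  # strip $ and thousands separators, then convert
summary(NImp$Rate)
sum(is.na(NImp$Rate))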
aosmith
  • That's exactly what happened. Every number with a comma had that problem. – vagabond Oct 17 '14 at 18:16
  • Related: I have certain values in the NImp$Rate column which are already plain numeric, like "4167". Applying the above function turns them into NAs. Any solution to selectively apply it? – vagabond Oct 29 '14 at 16:52
  • @vagabond I can't reproduce - `as.numeric(gsub("[$,]", "", c("$0", "$0", "$1,200", "4167")))` works for me. Maybe ask a new question? – aosmith Oct 29 '14 at 18:03