1

I have a CSV file that has a value with comma inside it : "Afganis,tan" I would like to replace the comma with "_" so I use the following gsub statments. However when I print the outcome dataframe I see that the value is broken into two parts: one is in the same column ("Afganis") and the second ("tan") was inserted to next column. How can I overcome this issue so I'll get "Afganis_tan"? Is there a way to insert the gsub function into the read.csv function ? Attached is the related code that I've tried.

df<- read.csv(inFile$datapath, header = input$header, sep = input$sep, quote = "")
ind<-sapply(1:ncol(df), function(x) { class(df[,x])=="factor"})
      print(ind)
      ind<-which(ind==TRUE)
      if(length(ind)==1){

        df[,ind]<-gsub('[^a-zA-Z0-9.]',"_",as.character(df[,ind]))
        df[,ind]<-df('\\,',"_",as.character(df[,ind]))
        df[,ind]<-df(',',"_",as.character(df[,ind]))
        df[,ind]<-as.factor(df[,ind])
      }

Here is the output (line 9 is the problematic):

  campaign_type              country_name           transaction_type cpa_price       TIERS Year_Success_20.
1            CPM              Afgha nistan                   Domestic         0 TIER_4_Asia                0
2            CPM             Afg' hanistan                   Domestic         0 TIER_4_Asia                0
3            CPM           Afgh' (anistan) Cross Optimized Advertiser         0 TIER_4_Asia                1
4            CPM              Afgha-nistan                   Domestic         0 TIER_4_Asia                0
5            CPM         "Afghan""istan""" Cross Optimized Advertiser         0 TIER_4_Asia                0
6            CPM Afghan~!#$%^@&*()_+|istan                   Domestic         0 TIER_4_Asia                0
7            CPM               Afghanistan                   Domestic         0 TIER_4_Asia                0
8            CPM              Afghanistan.                   Domestic         0 TIER_4_Asia                0
9            CPM                   "Afghan                     istan"  Domestic 0.040995321      TIER_4_Asia
10                                                                                                          
11           CPM               Afghanistan                   Domestic         0 TIER_4_Asia                1

Attached is link to the file. File with versions to the string "Afganistan"

mql4beginner
  • 2,193
  • 5
  • 34
  • 73
  • You should make a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and include sample data. `read.csv` really doesn't have anything to do with the transformation after you've read the data into `df`. Just include a dput of `df`. And what are you trying to do wtih `df('\\,',"_",as.character(df[,ind]))` -- you seem to be calling `df()` as a function. And where does `SPY` come from? – MrFlick Dec 13 '16 at 14:39
  • You could use `fread()` function from the `data.table` library. It's possible to use a terminal command as an input instead of a file path and apply the gsub (or the Windows equivalent) on the way in. You can also set the `sep2` parameter to try and detect quote delimiters. – Oliver Frost Dec 13 '16 at 15:00
  • @MrFlick, I added more info as you have suggested. – mql4beginner Dec 13 '16 at 15:04

1 Answers1

2

If you can save the CSV file to use quote identifiers (uses double quotes to separate fields") then you can avoid this problem. Then read.csv() (or my personal preference fread() from the data.table library) can be programmed to look for commas outside of the double quotes to split.

You can also use a terminal command as input in fread() and use a regular expression on the data as it is piped in.

Oliver Frost
  • 827
  • 5
  • 18
  • Thanks @Oliver Forst, Unfortunately I can't use external libraries. – mql4beginner Dec 13 '16 at 15:09
  • 1
    `system("system command with gsub", intern = TRUE)` could give you what you need. Otherwise you might have to simply read it in and manipulate it in R. Also, take 'abc,def' as an example. How do you distinguish cases where the comma is intentional (and no whitespace is present) compared to cases where the comma is accidental and the field should actually read 'abcdef'? The best solution will be to reload that file with a different delimiter or with quoted identifiers. – Oliver Frost Dec 13 '16 at 15:51
  • Thank you @Oliver Forst, thanks for the workaround suggestion, where do I insert this line of code? – mql4beginner Dec 13 '16 at 16:26