2

I would like to read the a CSV into R that is quoted, comma-separated (i.e. sep = "," not sep = ";" as read.csv2 defaults to) but that

  • uses the comma inside fields as the decimal separator
  • contains periods to separate each group of three digits from the right

An example of a problematic entry is "3.051,00" in the final line of the excerpt from the CSV shown.

I tried

dat <- read.csv2("path_to_csv.csv", sep = ",", stringsAsFactors = FALSE)

and a variant using read.csv (both are identical except for their defaults as noted in Difference between read.csv() and read.csv2() in R. Both return improperly-formatted data.frames (e.g. containing 3.051,00).

Can I read this comma-separated file in directly with read.table without having to perform text-preprocessing?

Excerpt of CSV

praf,pmek,plcg,PIP2,PIP3,p44/42,pakts473,PKA,PKC,P38,pjnk
"26,40","13,20","8,82","18,30","58,80","6,61","17,00","414,00","17,00","44,90","40,00"
"35,90","16,50","12,30","16,80","8,13","18,60","32,50","352,00","3,37","16,50","61,50"
"59,40","44,10","14,60","10,20","13,00","14,90","32,50","403,00","11,40","31,90","19,50"
"62,10","51,90","13,60","30,20","10,60","14,30","37,90","692,00","6,49","25,00","91,40"
"75,00","33,40","1,00","31,60","1,00","19,80","27,60","505,00","18,60","31,10","7,64"
"20,40","15,10","7,99","101,00","35,90","9,14","22,90","400,00","11,70","22,70","6,85"
"47,80","19,60","17,50","33,10","82,00","17,90","35,20","956,00","22,50","43,30","20,00"
"59,90","53,30","11,80","77,70","12,90","11,10","37,90","1.407,00","18,80","29,40","16,80"
"46,60","27,10","12,40","109,00","21,90","21,50","38,20","207,00","11,00","31,30","12,00"
"51,90","21,30","49,10","58,80","10,80","58,80","200,00","3.051,00","15,30","39,20","15,70"

Note: I am aware of the question European and American decimal format for thousands, which is not sufficient. This user preprocesses the file they want to read in whereas I would like a direct means of reading a CSV of the kind shown into R.

Anil
  • 1,097
  • 7
  • 20
  • 1
    Related post: https://stackoverflow.com/q/1523126/680068 – zx8754 Feb 07 '22 at 17:29
  • @zx8754 Thank you for sign-posting this. In particular, the answer https://stackoverflow.com/a/3611619/6619692 and the related question [Most elegant way to load csv with point as thousands separator in R](https://stackoverflow.com/questions/30218209/most-elegant-way-to-load-csv-with-point-as-thousands-separator-in-r) provide good answers, but I was hoping for a means of avoiding all regex since I would have thought of this as a more common problem. – Anil Feb 07 '22 at 17:32

2 Answers2

4

Most of it is resolved with dec=",",

# saved your data to 'file.csv'
out <- read.csv("file.csv", dec=",")
head(out)
#   praf pmek  plcg  PIP2  PIP3 p44.42 pakts473    PKA   PKC  P38  pjnk
# 1 26.4 13.2  8.82  18.3 58.80   6.61     17.0 414,00 17.00 44.9 40.00
# 2 35.9 16.5 12.30  16.8  8.13  18.60     32.5 352,00  3.37 16.5 61.50
# 3 59.4 44.1 14.60  10.2 13.00  14.90     32.5 403,00 11.40 31.9 19.50
# 4 62.1 51.9 13.60  30.2 10.60  14.30     37.9 692,00  6.49 25.0 91.40
# 5 75.0 33.4  1.00  31.6  1.00  19.80     27.6 505,00 18.60 31.1  7.64
# 6 20.4 15.1  7.99 101.0 35.90   9.14     22.9 400,00 11.70 22.7  6.85

Only one column is string:

sapply(out, class)
#        praf        pmek        plcg        PIP2        PIP3      p44.42    pakts473         PKA         PKC         P38 
#   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" "character"   "numeric"   "numeric" 
#        pjnk 
#   "numeric" 

This can be resolved post-read with:

ischr <- sapply(out, is.character)
out[ischr] <- lapply(out[ischr], function(z) as.numeric(gsub(" ", "", chartr(",.", ". ", z))))
out$PKA
#  [1]  414  352  403  692  505  400  956 1407  207 3051

If you'd rather read it in without post-processing, you can pipe(.) it, assuming you have sed available[^1]:

out <- read.csv(pipe("sed -E 's/([0-9])[.]([0-9])/\\1\\2/g;s/([0-9]),([0-9])/\\1.\\2/g' < file.csv"))

Notes:

  1. sed is generally available on all linux/macos systems, and on windows computers it is included within Rtools.
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Upvoted. Thank you for this excellent answer. If no answers that avoid regex wholesale are posted, I will accept this answer in 2-3 days. – Anil Feb 07 '22 at 17:34
  • 1
    I understand your plight, and will be impressed if this can be done without any form of regex. This one does have risks, you're probably aware since you're trying to avoid them. However, I think the only way to really preclude the use of regex is to change how the data is exported at the source. If you have control over that, it's a far better/safer alternative. – r2evans Feb 07 '22 at 17:54
1

Like r2evans's comment says, dec = "," takes care of the cases without thousands separators. Then use lapply/gsub to process the other cases, which are still of class "character".

txt <- '
praf,pmek,plcg,PIP2,PIP3,p44/42,pakts473,PKA,PKC,P38,pjnk
"26,40","13,20","8,82","18,30","58,80","6,61","17,00","414,00","17,00","44,90","40,00"
"35,90","16,50","12,30","16,80","8,13","18,60","32,50","352,00","3,37","16,50","61,50"
"59,40","44,10","14,60","10,20","13,00","14,90","32,50","403,00","11,40","31,90","19,50"
"62,10","51,90","13,60","30,20","10,60","14,30","37,90","692,00","6,49","25,00","91,40"
"75,00","33,40","1,00","31,60","1,00","19,80","27,60","505,00","18,60","31,10","7,64"
"20,40","15,10","7,99","101,00","35,90","9,14","22,90","400,00","11,70","22,70","6,85"
"47,80","19,60","17,50","33,10","82,00","17,90","35,20","956,00","22,50","43,30","20,00"
"59,90","53,30","11,80","77,70","12,90","11,10","37,90","1.407,00","18,80","29,40","16,80"
"46,60","27,10","12,40","109,00","21,90","21,50","38,20","207,00","11,00","31,30","12,00"
"51,90","21,30","49,10","58,80","10,80","58,80","200,00","3.051,00","15,30","39,20","15,70"
'

df1 <- read.csv(textConnection(txt), dec = ",")

i <- sapply(df1, is.character)
df1[i] <- lapply(df1[i], \(x) gsub("\\.", "", x))
df1[i] <- lapply(df1[i], \(x) as.numeric(sub(",", ".", x)))
df1
#>    praf pmek  plcg  PIP2  PIP3 p44.42 pakts473  PKA   PKC  P38  pjnk
#> 1  26.4 13.2  8.82  18.3 58.80   6.61     17.0  414 17.00 44.9 40.00
#> 2  35.9 16.5 12.30  16.8  8.13  18.60     32.5  352  3.37 16.5 61.50
#> 3  59.4 44.1 14.60  10.2 13.00  14.90     32.5  403 11.40 31.9 19.50
#> 4  62.1 51.9 13.60  30.2 10.60  14.30     37.9  692  6.49 25.0 91.40
#> 5  75.0 33.4  1.00  31.6  1.00  19.80     27.6  505 18.60 31.1  7.64
#> 6  20.4 15.1  7.99 101.0 35.90   9.14     22.9  400 11.70 22.7  6.85
#> 7  47.8 19.6 17.50  33.1 82.00  17.90     35.2  956 22.50 43.3 20.00
#> 8  59.9 53.3 11.80  77.7 12.90  11.10     37.9 1407 18.80 29.4 16.80
#> 9  46.6 27.1 12.40 109.0 21.90  21.50     38.2  207 11.00 31.3 12.00
#> 10 51.9 21.3 49.10  58.8 10.80  58.80    200.0 3051 15.30 39.2 15.70

Created on 2022-02-07 by the reprex package (v2.0.1)

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66