13

I am in the middle of parsing in a large amount of csv data. The data is rather "dirty" in that I have inconsistent delimiters, spurious characters and format issues that cause problems for read_csv().

My problem here, however, is not the dirtiness of the data, but just trying to understand the parsing errors that read_csv() is giving me. If I can better understand the error messages, I can then do some janitorial work to fix the problem with scripts. The size of the data makes a manual approach intractable.

Here's a minimal example. Suppose I have a csv file like this:

"col_a","col_b","col_c"
"1","a quick","10"
"2","a quick "brown" fox","20"
"3","quick, brown fox","30"

Note that there's spurious quotes around "brown" in the 2nd row. This content goes into a file called "my_data.csv".

When I try to read that file, I get some parsing failures.

> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col     row   col           expected actual            file expected   <int> <chr>              <chr>  <chr>           <chr> actual 1     2 col_b delimiter or quote      b './my_data.csv' file 2     2 col_b delimiter or quote        './my_data.csv'

As you can see, the parsing failure has not been "pretty printed". It is ONE LONG LINE of 271 characters.

I can't figure out where to even put linebreaks in the failure message to see where the problem is and what the message is trying to tell me. Moreover, it refers to a "2x5 tibble". What tibble? My data frame is 3x3.

Can someone show me how to format or put linebreaks in the message from read_csv() so I can see how it is detecting the problem?

Yes, I know what the problem is in this particular minimal example. In my actual data I am dealing with large amounts of csv (~1M rows), peppered with inconsistencies that shower me with hundreds of parsing failures. I'd like to setup a workflow for categorizing these and dealing with them programmatically. The first step, I think, is just understanding how to "parse" the parsing failure message.

Angelo
  • 2,936
  • 5
  • 29
  • 44
  • 2
    Try it with `read.csv` rather than `read_csv`. Your example gives a more sensible answer with the former, so it might work better on your full dataset. – Andrew Gustar Oct 16 '17 at 20:55
  • 1
    If your file isn't a proper CSV file, then `read_csv` isn't going to be that helpful. Just like with code, the error messages you get doesn't always address what's *really* wrong. Maybe try a [CSV validator](https://stackoverflow.com/questions/6738996/are-there-known-services-to-validate-a-csv-file). – MrFlick Oct 16 '17 at 21:00
  • @MrFlick, thanks for the tip! I very well might need to do that when dealing with this quantity of data. I was hoping to be able to understand the message from read_csv(), however. – Angelo Oct 16 '17 at 21:04
  • 1
    If this is specific to `read_csv`, then you should remove references to `read.csv` because that is confusing (they are two completely different functions). The latter of which is from the `readr` package which is part of the "tidyverse" in which [tibbles](http://tibble.tidyverse.org/) replace data.frames. If you have imbalanced quotes, all the newlines are included in the last open quote so one value may take up many, many lines in your file. There's no way for R to know where to break that up. – MrFlick Oct 16 '17 at 21:12
  • @MrFlick, yes, changed everything to read_csv(). I am working in readr. – Angelo Oct 16 '17 at 21:18
  • 2
    At worst you can use `readLines` to read each line as a whole, then you can try to detect lines with problems and clean them up. – Gregor Thomas Oct 16 '17 at 21:18
  • 1
    Had similar problems a while back and I'd go with @Gregor. You could also try using "sed" which is a bash csv utility with data.table. See: https://stackoverflow.com/questions/22229109/r-data-table-fread-command-how-to-read-large-files-with-irregular-separators – detroyejr Oct 16 '17 at 21:54
  • While I can't decode the error message format, it's caused by the use of quotes `"` within quotes. The easy fix is to remove all quotes, or replace the inner quotes with `'`. Look at the 7th argument in ?read_csv. – Brian Oct 16 '17 at 21:54

1 Answers1

17

After taking a breath and looking at the actual documentation, I see there is a way to get the parsing failures from read_csv() in a form that is very usable.

All I had to do to get the parsing failures was to use problems().

> library(tidyverse)
> df <- read_csv("./my_data.csv", col_types = cols(.default = "c"))
Warning: 2 parsing failures.
row # A tibble: 2 x 5 col     row   col           expected actual            file expected   <int> <chr>              <chr>  <chr>           <chr> actual 1     2 col_b delimiter or quote      b './my_data.csv' file 2     2 col_b delimiter or quote        './my_data.csv'

> parsing_failures <- problems(df)
> parsing_failures
# A tibble: 2 x 5
    row   col           expected actual            file
  <int> <chr>              <chr>  <chr>           <chr>
1     2 col_b delimiter or quote      b './my_data.csv'
2     2 col_b delimiter or quote        './my_data.csv'

Apparently read_csv() associates a tibble containing parsing failure details and this is accessible by passing the result from read_csv to problems().

Angelo
  • 2,936
  • 5
  • 29
  • 44