
I apologize if the issue described below is vague in the sense that there is no reproducible example, but the problem has to do with the formatting of a .csv file.

I am trying to read several .csv files into R on my laptop using the read.csv function. The current arguments used in the call are:

read.csv(filename, row.names = NULL, stringsAsFactors = F) 

row.names is set to NULL because I was previously getting another error (duplicated row.names).

When importing some of them, I was getting the following warning:

Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string

I checked the data frame that results from an import raising this warning, and I noticed that many rows (in the case I am describing, approximately 10% of the total number of rows) had been replaced by NAs. In the following ASCII text generated with dput(), I show the row with NA values and the rows directly above and below it (the link is to an image of the data frame as it looks).

data <- structure(list(borough_17 = c("MN", "0", "MN"), block_17 = c(1602L,  NA, 1602L), lot_17 = c(57L, NA, 1L), cd_17 = c(111L, NA, 111L ), ct2010_17 = c(160.02, NA, 160.02), cb2010_17 = c(1001L, NA,  1001L), schooldist_17 = c(2L, NA, 2L), council_17 = c(4L, NA,  4L), zipcode_17
= c(10029L, NA, 10128L), firecomp_17 = c("E053",  "", "E053"), policeprct_17 = c(23, NA, 23), healthcent_17 = c(12L,  NA, 12L), healtharea_17 = c(2820L, NA, 2820L), sanitboro_17 = c(1L,  NA, 1L), sanitdistr_17 = c(11L, NA, 11L), sanitsub_17 = c("1A",  "", "1A"), address_17 = c("1392 MADISON AVENUE", "", "1150 5 AVENUE" ), zonedist1_17 = c("R7-2", "", "R10"), zonedist2_17 = c(NA,  "", NA), zonedist3_17 = c(NA_integer_, NA_integer_, NA_integer_ ), zonedist4_17
= c(NA, "", NA), overlay1_17 = c("C1-5", "",  NA), overlay2_17 = c(NA, "", NA), spdist1_17 = c(NA, "", "PI" ), spdist2_17 = c(NA, "", NA), spdist3_17 = c(NA_integer_, NA_integer_,  NA_integer_), ltdheight_17 = c(NA, NA, NA), splitzone_17 = c("N",  "", "N"), bldgclass_17 = c("C7", "", "D4"), landuse_17 = c(4L,  NA, 3L), easements_17 = c(0, NA, 0), ownertype_17 = c("P", "",  NA), ownername_17 = c("\\1392 ASSOCIATES", "", "1150 FIFTH AVE OWNERS" ), lotarea_17 = c(" LLC,4892,20488,1300,19188,0,1300,0,0,0,0,2,1,6,27,30,50,103.67,50,86,NA,3,Y,5,2,234000,1781100,0,0,1910,1988,0,NA,NA,4.19,3.44,0,6.5,1,1016020057,0,16002,996952,226284,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,303.48084987,5045.80705517,0,0\nMN,1602,40,111,160.02,1000,2,4,10029,E053,23,12,2820,1,11,1Q,68 EAST 97 STREET,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,C1,02,0,NA,MSMC RESIDENTIAL REAL,5046,21288,2767,18521,0,0,0,0,0,2767,2,1,6,30,30,50,100.92,50,88,NA,3,N,5,2,98100,722250,19620,144450,1920,1988,0,NA,NA,4.22,3.44,0,6.5,1,1016020040,0,16002,997361,226090,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,311.056597351,5263.55635194,0,0\nMN,1602,12,111,160.02,1001,2,4,10128,E053,23,12,2820,1,11,1A,15 EAST 96 STREET,R10,NA,NA,NA,NA,NA,PI,NA,NA,NA,N,A7,01,0,P,ESI PROPERTIES L.L.C.,3784,10899,0,10899,0,0,0,0,0,0,2,1,6,1,1,37.5,100.92,37,100,NA,3,N,5,2,99694,312592,0,0,1915,0,0,Expanded Carnegie Hill Historic District,LUCY D. DAHLGREN HOUSE,2.88,10,0,10,1,1016020012,0,16002,996835,226261,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,284.454033558,3945.34293573,0,0\nMN,1602,50,111,160.02,1000,2,4,10029,E053,23,12,2820,1,11,1A,1391 MADISON AVENUE,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,D7,04,0,P,MSMC RESIDENTIAL REAL,10000,44184,8673,35511,0,6000,0,0,0,2673,2,1,6,43,47,100,100,100,91,NA,2,N,3,2,293400,3357000,50792,695697,1920,0,0,NA,NA,4.42,3.44,0,6.5,1,1016020050,0,16002,997119,226224,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,420.450320951,11048.1944701,0,0\nMN,1602,58,111,160.02,1001,2,4,10029,E053,23,12,2820,1,11,1A,1396 MADISON AVENUE,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,C7,04,0,NA,\\1392 ASSOCIATES, LLC",  "", "15138"), bldgarea_17 = c("4415", "", "163969"), comarea_17 = c(21072,  NA, 1000), resarea_17 = c(1500, NA, 162969), officearea_17 = c(19572,  NA, 1000), retailarea_17 = c(0L, NA, 0L), garagearea_17 = c(1500L,  NA, 0L), strgearea_17 = c(0L, NA, 0L), factryarea_17 = c(0L,  NA, 0L), otherarea_17 = c(0L, NA, 0L), areasource_17 = c(0L,  NA, 2L), numbldgs_17 = c("2", "", "1"), numfloors_17 = c(1, NA,  15), unitsres_17 = c("6", "", "74"), unitstotal_17 = c(28L, NA,  77L), lotfront_17 = c(32, NA, 100.92), lotdepth_17 = c(50.92,  NA, 150), bldgfront_17 = c(81.42, NA, 100), bldgdepth_17 = c(50,  NA, 142), ext_17 = c("73", "", NA), proxcode_17
= c(NA, "", "1" ), irrlotcode_17 = c("2", "", "N"), lottype_17 = c("Y", "", "3" ), bsmtcode_17 = c("3", "", "2"), assessland_17 = c(2, NA, 769500 ), assesstot_17 = c(278100L, NA, 18590400L), exemptland_17
= c(2135700,  NA, 9440), exempttot_17 = c(0L, NA, 9440L), yearbuilt_17 = c(0L,  NA, 1924L), yearalter1_17 = c(1910L, NA, 1988L), yearalter2_17 = c(0L,  NA, 0L), histdist_17 = c("0", "", "Expanded Carnegie Hill Historic District" ), landmark_17 = c(NA, "", NA), builtfar_17 = c(NA, "", "10.83" ), residfar_17 = c(4.77, NA, 10), commfar_17 = c(3.44, NA, 0), 
    facilfar_17 = c(0, NA, 10), borocode_17 = c(6.5, NA, 1), 
    bbl_17 = c(1, NA, 1016020001), condono_17 = c(1016020058L, 
    NA, 0L), tract2010_17 = c(0L, NA, 16002L), xcoord_17 = c(16002L, 
    NA, 996653L), ycoord_17 = c(996984L, NA, 226361L), zonemap_17 = c("226326", 
    "", "6b"), zmcode_17 = c("6b", "", NA), sanborn_17 = c(NA, 
    "", "108S043"), taxmap_17 = c("108S043", "", "10602"), edesignum_17 = c("10602", 
    "", NA), appbbl_17 = c(NA, "", "0"), appdate_17 = c("0", 
    "", NA), plutomapid_17 = c(NA, "", "1"), version_17 = c("1", 
    "", "17V1"), mappluto_f_17 = c("17V1", "", "0"), shape_leng_17 = c("0", 
    "", "514.691112825"), shape_area_17 = c(285.694168916, NA, 
    15970.2343838), sfha_07_17 = c("4794.40305456", "", "0"), 
    sfha_15_17 = c(0, NA, 0)), .Names = c("borough_17", "block_17",  "lot_17", "cd_17", "ct2010_17", "cb2010_17", "schooldist_17",  "council_17", "zipcode_17", "firecomp_17", "policeprct_17", "healthcent_17",  "healtharea_17", "sanitboro_17", "sanitdistr_17", "sanitsub_17",  "address_17", "zonedist1_17", "zonedist2_17", "zonedist3_17",  "zonedist4_17", "overlay1_17", "overlay2_17", "spdist1_17", "spdist2_17",  "spdist3_17", "ltdheight_17", "splitzone_17", "bldgclass_17",  "landuse_17", "easements_17", "ownertype_17", "ownername_17",  "lotarea_17", "bldgarea_17", "comarea_17", "resarea_17", "officearea_17",  "retailarea_17", "garagearea_17", "strgearea_17", "factryarea_17",  "otherarea_17", "areasource_17", "numbldgs_17", "numfloors_17",  "unitsres_17", "unitstotal_17", "lotfront_17", "lotdepth_17",  "bldgfront_17", "bldgdepth_17", "ext_17", "proxcode_17", "irrlotcode_17",  "lottype_17", "bsmtcode_17", "assessland_17", "assesstot_17",  "exemptland_17", "exempttot_17", "yearbuilt_17", "yearalter1_17",  "yearalter2_17", "histdist_17", "landmark_17", "builtfar_17",  "residfar_17", "commfar_17", "facilfar_17", "borocode_17", "bbl_17",  "condono_17", "tract2010_17", "xcoord_17", "ycoord_17", "zonemap_17",  "zmcode_17", "sanborn_17", "taxmap_17", "edesignum_17", "appbbl_17",  "appdate_17", "plutomapid_17", "version_17", "mappluto_f_17",  "shape_leng_17", "shape_area_17", "sfha_07_17", "sfha_15_17"), row.names = 13:15, class = "data.frame")

Row with NA values
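For reference, a rough way to count how many rows came back mangled like this (here `df` stands for the imported data frame, and the 90% cut-off is an arbitrary choice):

# rows in which almost every field is NA or an empty string
bad_rows <- which(rowSums(is.na(df) | df == "") > 0.9 * ncol(df))
length(bad_rows)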

I studied my data frame and found out that something seems to be wrong in the field "lotarea_17". In this field, the first record is a string of many values: the commas were not treated as separators, so many columns have been collapsed into a single one (the image below shows the data frame as it looks):

A look at the row on top of the NA-filled row

I have checked the .csv file and realized that the problem takes place when there is a \ symbol in a specific column of the row preceding the one that gets filled with NAs. Probably this symbol is cancelling one of the " signs, leading to the whole row being taken as a single string. Below, I show two images of a .csv file, one of a dataset with the described issue and another one that worked well and had the same record, but written differently. The comma-separated values of the row that gives problems read:

...,"\"1392 ASSOCIATES, LLC",4892,20488,1300,19188,...

While the one that worked shows:

"...,1392 ASSOCIATES, LLC",4892,20488,1300,19188,...

Csv with the problem
A csv file with the same row, that worked when importing it
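To make the mechanism concrete, here is a minimal sketch (the two-column layout is invented for illustration; it is not my real file). Because the backslash is not a valid csv escape, the line ends up with an odd number of effective quote characters, so scan() is still inside a quoted string when the file ends and everything after the bad field gets swallowed into it:

# invented two-column example that should reproduce the warning
bad <- c('owner,lotarea',
         '"\\"1392 ASSOCIATES, LLC",4892',
         '"NEXT OWNER",100')
read.csv(text = bad, stringsAsFactors = FALSE)
# Warning: EOF within quoted string
# the second data row ends up inside the lotarea field of the first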

Hence, my question is whether there could be a quick fix, either in R or Excel (preferably in R, at the read.csv call), to automatically get over this issue in which the "\" fragment is messing up the structure of my data. A manual fix is not an option, since we are talking about more than 200 cases in a single data frame (and I have a few more files with this issue).

Addition - the .csv files were created by writing the data frames of several sf objects. The command line was `write.table(as.data.frame(sf_object), filename, sep = ",", row.names = F)`. The sf objects were created by spatially joining the shapefiles of MapPLUTO (NYC's land use maps database) and another shapefile (but I know that there is no problem coming from that join, because the columns inherited from it are perfect).
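That write.table call is probably where the \" comes from: write.table's default `qmethod = "escape"` writes an embedded double quote as \", which read.csv does not understand. If regenerating the files is an option, something along these lines should avoid the issue (sf_object and filename stand for the objects used above):

# qmethod = "double" writes an embedded quote as "", the csv-standard escape
write.table(as.data.frame(sf_object), filename, sep = ",",
            row.names = FALSE, qmethod = "double")
# or write.csv, which uses qmethod = "double" by default
write.csv(as.data.frame(sf_object), filename, row.names = FALSE)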

  • Please provide example data as plain text using _e.g._ `head()` or `dput()`, not as images. – neilfws Dec 13 '18 at 02:16
  • Hi @neilfws. How do you mean I should use head() to provide example data? I will be happy to do so. Feel free to directly reach me if you think it will be easier. – Pablo Herreros Cantis Dec 13 '18 at 02:32
  • Your question doesn't specify what function you are using to read the csv file making it hard to answer. Also please don't expect people to follow links to unknown places but instead post some actual example data. – Elin Dec 13 '18 at 02:38
  • [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for advice on making your problem reproducible. We cannot copy/paste data from images. – neilfws Dec 13 '18 at 02:39
  • Thank you @neilfws. I will check on that straight away – Pablo Herreros Cantis Dec 13 '18 at 02:40
  • @Elin just for the record - links are to Imgur and are generated by the stackoverflow platform when I try to add an image myself, because my low rep points don't allow me to embed images in the text. Cheers! – Pablo Herreros Cantis Dec 13 '18 at 02:42
  • @neilfws How about now? Thank you for the links and the guidance. I also used the example of a dput() outcome here to see what was expected. https://stackoverflow.com/questions/24594981/getting-the-error-level-sets-of-factors-are-different-when-running-a-for-loop – Pablo Herreros Cantis Dec 13 '18 at 03:04
  • If you have a graph that's fine. In the meantime, how are you creating the csv file? – Elin Dec 13 '18 at 03:07
  • Hi @Elin. So the .csv file was saved as such by using the data frame of an sf object. To put you in context, I have 16 shapefiles that correspond to 16 different years of a land use cadaster of NYC (if you wanna nerd out about it, since I saw you're from the area, check MapPLUTO). Some years were saved fine and when I open the .csv back up I have no trouble. However, some years do give me the problem. Since the data frames were constructed in loops and with functions, I am figuring that the error comes from the original data. – Pablo Herreros Cantis Dec 13 '18 at 03:09
  • If you add `allowEscapes=TRUE` in the `read.csv`-call, does that make a difference? – Emil Bode Dec 13 '18 at 03:16
  • Hi @EmilBode! It does make a difference - but not a good one. After running read.csv with that argument, the loaded DF has the issue that, instead of a row being filled with NAs, every **cell** after the one containing the _"\"1392 ASSOCIATES, LLC"_ record is filled with what looks to be a whole row or more of data. Let me know if I made this clear - feels a bit sloppy. – Pablo Herreros Cantis Dec 13 '18 at 03:41
  • Okay sf files are going to generate a csv that's different than if you did it from an excel file. That probably explains why you are seeing a csv that looks so unusual (at least to me). For example sometimes there seem to be quotation marks that make sense but other times not. Are you using the csv files from the mapPLUTO page? Let's add a tag for sf because maybe we will get someone who knows how to work with shape files. – Elin Dec 13 '18 at 04:00
  • @Elin - The .csv files are generated by using the MapPLUTO shapefiles in R as sf objects to perform a spatial join with another shapefile (but I know that there is no problem coming from that join, because the columns inherited from it are perfect). I have stumbled across this answer and it looks super promising (a quick test has shown that it probably works). I will test it in depth tomorrow and share the results! Thanks for helping out. https://stackoverflow.com/questions/30142183/import-of-csv-data-ignoring-the-separator – Pablo Herreros Cantis Dec 13 '18 at 04:06
  • https://r-spatial.github.io/sf/reference/st_write.html#examples are you using st_write()? Ok good! – Elin Dec 13 '18 at 04:09
  • I was facing the same issue, so I tried using fread from data.table and it worked fine – Hunaidkhan Dec 13 '18 at 04:14
  • @Elin since once I did the join I didn't need the geometries anymore, I used `write.table(as.data.frame(sf_object), filename, sep = ",", row.names = F)` – Pablo Herreros Cantis Dec 13 '18 at 04:19
  • So this is what you need to **add to the question**, how your file was created. When I first read this question I made a small example file on the assumption that you had a simple csv file with no list columns. If you leave out all the details people are going to answer different questions based on their reading of your question. – Elin Dec 13 '18 at 04:52
  • Noted @Elin. Thanks for the feedback (and patience). Will add that to the question later on today. – Pablo Herreros Cantis Dec 13 '18 at 12:44

1 Answer


It looks like the .csv files you are trying to read don't comply with the official CSV standard (RFC 4180), so I think it's easiest if we brute-force our way through it.

We can use readLines to read the raw data, gsub to edit the non-compliant parts, and then read.csv(text = ...) to transform the result into a data.frame. I'm not 100 % sure it will work on the first try, but this way it's easier to debug. This is what works for a small example:

# read the file as plain text, one string per line
raw <- readLines(filename)
# replace the non-standard \" escape with the csv-standard "" escape
raw <- gsub('\\\\"', '""', raw)
# parse the repaired text as csv
result <- read.csv(text = raw)
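If the same repair has to be applied to all the affected files, wrapping the snippet above into a small helper could look like this (the folder path is a placeholder):

read_fixed_csv <- function(filename) {
  raw <- readLines(filename)
  raw <- gsub('\\\\"', '""', raw)  # turn \" into the standard "" escape
  read.csv(text = raw, stringsAsFactors = FALSE)
}

files <- list.files("path/to/csv/folder", pattern = "\\.csv$", full.names = TRUE)
all_data <- lapply(files, read_fixed_csv)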
Emil Bode
  • Thanks Emil! It works like a charm. Could you explain why you used so many \ characters? I imagine they interact with each other? Thank you! – Pablo Herreros Cantis Dec 13 '18 at 15:35
  • The regular expression is essentially parsed _twice_, both by languages that interpret the backslash as an escape. We want to look for "\"", but the regular-expression engine interprets something like this as an attempt to escape the double quote, just like "\n" means something special (newline). To look for a literal backslash, we need a double backslash. Then again, the R parser does something similar, so to tell it we want to pass on a literal double backslash, we need to escape both backslashes, so we get 4 in total. – Emil Bode Dec 13 '18 at 15:53
  • Ah! I see! From what I understand, `gsub('\\\\"', '""', raw)` will find `\"` within the string and replace it with an empty string, right? Majestic :) – Pablo Herreros Cantis Dec 13 '18 at 16:17
  • Correctemundo! (maybe not an English word, but you get it) – Emil Bode Dec 13 '18 at 16:20
  • Sorry, not entirely... It will not replace it with an empty string, but with two double quotes, which is the official way to escape a double quote in a csv file. So then `read.csv` understands it – Emil Bode Dec 13 '18 at 16:32
  • Fair enough. Greetings from a Wageningen student! (saw you are in KNAW) – Pablo Herreros Cantis Dec 13 '18 at 19:03