I apologize if the below described issue is vague in the sense that there is not a reproducible example, but the problem has to do with the formatting of a .csv file.
I am trying to open several .csv files into R from my laptop using the read.csv function. Current arguments used in the function are:
read.csv(filename, row.names = NULL, stringsAsFactors = F)
row.names is set as NULL because I was getting another error before (duplicated row.names).
When importing some of them, I was getting the following warning:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
I checked the data frame that results from an input with this warning, and I noticed that many rows (in the case I am describing, 10% of the total number of roads, approx) had been replaced by NAs. In the following ascii text generated with dput(), I show the row with NA values, and the rows on top and below of it (the link is to an image of the data frame as it looks).
data <- structure(list(borough_17 = c("MN", "0", "MN"), block_17 = c(1602L, NA, 1602L), lot_17 = c(57L, NA, 1L), cd_17 = c(111L, NA, 111L ), ct2010_17 = c(160.02, NA, 160.02), cb2010_17 = c(1001L, NA, 1001L), schooldist_17 = c(2L, NA, 2L), council_17 = c(4L, NA, 4L), zipcode_17
= c(10029L, NA, 10128L), firecomp_17 = c("E053", "", "E053"), policeprct_17 = c(23, NA, 23), healthcent_17 = c(12L, NA, 12L), healtharea_17 = c(2820L, NA, 2820L), sanitboro_17 = c(1L, NA, 1L), sanitdistr_17 = c(11L, NA, 11L), sanitsub_17 = c("1A", "", "1A"), address_17 = c("1392 MADISON AVENUE", "", "1150 5 AVENUE" ), zonedist1_17 = c("R7-2", "", "R10"), zonedist2_17 = c(NA, "", NA), zonedist3_17 = c(NA_integer_, NA_integer_, NA_integer_ ), zonedist4_17
= c(NA, "", NA), overlay1_17 = c("C1-5", "", NA), overlay2_17 = c(NA, "", NA), spdist1_17 = c(NA, "", "PI" ), spdist2_17 = c(NA, "", NA), spdist3_17 = c(NA_integer_, NA_integer_, NA_integer_), ltdheight_17 = c(NA, NA, NA), splitzone_17 = c("N", "", "N"), bldgclass_17 = c("C7", "", "D4"), landuse_17 = c(4L, NA, 3L), easements_17 = c(0, NA, 0), ownertype_17 = c("P", "", NA), ownername_17 = c("\\1392 ASSOCIATES", "", "1150 FIFTH AVE OWNERS" ), lotarea_17 = c(" LLC,4892,20488,1300,19188,0,1300,0,0,0,0,2,1,6,27,30,50,103.67,50,86,NA,3,Y,5,2,234000,1781100,0,0,1910,1988,0,NA,NA,4.19,3.44,0,6.5,1,1016020057,0,16002,996952,226284,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,303.48084987,5045.80705517,0,0\nMN,1602,40,111,160.02,1000,2,4,10029,E053,23,12,2820,1,11,1Q,68 EAST 97 STREET,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,C1,02,0,NA,MSMC RESIDENTIAL REAL,5046,21288,2767,18521,0,0,0,0,0,2767,2,1,6,30,30,50,100.92,50,88,NA,3,N,5,2,98100,722250,19620,144450,1920,1988,0,NA,NA,4.22,3.44,0,6.5,1,1016020040,0,16002,997361,226090,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,311.056597351,5263.55635194,0,0\nMN,1602,12,111,160.02,1001,2,4,10128,E053,23,12,2820,1,11,1A,15 EAST 96 STREET,R10,NA,NA,NA,NA,NA,PI,NA,NA,NA,N,A7,01,0,P,ESI PROPERTIES L.L.C.,3784,10899,0,10899,0,0,0,0,0,0,2,1,6,1,1,37.5,100.92,37,100,NA,3,N,5,2,99694,312592,0,0,1915,0,0,Expanded Carnegie Hill Historic District,LUCY D. DAHLGREN HOUSE,2.88,10,0,10,1,1016020012,0,16002,996835,226261,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,284.454033558,3945.34293573,0,0\nMN,1602,50,111,160.02,1000,2,4,10029,E053,23,12,2820,1,11,1A,1391 MADISON AVENUE,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,D7,04,0,P,MSMC RESIDENTIAL REAL,10000,44184,8673,35511,0,6000,0,0,0,2673,2,1,6,43,47,100,100,100,91,NA,2,N,3,2,293400,3357000,50792,695697,1920,0,0,NA,NA,4.42,3.44,0,6.5,1,1016020050,0,16002,997119,226224,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,420.450320951,11048.1944701,0,0\nMN,1602,58,111,160.02,1001,2,4,10029,E053,23,12,2820,1,11,1A,1396 MADISON AVENUE,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,C7,04,0,NA,\\1392 ASSOCIATES, LLC", "", "15138"), bldgarea_17 = c("4415", "", "163969"), comarea_17 = c(21072, NA, 1000), resarea_17 = c(1500, NA, 162969), officearea_17 = c(19572, NA, 1000), retailarea_17 = c(0L, NA, 0L), garagearea_17 = c(1500L, NA, 0L), strgearea_17 = c(0L, NA, 0L), factryarea_17 = c(0L, NA, 0L), otherarea_17 = c(0L, NA, 0L), areasource_17 = c(0L, NA, 2L), numbldgs_17 = c("2", "", "1"), numfloors_17 = c(1, NA, 15), unitsres_17 = c("6", "", "74"), unitstotal_17 = c(28L, NA, 77L), lotfront_17 = c(32, NA, 100.92), lotdepth_17 = c(50.92, NA, 150), bldgfront_17 = c(81.42, NA, 100), bldgdepth_17 = c(50, NA, 142), ext_17 = c("73", "", NA), proxcode_17
= c(NA, "", "1" ), irrlotcode_17 = c("2", "", "N"), lottype_17 = c("Y", "", "3" ), bsmtcode_17 = c("3", "", "2"), assessland_17 = c(2, NA, 769500 ), assesstot_17 = c(278100L, NA, 18590400L), exemptland_17
= c(2135700, NA, 9440), exempttot_17 = c(0L, NA, 9440L), yearbuilt_17 = c(0L, NA, 1924L), yearalter1_17 = c(1910L, NA, 1988L), yearalter2_17 = c(0L, NA, 0L), histdist_17 = c("0", "", "Expanded Carnegie Hill Historic District" ), landmark_17 = c(NA, "", NA), builtfar_17 = c(NA, "", "10.83" ), residfar_17 = c(4.77, NA, 10), commfar_17 = c(3.44, NA, 0),
facilfar_17 = c(0, NA, 10), borocode_17 = c(6.5, NA, 1),
bbl_17 = c(1, NA, 1016020001), condono_17 = c(1016020058L,
NA, 0L), tract2010_17 = c(0L, NA, 16002L), xcoord_17 = c(16002L,
NA, 996653L), ycoord_17 = c(996984L, NA, 226361L), zonemap_17 = c("226326",
"", "6b"), zmcode_17 = c("6b", "", NA), sanborn_17 = c(NA,
"", "108S043"), taxmap_17 = c("108S043", "", "10602"), edesignum_17 = c("10602",
"", NA), appbbl_17 = c(NA, "", "0"), appdate_17 = c("0",
"", NA), plutomapid_17 = c(NA, "", "1"), version_17 = c("1",
"", "17V1"), mappluto_f_17 = c("17V1", "", "0"), shape_leng_17 = c("0",
"", "514.691112825"), shape_area_17 = c(285.694168916, NA,
15970.2343838), sfha_07_17 = c("4794.40305456", "", "0"),
sfha_15_17 = c(0, NA, 0)), .Names = c("borough_17", "block_17", "lot_17", "cd_17", "ct2010_17", "cb2010_17", "schooldist_17", "council_17", "zipcode_17", "firecomp_17", "policeprct_17", "healthcent_17", "healtharea_17", "sanitboro_17", "sanitdistr_17", "sanitsub_17", "address_17", "zonedist1_17", "zonedist2_17", "zonedist3_17", "zonedist4_17", "overlay1_17", "overlay2_17", "spdist1_17", "spdist2_17", "spdist3_17", "ltdheight_17", "splitzone_17", "bldgclass_17", "landuse_17", "easements_17", "ownertype_17", "ownername_17", "lotarea_17", "bldgarea_17", "comarea_17", "resarea_17", "officearea_17", "retailarea_17", "garagearea_17", "strgearea_17", "factryarea_17", "otherarea_17", "areasource_17", "numbldgs_17", "numfloors_17", "unitsres_17", "unitstotal_17", "lotfront_17", "lotdepth_17", "bldgfront_17", "bldgdepth_17", "ext_17", "proxcode_17", "irrlotcode_17", "lottype_17", "bsmtcode_17", "assessland_17", "assesstot_17", "exemptland_17", "exempttot_17", "yearbuilt_17", "yearalter1_17", "yearalter2_17", "histdist_17", "landmark_17", "builtfar_17", "residfar_17", "commfar_17", "facilfar_17", "borocode_17", "bbl_17", "condono_17", "tract2010_17", "xcoord_17", "ycoord_17", "zonemap_17", "zmcode_17", "sanborn_17", "taxmap_17", "edesignum_17", "appbbl_17", "appdate_17", "plutomapid_17", "version_17", "mappluto_f_17", "shape_leng_17", "shape_area_17", "sfha_07_17", "sfha_15_17"), row.names = 13:15, class = "data.frame")
I studied my dataframe and found out that something seems to be wrong at the field "lotarea_17". In this field, the first record is a string of many values. commas have not been separated and many columns have remained in a single one (image below shows the dataframe as it looks):
A look at the row on top of the NA-filled row
I have checked in the .csv file, and I have realized that the problem takes place when there is a \ symbol in a specific column of the preceeding row to the one that gets filled with NAs. Probably, this symbol is cancelling one of the " signs, leading to the whole row be taken as a string. Below, I show two images of a .csv file, one of a dataset with the described issue and another one that worked well and had the same record, but written differently. It seems like the coma separated values of the row that gives problems says:
...,"\"1392 ASSOCIATES, LLC",4892,20488,1300,19188,...
While the one that worked shows:
"...,1392 ASSOCIATES, LLC",4892,20488,1300,19188,...
csv with the problem A csv file with the same row, that worked when importing it
Hence, my question is about whether there could be a quick fix, either in R or Excel, (preferably in R at the read.csv function), to automatically get over this issue in which fragment "\" is mssing up with the structure of my data. Manual fix is not an option, since we are talking about more than 200 cases in only one df (and i have a few more with this issue).
Addition - the .csv files were created by writing the data frames of several sf objects. The command line was `write.table(as.data.frame(sf_object), filename, sep = ",", row.names = F. The sf objects were created by spatially joining the shapefiles of MapPLUTO (NYC's land use maps database) and another shapefile (but I know that there is no problem coming from that join because the columns inherited from there are perfect).