0

I'm reading a csv file into R using fread from data.table. The reading process stopped with this error:

fstDF <- fread("dat.csv")
Read 34.5% of 2000004 rowsError in fread("dat.csv") : 
  Expected sep (',') but new line or EOF ends field 25 on line 800747 when reading data: John,,,ID,362526197318501X,M,19730218, ,,F,,CHN,44,4403,,,,,,,13828890538,,, ,M

I checked the data and found that the error was caused by new line character in the field which breaks one line of record into two lines. Just like the line start with Steve in the below sample data:

Name,CardNo,Descriot,CtfTp,CtfId,Gender,Birthday,Address,Zip,Dirty,District1,District2,District3,District4,District5,District
6,FirstNm,LastNm,Duty,Mobile,Tel,Fax,EMail,Nation,Taste,Education,Company,CTel,CAddress,CZip,Family,Version,id              
Mike,,,OTH,010-116321,M,19000101,,100080, ,,CHN,0,0,,,,,,10116,010-82808028,010-82828028-208,chenmeng@dist.
Steve,,,GID,0282,M,19000101,,051430, ,,CHN
,0,0,,,,,,13831193762,0311-88030066,0311-88030088,info@shineway.com,,,,,   
Nicholas,,,OTH,010-125321,F,19000101,,100097,,,CHN,0,0,,,,,,10125,010-88400202,010-88400260,,,,,,,,,,,4
Abrham,,,OTH,010-130321,F,19000101,,100029,,,CHN,0,0,,,,,,10130,010-51292052/3-802,010-51292052/3-811,
Bill,,,OTH,010-142321,F,19000101,,100007,,,CHN,0,0,,,,,,10142,010-67687044,010-67687044,baiguoshouyue@sina       
Zabrina,,,OTH,010-186321,F,19000101,,100101,,,CHN,0,0,,,,,,13942697025,010-64869596/0411-668895950,0411-6688519
Julia,,,OTH,021-044321,M,19000101,,201206,,,CHN,0,0,,,,,,21044,021-28995000*208,021-50315077,jane.dai@parker.com  
Dave,,,OTH,021-127321,M,19000101,,200008,,,CHN,0,0,,,,,,21127,021-55150244,021-55150344,,,,,,,,,,,9     
Cecilia,,,OTH,021-151321,F,19000101,,201108,,,CHN,0,0,,,,,,21151,021-61451188,021-61452602,reception.china@eurotherm.co

This data was exported from Microsoft SQL Server. I can't access to the database, I don't know what's wrong with the exporting process. But surely I know that it is the wrongly new line character caused reading problem.

Here is a similar question on stackoverflow (without a clear solution): Importing csv file to R new line issue

Question:

How to read csv data with new line character?

Community
  • 1
  • 1
Nick
  • 8,451
  • 13
  • 57
  • 106

1 Answers1

1

Step 1: Remove ^M in the middle of line:

perl -pe 's/\r(?!\n)//g'

Ref: How to remove carriage returns in the middle of a line

Step 2: Substitute \n, with , (See @jimmij 's answer below.)

perl -p00e 's/\n,/,/g' 

Ref: https://unix.stackexchange.com/questions/222049/how-to-detect-and-remove-newline-character-within-a-column-in-a-csv-file/222052#222052

Step 3: Read it as usual in fread:

fstDT <- fread("dat.csv")

Read 1999993 rows and 33 (of 33) columns from 0.324 GB file in 00:00:49
Community
  • 1
  • 1
Nick
  • 8,451
  • 13
  • 57
  • 106