I'm an urban planner migrating towards spatial data analysis. I am not oblivious to R and programming in general but since I don't have the proper training my skills are limited sometimes.
At the moment I am trying to analyse about 50 CSV files containing financial data concerning public auctions which are from 60000 to 300000 lines long with 39 fields. The files are exports from the Romanian national public auctioning system, which is a form-like platform.
The issue is that some of the lines are broken by CRLF
line endings in the middle of the address fields. I suspect that when people entered their address in the form they copy/pasted it from other files where it was multiline.
The issue cannot be resolved by Find&Replace as this will also replace the correct CRLF
at the end of the line.
As an example the data is formatted something like this and has a CRLF
after each line(They used ^ as the delimiter):
Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15;
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^
In order to properly process the data I would need the CSV to be read like this, by removing only the CRLF
that break lines - which Find&Replace cannot do:
Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1 Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^
I have found a possible solution (Is there a way in R to join broken lines of csv file?), but it required some tweaking to fit my needs. The end result is that the code below hangs and does not reach the end of the process, even on small sample files.
My alteration of the accepted solution code from the above mentioned post:
dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "^", fixed = TRUE)) # extract variable names
nvar <- length(varnames)
k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))
while(k <= length(dat)){
k <- k + 1
if(dat[k] == "") {k <- k + 1
print(paste("data line", k, "is an empty string"))
if(k > length(dat)) {break}
}
temp <- dat[k]
# checks if there are enough commas or if the line was broken
while(length(gregexpr("^", temp)[[1]]) < nvar-1){
k <- k + 1
temp <- paste0(temp, dat[k])
}
temp <- unlist(strsplit(temp, "^"))
message(k)
dat1 <- rbind(dat1, temp)
}
dat1 = dat1[-1,] # delete the empty initial row
Counting fields between delimiters seems like a good solution but I am unable to find a good way to do this and my R programming skills are not enough apparently.
So is there any way to fix this type of broken CSV files in R?
Working files sample can be accessed here: http://data.gv.ro/dataset/4a4903c4-b1e3-46d1-82a5-238287f9496c/resource/c6abc0ef-3efb-4aef-bc0a-411f8cab2a28/download/contracte-2007.csv
Thanks for any help you can give!