0

I'm an urban planner migrating towards spatial data analysis. I am not oblivious to R and programming in general but since I don't have the proper training my skills are limited sometimes.

At the moment I am trying to analyse about 50 CSV files containing financial data concerning public auctions which are from 60000 to 300000 lines long with 39 fields. The files are exports from the Romanian national public auctioning system, which is a form-like platform.

The issue is that some of the lines are broken by CRLF line endings in the middle of the address fields. I suspect that when people entered their address in the form they copy/pasted it from other files where it was multiline.

The issue cannot be resolved by Find&Replace as this will also replace the correct CRLF at the end of the line.

As an example the data is formatted something like this and has a CRLF after each line(They used ^ as the delimiter):

Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; 
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^

In order to properly process the data I would need the CSV to be read like this, by removing only the CRLF that break lines - which Find&Replace cannot do:

Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1 Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^

I have found a possible solution (Is there a way in R to join broken lines of csv file?), but it required some tweaking to fit my needs. The end result is that the code below hangs and does not reach the end of the process, even on small sample files.

My alteration of the accepted solution code from the above mentioned post:

dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "^", fixed = TRUE)) # extract variable names
nvar <- length(varnames)

k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))

while(k <= length(dat)){
  k <- k + 1
  if(dat[k] == "") {k <- k + 1
  print(paste("data line", k, "is an empty string"))
  if(k > length(dat)) {break}
  }
  temp <- dat[k]
  # checks if there are enough commas or if the line was broken
  while(length(gregexpr("^", temp)[[1]]) < nvar-1){
    k <- k + 1
    temp <- paste0(temp, dat[k])
  }
  temp <- unlist(strsplit(temp, "^"))
  message(k)
  dat1 <- rbind(dat1, temp)
}

dat1 = dat1[-1,] # delete the empty initial row    

Counting fields between delimiters seems like a good solution but I am unable to find a good way to do this and my R programming skills are not enough apparently.

So is there any way to fix this type of broken CSV files in R?

Working files sample can be accessed here: http://data.gv.ro/dataset/4a4903c4-b1e3-46d1-82a5-238287f9496c/resource/c6abc0ef-3efb-4aef-bc0a-411f8cab2a28/download/contracte-2007.csv

Thanks for any help you can give!

Community
  • 1
  • 1
AFG
  • 13
  • 4

2 Answers2

1

The trouble seems to be that a ^ is a special character. If you step through your code you will see that you have 627 variables instead of 39. It is making each character a variable. Try this:

dat <- readLines("filename.csv") # read whatever is in there, one line at a time
varnames <- unlist(strsplit(dat[1], "\\^"))  # extract variable names
nvar <- length(varnames)

k <- 1 # setting up a counter
dat1 <- matrix(NA, ncol = nvar, dimnames = list(NULL, varnames))

while(k <= length(dat)){
  k <- k + 1
  #if(dat[k] == "") {k <- k + 1
  #print(paste("data line", k, "is an empty string"))
  if(k > length(dat)) {break}
  #}
  temp <- dat[k]
  # checks if there are enough commas or if the line was broken
  while(length(gregexpr("\\^", temp)[[1]]) < nvar-1){
    k <- k + 1
    temp <- paste0(temp, dat[k])
  }
  temp <- unlist(strsplit(temp, "\\^"))
  message(k)
  dat1 <- rbind(dat1, temp)
}

dat1 = dat1[-1,] # delete the empty initial row    

Sorry missed that difference in your code and mine. You don't want fixed=true. changing it to the above gives you this:

> varnames
 [1] "Castigator"                       "CastigatorCUI"                    "CastigatorTara"                  
 [4] "CastigatorLocalitate"             "CastigatorAdresa"                 "Tip"                             
 [7] "TipContract"                      "TipProcedura"                     "AutoritateContractanta"          
[10] "AutoritateContractantaCUI"        "TipAC"                            "TipActivitateAC"                 
[13] "NumarAnuntAtribuire"              "DataAnuntAtribuire"               "TipIncheiereContract"            
[16] "TipCriteriiAtribuire"             "CuLicitatieElectronica"           "NumarOfertePrimite"              
[19] "Subcontractat"                    "NumarContract"                    "DataContract"                    
[22] "TitluContract"                    "Valoare"                          "Moneda"                          
[25] "ValoareRON"                       "ValoareEUR"                       "CPVCodeID"                       
[28] "CPVCode"                          "NumarAnuntParticipare"            "DataAnuntParticipare"            
[31] "ValoareEstimataParticipare"       "MonedaValoareEstimataParticipare" "FonduriComunitare"               
[34] "TipFinantare"                     "TipLegislatieID"                  "FondEuropean"                    
[37] "ContractPeriodic"                 "DepoziteGarantii"                 "ModalitatiFinantare" 
  • Hi Michael, thanks for your response. The solution you posted does not work for me. I updated the data sample to reflect real data. Could you make a quick test to see if the issue is maybe on my machine? It would be much appreciated! – AFG Jul 31 '16 at 05:38
  • The code you suggested does not separate the headers in my sample data by the "^" delimiter. – AFG Jul 31 '16 at 20:22
  • You are correct, the fixed=true should not be there. See fixed code above and output. – Michael Ciesielczyk Jul 31 '16 at 23:28
  • Hi Michael, this seems to do the trick but if you test it on the sample data above do you get `In rbind(dat1, temp) : number of columns of result is not a multiple of vector length (arg 2)` ?I get this on some rows, not all of them. In sample data I get it for the second one and the last field is filled with the value of the first in the same row. – AFG Aug 01 '16 at 05:02
  • I managed to solve the issue. `strsplit` drops the last field of a row if the field is followed by `CRLF` so in order to compensate I modified the `strpsplit` for `temp`: `temp <- unlist(strsplit(paste(temp, "^"),"\\^"))`. – AFG Aug 02 '16 at 12:15
0

We can determine the last row of each record by checking whether it ends in a numeric field. Then using cumsum we can label the rows from the same record using 1, 2, 3, ... . Finally paste them together.

# test data
Lines <- "Name^FiscCode^Country^Adress^SomeData^
SomeCompany^235356^Romania^Adress1
Adress2^ 565863
SomeCompany^235356^Romania^Adress1^ 565863"

# for real problem use readLines("myfile")[-1]
L <- readLines(textConnection(Lines))[-1]

g <- rev(cumsum(rev(grepl("\\^ *\\d+$", L)))) ##
g <- max(g) - g + 1
L2 <- tapply(L, g, paste, collapse = " ")
read.table(text = L2, sep = "^")

The above works for the data shown in the question but if there are differences in the actual data to what you showed then some modifications may be needed depending on what those differences are.

Note: If there are always four ^ characters in each record try replacing the line marked ## with this:

cnt <- count.fields(textConnection(L), sep = "^") - 1
g <- rev(cumsum(rev(cumsum(cnt) %% 4 == 0)))

Update The question has changed to provide new sample data. Note that the answer posted works with it but of course you need to replace 4 with 38 since the new data has 38 delimiters per record whereas the old data had 4. Also the old data had a header and the new data does not so we have removed those occurrences of -1 used to drop the header. Here is a self contained example that can be copied and pasted into R.

Lines <- "Castigator^CastigatorCUI^CastigatorTara^CastigatorLocalitate^CastigatorAdresa^Tip^TipContract^TipProcedura^AutoritateContractanta^AutoritateContractantaCUI^TipAC^TipActivitateAC^NumarAnuntAtribuire^DataAnuntAtribuire^TipIncheiereContract^TipCriteriiAtribuire^CuLicitatieElectronica^NumarOfertePrimite^Subcontractat^NumarContract^DataContract^TitluContract^Valoare^Moneda^ValoareRON^ValoareEUR^CPVCodeID^CPVCode^NumarAnuntParticipare^DataAnuntParticipare^ValoareEstimataParticipare^MonedaValoareEstimataParticipare^FonduriComunitare^TipFinantare^TipLegislatieID^FondEuropean^ContractPeriodic^DepoziteGarantii^ModalitatiFinantare
S.C. RCTHIA CO S.R.L.^65265644^Romania^Bucharest^DN1
Nr. 1, ^Anunt de atribuire la anunt de participare^Furnizare^Licitatie deschisa^COMPANIA NATIONALA DE TRANSPORT AL ENERGIEI ^R656556^^Electricitate^96594^2007-12-14^Un contract de achizitii publice^Pretul cel mai scazut^^1^^61^2007-11-08 00:00:00.000^Televizoare^304503.95^RON^304503.950000000001^89650.5^45937^323124100-1^344578^2007-10-02^49700.00^RON^^^^^^Nu este cazul;^Surse proprii;
ASOC : SC MNG SRLsi SC AquaiM SA ^56565575;656224^Romania^Ploiesti^Str. Independentei nr.15; 
Str. Carol nr. 45^Anunt de atribuire la anunt de participare^Lucrari^Negociere fara anunt de participare^MUNICIPIUL RAMNICU VALCEA^6562655^Administratie publica locala (municipii, orase, comune), institutie publica in subordonarea/coordonarea administratiei publice locale^Servicii generale ale administratiilor publice^56566^2007-10-10^Un contract de achizitii publice^Pretul cel mai scazut^^1^^65656^2007-09-12^Proiectare si executie lucrari^5665560.00^RON^659966.0^5455222^7140^65689966-2^^^^^^^^^^^"

L <- readLines(textConnection(Lines))

cnt <- count.fields(textConnection(L), sep = "^") - 1   # 38 4 34 4 34
g <- rev(cumsum(rev(cumsum(cnt) %% 38 == 0)))
g <- max(g) - g + 1   # 1 2 2 3 3
L2 <- tapply(L, g, paste, collapse = " ")
DF <- read.table(text = L2, sep = "^")
dim(DF)
## [1]  3 39

The sample data does not contain comment characters (#) or single or double quotes but if it did contain these are parts of their data then adding comment.char = "", quote = "" to the count.fields and read.table calls would be needed.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thank you for your response! Indeed the data is much more varied. The example was just to exemplify the break in the line. Some of the lines miss several attributes and end in and for example end in ^^^^^^^^^, which would have to be empty fields. Also how would the complete rows (without accidental breaks) be handled? The idea I was chasing (inspired by the mentioned post) was to count the fields between delimiters and if the value is smaller than 39 then it would add fields from the following line until it will reach 39 in length. I don't know if I am being clear. – AFG Jul 29 '16 at 13:28
  • Tthe solution can be extended provided you can come up with a condition that is always satisfied by only the last row in each record. For example, if only the last row in each record ends in numeric or an empty field then it would just be a matter of modifying the first `g<-` line in the answer appropriately, i.e. replace the + with a *. – G. Grothendieck Jul 29 '16 at 15:14
  • Also I have added a Note which would work if there is always the same number of fields per record. – G. Grothendieck Jul 29 '16 at 15:24
  • From your note I gather that in order to have R count and add fields until it reaches the number of fields in the header I should have something like this (I have 39 fields): `L <- readLines("testfile.csv")[-1] cnt <- count.fields("testfile.csv", sep = "^") - 1 g <- rev(cumsum(rev(cumsum(cnt) %% 39 == 0))) g <- max(g) - g + 1 L2 <- tapply(L, g, paste, collapse = " ")` The problem with my real data is that I get this error: `Error in tapply(L, g, paste, collapse = " ") : arguments must have same length` – AFG Jul 30 '16 at 11:33
  • If there are 39 fields then there are 38 separators, not 39, so you use 38, ; however, this does assume all rows have 39 fields or less which may or may not be true. You could use `count.fields` to find out how many fields are in each record and if there are not too many bad ones just use the good rows. `cnt <- count.fields("myfile", sep = "^", quote = "")[-1]; sum(cnt == 39); sum(cnt != 39); table(cnt)` – G. Grothendieck Jul 30 '16 at 14:36
  • I believe I might not have been clear in my issue. Some of my rows that should have 39 fields are broken by a `CRLF` (\r\n) and are split into two rows. So not all are 39 fields long. The pattern I could identify from your `count.fields` code above is that the majority of these broken lines are broken in the 5th field, as every line that has 5 fields is followed by another one with 35. What I need to do is to join the two rows into one, and the fifth field on the first row to be joined with the first of the second (the 35 long one) and become one so that the resulting row is 39 fields long. – AFG Jul 30 '16 at 15:35
  • If the sample file provided in the initial question as an example is too large I can provide a smaller sample to test. (?) – AFG Jul 30 '16 at 15:37
  • The code that I posted does handle records split into two or more fields and it does work properly on the data you posted in your question. If it does not read the actual data you have then there are differences between it and what you have described. – G. Grothendieck Jul 30 '16 at 23:57
  • I updated the data sample to reflect the data itself. For the provided sample in the first code block I get the same length argument. The rows are the length after removing the line break by hand as I used your code to verify and it returned 39 for both resulting rows. Like I said, my programming skills are limited and I am definitely missing something here. – AFG Jul 31 '16 at 05:36
  • Note that the code in my answer already works with the new data if you update it appropriately to reflect the changes, i.e. there are now 38 delimiters (39 fields) rather than 4 delimiters (5 fields) per record and there is now no header whereas previously there was one. See section Update at end of answer. – G. Grothendieck Jul 31 '16 at 11:38
  • Your last update clarifies some issues I have been having. I was confused by your comment regarding the `myfile` usage and I was replacing the `textConnection` with the file itself in `cnt`. The code works on my sample data but seems to not work on the large file as `L2` becomes logical for some reason. I suspect `NAs` to be the issue, but in the sample data there are `NAs` as well as blank fields and it works. This seems to be the final hop in resolving my problem. – AFG Jul 31 '16 at 20:24