
I've been using the ColdFusion function CSVtoArray successfully to programmatically convert CSV files into Excel files that I can then import into my database via CFSpreadsheet. Recently I ran across a data issue that the function chokes on, and I'm not strong enough with regex (which is what I suspect I'll need) to fix it. I'm hoping that someone here can give me an appropriate fix.

The CSV data normally comes in comma-delimited, with each field qualified by double quotes, like this (the ellipses represent additional data before and after the line), and it converts without error:

... ,"Minneapolis","MN","55555","Patient Referral - John Smith", ...

The issue I've run into, though, is when there are extraneous, unescaped quotation marks inside a quoted field. For example, this row will fail and stop further import:

... ,"Minneapolis","MN","55555","Patient Referral - Robert "Bob" Smith", ...

I think that what I need to do is run some kind of regex replace on the quote character, but only where there is not a comma immediately before or after it. I have no idea how to write something like that, though. Any suggestions?
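To illustrate, the kind of replacement I'm imagining would look something like this (sketched in Python here, since ColdFusion's reReplace uses Java-compatible regex, so the same pattern should carry over; note this naive version would also mangle any correctly escaped `""` pairs and any value with a comma right next to an embedded quote):

```python
import re

line = '"Minneapolis","MN","55555","Patient Referral - Robert "Bob" Smith"'

# Double any quote that is not at the start or end of the line and not
# adjacent to a comma; any such quote must be embedded inside a value.
# CAVEAT: this breaks on already-escaped "" pairs and on values that
# contain a comma immediately next to an embedded quote.
fixed = re.sub(r'(?<!^)(?<!,)"(?!,)(?!$)', '""', line)

print(fixed)
# "Minneapolis","MN","55555","Patient Referral - Robert ""Bob"" Smith"
```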

I'll include the full CSVtoArray function code below that I'm using. I'd appreciate any tips/fixes. Thank you!

<cffunction
name="csvToArray"
access="public"
returntype="array"
output="false"
hint="I take a CSV file or CSV data value and convert it to an array of arrays based on the given field delimiter. Line delimiter is assumed to be new line / carriage return related.">
 
<!--- Define arguments. --->
<cfargument
name="file"
type="string"
required="false"
default=""
hint="I am the optional file containing the CSV data."
/>
 
<cfargument
name="csv"
type="string"
required="false"
default=""
hint="I am the CSV text data (if the file argument was not used)."
/>
 
<cfargument
name="delimiter"
type="string"
required="false"
default=","
hint="I am the field delimiter (line delimiter is assumed to be new line / carriage return)."
/>
 
<cfargument
name="trim"
type="boolean"
required="false"
default="true"
hint="I flag whether or not to trim the END of the file for line breaks and carriage returns."
/>
 
<!--- Define the local scope. --->
<cfset var local = {} />
 
<!---
Check to see if we are using a CSV File. If so, then all we
want to do is move the file data into the CSV variable. That
way, the rest of the algorithm can be uniform.
--->
<cfif len( arguments.file )>

<cfset FilePathName="\\ply-fileserver\EPDE\DOWNLOADS\#arguments.file#">

<!--- Read the file into Data. --->
<cfset arguments.csv = fileRead( FilePathName ) />
 
</cfif>
 
<!---
ASSERT: At this point, no matter how the data was passed in,
we now have it in the CSV variable.
--->
 
<!---
Check to see if we need to trim the data. By default, we are
going to pull off any new line and carriage returns that are
at the end of the file (we do NOT want to strip spaces or
tabs as those are field delimiters).
--->
<cfif arguments.trim>
 
<!--- Remove trailing line breaks and carriage returns. --->
<cfset arguments.csv = reReplace(
arguments.csv,
"[\r\n]+$",
"",
"all"
) />
 
</cfif>
 
<!--- Make sure the delimiter is just one character. --->
<cfif (len( arguments.delimiter ) neq 1)>
 
<!--- Set the default delimiter value. --->
<cfset arguments.delimiter = "," />
 
</cfif>
 
 
<!---
Now, let's define the pattern for parsing the CSV data. We
are going to use verbose regular expression since this is a
rather complicated pattern.
 
NOTE: We are using the verbose flag such that we can use
white space in our regex for readability.
--->
<cfsavecontent variable="local.regEx">(?x)
<cfoutput>
 
<!--- Make sure we pick up where we left off. --->
\G
 
<!---
We are going to start off with a field value since
the first thing in our file should be a field (or a
completely empty file).
--->
(?:
 
<!--- Quoted value - GROUP 1 --->
"([^"]*+ (?>""[^"]*+)* )"
 
|
 
<!--- Standard field value - GROUP 2 --->
([^"\#arguments.delimiter#\r\n]*+)
 
)
 
<!--- Delimiter - GROUP 3 --->
(
\#arguments.delimiter# |
\r\n? |
\n |
$
)
 
</cfoutput>
</cfsavecontent>
 
<!---
Create a compiled Java regular expression pattern object
for the expression that will be parsing the CSV.
--->
<cfset local.pattern = createObject(
"java",
"java.util.regex.Pattern"
).compile(
javaCast( "string", local.regEx )
)
/>
 
<!---
Now, get the pattern matcher for our target text (the CSV
data). This allows us to iterate over all the tokens
in the CSV data for individual evaluation.
--->
<cfset local.matcher = local.pattern.matcher(
javaCast( "string", arguments.csv )
) />
 
 
<!---
Create an array to hold the CSV data. We are going to create
an array of arrays in which each nested array represents a
row in the CSV data file. We are going to start off the CSV
data with a single row.
 
NOTE: It is impossible to differentiate an empty dataset from
a dataset that has one empty row. As such, we will always
have at least one row in our result.
--->
<cfset local.csvData = [ [] ] />
 
<!---
Here's where the magic is taking place; we are going to use
the Java pattern matcher to iterate over each of the CSV data
fields using the regular expression we defined above.
 
Each match will have at least the field value and possibly an
optional trailing delimiter.
--->
<cfloop condition="local.matcher.find()">
 
<!---
Next, try to get the qualified field value. If the field
was not qualified, this value will be null.
--->
<cfset local.fieldValue = local.matcher.group(
javaCast( "int", 1 )
) />
 
<!---
Check to see if the value exists in the local scope. If
it doesn't exist, then we want the non-qualified field.
If it does exist, then we want to replace any escaped,
embedded quotes.
--->
<cfif structKeyExists( local, "fieldValue" )>
 
<!---
The qualified field was found. Replace escaped
quotes (two double quotes in a row) with an unescaped
double quote.
--->
<cfset local.fieldValue = replace(
local.fieldValue,
"""""",
"""",
"all"
) />
 
<cfelse>
 
<!---
No qualified field value was found; as such, let's
use the non-qualified field value.
--->
<cfset local.fieldValue = local.matcher.group(
javaCast( "int", 2 )
) />
 
</cfif>
 
<!---
Now that we have our parsed field value, let's add it to
the most recently created CSV row collection.
--->
<cfset arrayAppend(
local.csvData[ arrayLen( local.csvData ) ],
local.fieldValue
) />
 
<!---
Get the delimiter. We know that the delimiter will always
be matched, but in the case that it matched the end of
the CSV string, it will not have a length.
--->
<cfset local.delimiter = local.matcher.group(
javaCast( "int", 3 )
) />
 
<!---
Check to see if we found a delimiter that is not the
field delimiter (end-of-file delimiter will not have
a length). If this is the case, then our delimiter is the
line delimiter. Add a new data array to the CSV
data collection.
--->
<cfif (
len( local.delimiter ) &&
(local.delimiter neq arguments.delimiter)
)>
 
<!--- Start new row data array. --->
<cfset arrayAppend(
local.csvData,
arrayNew( 1 )
) />
 
<!--- Check to see if there is no delimiter length. --->
<cfelseif !len( local.delimiter )>
 
<!---
If our delimiter has no length, it means that we
reached the end of the CSV data. Let's explicitly
break out of the loop otherwise we'll get an extra
empty space.
--->
<cfbreak />
 
</cfif>
 
</cfloop>
 
 
<!---
At this point, our array should contain the parsed contents
of the CSV value as an array of arrays. Return the array.
--->
<cfreturn local.csvData />
</cffunction>
Paul B
    The problem here is not really in `CSVtoArray`, which correctly handles escaped double-quotes. The problem is that whatever is generating the invalid CSV data is broken, because it should be escaping the double quotes within values, from `"` to `""`. Ideally get that fixed, rather than taking on the difficult task of trying to write a parser to parse broken data. – Sev Roberts Jul 09 '20 at 16:43
  • I don't disagree, however, we get these files from multiple external partners so it's unfortunately not usually feasible for us to try and fix their issues. – Paul B Jul 09 '20 at 16:51
  • I did have to do similar once but about 15 years ago - I'd expect anything exporting CSV to be valid by now! I couldn't see a nice regex way to handle it then, I believe my method required that all columns were `"` qualified and was roughly: Replace line's leading+trailing `"` with a reserved chr eg `¬`; replace all escaped `""` with another reserved chr eg `¶`; replace all `","` with another reserved chr, eg `§`; replace any remaining naughty un-escaped `"` also with `¶`; then reverse the replacements. Obviously down to you to choose reserved chrs and ensure they are not present in the data. – Sev Roberts Jul 10 '20 at 11:38
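A rough sketch of the reserved-character method Sev describes (Python for brevity; the placeholder characters are arbitrary control characters and must never occur in the data, and as noted it requires every field to be quote-qualified — it also still falls over on the `"",""` empty-field edge case he mentions):

```python
import re

def repair_csv_line(line, delim=","):
    # Best-effort repair of a line whose embedded quotes are not escaped.
    # Assumes every field is quote-qualified and that the placeholder
    # characters below never appear in the data.
    Q, E, S = "\x01", "\x02", "\x03"
    line = re.sub(r'^"', Q, line)               # protect the leading qualifier
    line = re.sub(r'"$', Q, line)               # protect the trailing qualifier
    line = line.replace('""', E)                # protect already-escaped quotes
    line = line.replace('"' + delim + '"', S)   # protect field boundaries ","
    line = line.replace('"', E)                 # anything left is a stray quote
    # reverse the replacements, turning each stray quote into an escaped ""
    return (line.replace(S, '"' + delim + '"')
                .replace(E, '""')
                .replace(Q, '"'))

print(repair_csv_line('"Minneapolis","MN","Referral - Robert "Bob" Smith"'))
# "Minneapolis","MN","Referral - Robert ""Bob"" Smith"
```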

1 Answer


Tip Number 1

You said: "I've been using the ColdFusion function CSVtoArray successfully to programmatically convert CSV files into Excel files that I can then import into my database via CFSpreadsheet." This sounds like overengineering to me. Have you considered reading the file with <cffile> and then looping through the resulting variable to insert your records?

Tip Number 2

You don't need regex to deal with the quotes. You can use simpler string functions as follows:

// delete leading double quote
row = replace(row, '"', '', 'one');
// delete trailing double quote
row = left(row, len(row) - 1);
// change CSV delimiters to chr(30), the ASCII record separator
row = replace(row, '","', chr(30), 'all');

Using query parameters in your insert query will handle any other double or single quotes.
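Put together, this string-function approach amounts to something like the following (Python for illustration; it inherits the caveats about escaped `""` values raised in the comments):

```python
RS = chr(30)  # ASCII 30, the record-separator control character

row = '"Minneapolis","MN","55555"'
row = row[1:]                 # delete leading double quote
row = row[:-1]                # delete trailing double quote
fields = row.replace('","', RS).split(RS)

print(fields)
# ['Minneapolis', 'MN', '55555']
```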

Dan Bracuk
  • I didn't downvote, but it seems to me that this method would incorrectly handle any correctly-escaped `""` inside text-qualified values by not reducing to `"`, and would fall over and insert a false field separator in the edge-case of a delimiter occurring immediately following and preceding two escaped text-qualifiers, which is not inconceivable if the data contains free text / descriptive prose like `"Patient describes ""shooting"",""burning"",""stabbing"" pains"` – Sev Roberts Jul 10 '20 at 12:12