0

I have a Typescript file that takes a csv file and splits it using the following code:

var cells = rows[i].split(",");

I now need to fix this so that any comma that's inside quotes does not result in a split. For example, The,"quick, brown fox", jumped should split into The, quick, brown fox, and jumped instead of also splitting quick and brown fox. What is the proper way to do this?

Matt123
  • 556
  • 3
  • 15
  • 36
  • 1
    Here is the answer you're looking for: https://stackoverflow.com/a/11457952/11945488 – Anton Bks Aug 20 '19 at 15:14
  • 1
    I'd strongly encourage to use CSV parser if you need to parse CSV. Doing it with regexps is prone corner cases like the one you described. – mrzasa Aug 20 '19 at 15:14
  • @Anton Bks So `var cells = rowsTemp[i].split(/(".*?"|[^",\s]+)(?=\s*,|\s*$)/g);` ? That isn't working, it's including commas. So it's splitting `The, quick` into `The`, `,`, and `quick`. – Matt123 Aug 20 '19 at 15:20
  • I think you need to use str.match if you're opting for that method, however there is various solutions. I suggest checking out all of them and seeing which applies best to your case – Anton Bks Aug 20 '19 at 15:24
  • @Christopher Peisert None of the solutions in that question seem to work for what I'm trying – Matt123 Aug 20 '19 at 15:37

1 Answers1

5

Update:

I think the final version in a line should be:

var cells = (rows[i] + ',').split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/).slice(1).reduce((a, b) => (a.length > 0 && a[a.length - 1].length < 4) ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]] : [...a, [b]], []).map(e => e.reduce((a, b) => a !== undefined ? a : b, undefined))

or put it more beautifully:

var cells = (rows[i] + ',')
  .split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/)
  .slice(1)
  .reduce(
    (a, b) => (a.length > 0 && a[a.length - 1].length < 4)
      ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
      : [...a, [b]],
    [],
  )
  .map(
    e => e.reduce(
      (a, b) => a !== undefined ? a : b, undefined,
    ),
  )
;

This is rather long, but still looks purely functional. Let me explain it:

First, the regular expression part. Basically, a segment you want may fall into 3 possibilities:

  1. *?([^",]+?) *?,, which is a string without " or , surrounded with spaces, followed by a ,.
  2. " *?(.+?)" *?,, which is a string, surrounded with a pair of quotes and an indefinite number of spaces beyond the quotes, followed by a ,.
  3. ( *?),, which is an indefinite number of spaces, followed by a ','.

So splitting by a non-capturing group of a union of these three will basically get us to the answer.

Recall that when splitting with a regular expression, the resulting array consists of:

  1. Strings separated by the separator (the regular expression)
  2. All the capturing groups in the separator

In our case, the separators fill the whole string, so the strings separated are all empty strings, except that last desired part, which is left out because there is no , following it. Thus the resulting array should be like:

  1. An empty string
  2. Three strings, representing the three capturing groups of the first separator matched
  3. An empty string
  4. Three strings, representing the three capturing groups of the second separator matched
  5. ...
  6. An empty string
  7. The last desired part, left alone

So why simply adding a , at the end so that we can get a perfect pattern? This is how (rows[i] + ',') comes about.

In this case the resulting array becomes capturing groups separated by empty strings. Removing the first empty string, they will appear in a group of 4 as [ 1st capturing group, 2nd capturing group, 3rd capturing group, empty string ].

What the reduce block does is exactly grouping them into groups of 4:

  .reduce(
    (a, b) => (a.length > 0 && a[a.length - 1].length < 4)
      ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
      : [...a, [b]],
    [],
  )

And finally, find the first non-undefined elements (an unmatched capturing group will appear as undefined. Our three patterns are exclusive in that any 2 of them cannot be matched simultaneously. So there is exactly 1 such element in each group) in each group which are precisely the desired parts:

  .map(
    e => e.reduce(
      (a, b) => a !== undefined ? a : b, undefined,
    ),
  )

This completes the solution.


I think the following should suffice:

var cells = rows[i].split(/([^",]+?|".+?") *, */).filter(e => e)

or if you don't want the quotes:

var cells = rows[i].split(/(?:([^",]+?)|"(.+?)") *, */).filter(e => e)
Colliot
  • 1,522
  • 3
  • 16
  • 29