128

I have the following type of string

var string = "'string, duppi, du', 23, lala"

I want to split the string into an array on each comma, but only the commas outside the single quotation marks.

I can't figure out the right regular expression for the split...

string.split(/,/)

will give me

["'string", " duppi", " du'", " 23", " lala"]

but the result should be:

["string, duppi, du", "23", "lala"]

Is there a cross-browser solution?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Björn
  • 12,587
  • 12
  • 51
  • 70
  • Is it always single-quotes? Is there ever a single-quote inside a quoted string? If so, how is it escaped (backslash, doubled-up)? – Phrogz Dec 13 '11 at 17:17
  • What if the quote characters are completely interchangeable between double and single quote characters as in JavaScript and HTML/XML code? If so then this requires a more extensive parsing operation that CSV. – austincheney Dec 13 '11 at 17:21
  • actually yes, there could be a single quote inside, escaping with backslash would be fine. – Björn Dec 13 '11 at 17:23
  • Can a value be a double quoted string? – ridgerunner Dec 13 '11 at 17:36
  • Do you need to process empty values? e.g. `var string = "'du',23,,,lala"` – ridgerunner Dec 13 '11 at 19:52
  • Here's a [JavaScript function that parses CSV data, accounting for commas found inside quotes](http://stackoverflow.com/questions/7431268/how-read-data-from-csv-file-using-javascript/22850815#22850815) – curran Apr 03 '14 at 23:28
  • 1
    Papa Parse does a fine job. Parsing a Local CSV File with JavaScript and Papa Parse: http://www.joyofdata.de/blog/parsing-local-csv-file-with-javascript-papa-parse/ – Raffael Aug 11 '14 at 06:10

21 Answers21

256

Disclaimer

2014-12-01 Update: The answer below works only for one very specific format of CSV. As correctly pointed out by DG in the comments, this solution does NOT fit the RFC 4180 definition of CSV and it also does NOT fit MS Excel format. This solution simply demonstrates how one can parse one (non-standard) CSV line of input which contains a mix of string types, where the strings may contain escaped quotes and commas.

A non-standard CSV solution

As austincheney correctly points out, you really need to parse the string from start to finish if you wish to properly handle quoted strings that may contain escaped characters. Also, the OP does not clearly define what a "CSV string" really is. First we must define what constitutes a valid CSV string and its individual values.

Given: "CSV String" Definition

For the purpose of this discussion, a "CSV string" consists of zero or more values, where multiple values are separated by a comma. Each value may consist of:

  1. A double quoted string. (may contain unescaped single quotes.)
  2. A single quoted string. (may contain unescaped double quotes.)
  3. A non-quoted string. (may NOT contain quotes, commas or backslashes.)
  4. An empty value. (An all whitespace value is considered empty.)

Rules/Notes:

  • Quoted values may contain commas.
  • Quoted values may contain escaped-anything, e.g. 'that\'s cool'.
  • Values containing quotes, commas, or backslashes must be quoted.
  • Values containing leading or trailing whitespace must be quoted.
  • The backslash is removed from all: \' in single quoted values.
  • The backslash is removed from all: \" in double quoted values.
  • Non-quoted strings are trimmed of any leading and trailing spaces.
  • The comma separator may have adjacent whitespace (which is ignored).

Find:

A JavaScript function which converts a valid CSV string (as defined above) into an array of string values.

Solution:

The regular expressions used by this solution are complex. And (IMHO) all non-trivial regexes should be presented in free-spacing mode with lots of comments and indentation. Unfortunately, JavaScript does not allow free-spacing mode. Thus, the regular expressions implemented by this solution are first presented in native regex syntax (expressed using Python's handy: r'''...''' raw-multi-line-string syntax).

First here is a regular expression which validates that a CVS string meets the above requirements:

Regex to validate a "CSV string":

re_valid = r"""
# Validate a CSV string having single, double or un-quoted values.
^                                   # Anchor to start of string.
\s*                                 # Allow whitespace before value.
(?:                                 # Group for value alternatives.
  '[^'\\]*(?:\\[\S\s][^'\\]*)*'     # Either Single quoted string,
| "[^"\\]*(?:\\[\S\s][^"\\]*)*"     # or Double quoted string,
| [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*    # or Non-comma, non-quote stuff.
)                                   # End group of value alternatives.
\s*                                 # Allow whitespace after value.
(?:                                 # Zero or more additional values
  ,                                 # Values separated by a comma.
  \s*                               # Allow whitespace before value.
  (?:                               # Group for value alternatives.
    '[^'\\]*(?:\\[\S\s][^'\\]*)*'   # Either Single quoted string,
  | "[^"\\]*(?:\\[\S\s][^"\\]*)*"   # or Double quoted string,
  | [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*  # or Non-comma, non-quote stuff.
  )                                 # End group of value alternatives.
  \s*                               # Allow whitespace after value.
)*                                  # Zero or more additional values
$                                   # Anchor to end of string.
"""

If a string matches the above regex, then that string is a valid CSV string (according to the rules previously stated) and may be parsed using the following regex. The following regex is then used to match one value from the CSV string. It is applied repeatedly until no more matches are found (and all values have been parsed).

Regex to parse one value from valid CSV string:

re_value = r"""
# Match one value in valid CSV string.
(?!\s*$)                            # Don't match empty last value.
\s*                                 # Strip whitespace before value.
(?:                                 # Group for value alternatives.
  '([^'\\]*(?:\\[\S\s][^'\\]*)*)'   # Either $1: Single quoted string,
| "([^"\\]*(?:\\[\S\s][^"\\]*)*)"   # or $2: Double quoted string,
| ([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)  # or $3: Non-comma, non-quote stuff.
)                                   # End group of value alternatives.
\s*                                 # Strip whitespace after value.
(?:,|$)                             # Field ends on comma or EOS.
"""

Note that there is one special case value that this regex does not match - the very last value when that value is empty. This special "empty last value" case is tested for and handled by the js function which follows.

Example input and output:

In the following examples, curly braces are used to delimit the {result strings}. (This is to help visualize leading/trailing spaces and zero-length strings.)

    // Return array of string values, or NULL if CSV string not well formed.
    function CSVtoArray(text) {
        var re_valid = /^\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*(?:,\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*)*$/;
        var re_value = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;
        // Return NULL if input string is not well formed CSV string.
        if (!re_valid.test(text)) return null;
        var a = [];                     // Initialize array to receive values.
        text.replace(re_value, // "Walk" the string using replace with callback.
            function(m0, m1, m2, m3) {
                // Remove backslash from \' in single quoted values.
                if      (m1 !== undefined) a.push(m1.replace(/\\'/g, "'"));
                // Remove backslash from \" in double quoted values.
                else if (m2 !== undefined) a.push(m2.replace(/\\"/g, '"'));
                else if (m3 !== undefined) a.push(m3);
                return ''; // Return empty string.
            });
        // Handle special case of empty last value.
        if (/,\s*$/.test(text)) a.push('');
        return a;
    };

    console.log('Test 1: Test string from original question.');
    console.log(CSVtoArray("'string, duppi, du', 23, lala"));

    console.log('Test 2: Empty CSV string.');
    console.log(CSVtoArray(""));

    console.log('Test 3: CSV string with two empty values.');
    console.log(CSVtoArray(","));

    console.log('Test 4: Double quoted CSV string having single quoted values.');
    console.log(CSVtoArray("'one','two with escaped \' single quote', 'three, with, commas'"));

    console.log('Test 5: Single quoted CSV string having double quoted values.');
    console.log(CSVtoArray('"one","two with escaped \" double quote", "three, with, commas"'));

    console.log('Test 6: CSV string with whitespace in and around empty and non-empty values.');
    console.log(CSVtoArray("   one  ,  'two'  ,  , ' four' ,, 'six ', ' seven ' ,  "));

    console.log('Test 7: Not valid');
    console.log(CSVtoArray("one, that's me!, escaped \, comma"));

Additional notes:

This solution requires that the CSV string be "valid". For example, unquoted values may not contain backslashes or quotes, e.g. the following CSV string is NOT valid:

var invalid1 = "one, that's me!, escaped \, comma"

This is not really a limitation because any sub-string may be represented as either a single or double quoted value. Note also that this solution represents only one possible definition for: "Comma Separated Values".

Edit: 2014-05-19: Added disclaimer. Edit: 2014-12-01: Moved disclaimer to top.

Community
  • 1
  • 1
ridgerunner
  • 33,777
  • 5
  • 57
  • 69
  • How hard would it be to adapt this to specify an alternative separator form (ex semicolon, pipe, etc...)? – Evan Plaice Apr 19 '12 at 17:41
  • 1
    @Evan Plaice Sure you can use any separator. Just replace every comma in my regex with the separator of choice (but the separator cannot be whitespace). – ridgerunner Apr 20 '12 at 04:18
  • @Alan Moore - You are, of course, correct. Although the main question specifically assumes that the input consists of only one line, (in which case there should be no newlines in unquoted values), it would be better to make your recommended change. Unfortunately, I've no time to make the change myself right now. – ridgerunner Jun 02 '16 at 02:45
  • thanks for this answer. question. is there an easy to way to make the delimiter dynamic? i.e. i often use tab or semi-colon delimiters – Zach Smith Oct 06 '17 at 08:41
  • @Zach Smith - Easy to make dynamic? No. You need to replace all instances of the comma with the desired delimiter in both regexes (it appears 5 times in `re_valid`and 3 times in `re_value`). That said, it _could_ be done by building the regexes using the `RegExp()` constructor instead of a `/RegExp literal/`. – ridgerunner Oct 08 '17 at 11:59
  • I have found the library `csv-parse/lib/sync` to achieve what I wanted (i.e. splitting a line that is formatted as a CSV) – Zach Smith Oct 08 '17 at 12:11
  • This solution is not work for this pattern: 'I am - "James, James Bond" ' - will return null – Dimon Feb 11 '19 at 10:22
  • This string returns null from CSVtoArray(text): `var text = '512,"""Fake News"" and the ""Best Way"" to deal with A, B, and C", 1/18/2019,media';`. I would expect 4 elements: `512` and `"Fake News" and the "Best Way" to deal with A, B, and C` and `1/18/2019` and `media`. – Doug Lerner Feb 25 '19 at 08:15
  • @ridgerunner: Thanks a lot of this solution. Though I've one question: How this can be parsed: **'Val, 1', Val 2, "Val, 3", \`Val, 4\`** i,e the comma is inside single quote, double quote and back-quote should be treated as a single string. I know that the regex should be modified to support a back-quote as well but couldn't achieve that. Expected value after parsing **["Val, 1", "Val 2", "Val, 3", "Val, 4"]** – Saurav Kumar Dec 30 '20 at 15:13
  • 1
    I don't think using regexp is the best way to parse csv data. I tried it but some of my data files were parsed as just `null` because they use `"` for quotes and some values contain a single quote `'`. It also doesn't account for line breaks and rows. I know I can expand on the code to include these use-cases but that would result in some unwieldy large expression which becomes extremely time consuming. The answer on this page by @Bachor doesn't use regexp, it's easier to maintain _and_ it's twice as fast. – BdR Aug 23 '21 at 11:17
77

RFC 4180 solution

This does not solve the string in the question since its format is not conforming with RFC 4180; the acceptable encoding is escaping double quote with double quote. The solution below works correctly with CSV files d/l from google spreadsheets.

UPDATE (3/2017)

Parsing single line would be wrong. According to RFC 4180 fields may contain CRLF which will cause any line reader to break the CSV file. Here is an updated version that parses CSV string:

'use strict';

function csvToArray(text) {
    let p = '', row = [''], ret = [row], i = 0, r = 0, s = !0, l;
    for (l of text) {
        if ('"' === l) {
            if (s && l === p) row[i] += l;
            s = !s;
        } else if (',' === l && s) l = row[++i] = '';
        else if ('\n' === l && s) {
            if ('\r' === p) row[i] = row[i].slice(0, -1);
            row = ret[++r] = [l = '']; i = 0;
        } else row[i] += l;
        p = l;
    }
    return ret;
};

let test = '"one","two with escaped """" double quotes""","three, with, commas",four with no quotes,"five with CRLF\r\n"\r\n"2nd line one","two with escaped """" double quotes""","three, with, commas",four with no quotes,"five with CRLF\r\n"';
console.log(csvToArray(test));

OLD ANSWER

(Single line solution)

function CSVtoArray(text) {
    let ret = [''], i = 0, p = '', s = true;
    for (let l in text) {
        l = text[l];
        if ('"' === l) {
            s = !s;
            if ('"' === p) {
                ret[i] += '"';
                l = '-';
            } else if ('' === p)
                l = '-';
        } else if (s && ',' === l)
            l = ret[++i] = '';
        else
            ret[i] += l;
        p = l;
    }
    return ret;
}
let test = '"one","two with escaped """" double quotes""","three, with, commas",four with no quotes,five for fun';
console.log(CSVtoArray(test));

And for the fun, here is how you create CSV from the array:

function arrayToCSV(row) {
    for (let i in row) {
        row[i] = row[i].replace(/"/g, '""');
    }
    return '"' + row.join('","') + '"';
}

let row = [
  "one",
  "two with escaped \" double quote",
  "three, with, commas",
  "four with no quotes (now has)",
  "five for fun"
];
let text = arrayToCSV(row);
console.log(text);
niry
  • 3,238
  • 22
  • 34
12

I liked FakeRainBrigand's answer, however it contains a few problems: It can not handle whitespace between a quote and a comma, and does not support 2 consecutive commas. I tried editing his answer but my edit got rejected by reviewers that apparently did not understand my code. Here is my version of FakeRainBrigand's code. There is also a fiddle: http://jsfiddle.net/xTezm/46/

String.prototype.splitCSV = function() {
        var matches = this.match(/(\s*"[^"]+"\s*|\s*[^,]+|,)(?=,|$)/g);
        for (var n = 0; n < matches.length; ++n) {
            matches[n] = matches[n].trim();
            if (matches[n] == ',') matches[n] = '';
        }
        if (this[0] == ',') matches.unshift("");
        return matches;
}

var string = ',"string, duppi, du" , 23 ,,, "string, duppi, du",dup,"", , lala';
var parsed = string.splitCSV();
alert(parsed.join('|'));
HammerNL
  • 1,689
  • 18
  • 22
  • Thank you, it works for me partially. Does anybody know how to get rid of extra comas, which left after this filter? Here is a text "text,text,,text,,,text", and as a result I have the last coma remaining as a value, so I have three comas as a parsed values. – m.zhelieznov Jan 26 '21 at 17:35
  • @m.zhelieznov you could try running it through another regexp, like: text = text.replace(/,+/,',') – HammerNL Jan 27 '21 at 09:16
  • thank you, I did something similar to what you have suggested. I'm converting .xml to .xlsx using next script: https://stackoverflow.com/a/51094040/3516022 and I simply checking if the value of cell not equal to coma and then add it. – m.zhelieznov Jan 28 '21 at 13:04
7

People seemed to be against RegEx for this. Why?

(\s*'[^']+'|\s*[^,]+)(?=,|$)

Here's the code. I also made a fiddle.

String.prototype.splitCSV = function(sep) {
  var regex = /(\s*'[^']+'|\s*[^,]+)(?=,|$)/g;
  return matches = this.match(regex);    
}

var string = "'string, duppi, du', 23, 'string, duppi, du', lala";

console.log( string.splitCSV()  );
.as-console-wrapper { max-height: 100% !important; top: 0; }
Mister Jojo
  • 20,093
  • 6
  • 21
  • 40
Brigand
  • 84,529
  • 20
  • 165
  • 173
  • 4
    Hmm, your regexp does have some issues: It can not handle whitespace between a quote and a comma, and does not support 2 consecutive commas. I've updated your answer with code that fixes both issues and made a new fiddle: http://jsfiddle.net/xTezm/43/ – HammerNL Nov 02 '16 at 14:48
  • 1
    For some reason my edit to your code was rejected because it would "deviate from the original intend of the post". Very strange!? I just took your code and fixed two problems with it. How does that change the intend of the post!? Anyway... I've simply add a new answer to this question. – HammerNL Nov 18 '16 at 09:03
  • Good question in your answer, @FakeRainBrigand. I for one all for regex, and because of that, I acknowledge that it is the wrong tool for the job. – niry Nov 23 '17 at 03:27
  • 3
    @niry my code here is awful. I promise I've gotten better over the past 6 years :-p – Brigand Nov 23 '17 at 08:36
  • you do nothing for the possible spaces after the csv commas (as in the example data)? – Mister Jojo Mar 13 '21 at 17:00
6

PEG(.js) grammar that handles RFC 4180 examples at http://en.wikipedia.org/wiki/Comma-separated_values:

start
  = [\n\r]* first:line rest:([\n\r]+ data:line { return data; })* [\n\r]* { rest.unshift(first); return rest; }

line
  = first:field rest:("," text:field { return text; })*
    & { return !!first || rest.length; } // ignore blank lines
    { rest.unshift(first); return rest; }

field
  = '"' text:char* '"' { return text.join(''); }
  / text:[^\n\r,]* { return text.join(''); }

char
  = '"' '"' { return '"'; }
  / [^"]

Test at http://jsfiddle.net/knvzk/10 or https://pegjs.org/online.

Download the generated parser at https://gist.github.com/3362830.

Trevor Dixon
  • 23,216
  • 12
  • 72
  • 109
6

I had a very specific use case where I wanted to copy cells from Google Sheets into my web app. Cells could include double-quotes and new-line characters. Using copy and paste, the cells are delimited by a tab characters, and cells with odd data are double quoted. I tried this main solution, the linked article using regexp, and Jquery-CSV, and CSVToArray. http://papaparse.com/ Is the only one that worked out of the box. Copy and paste is seamless with Google Sheets with default auto-detect options.

bjcullinan
  • 71
  • 1
  • 3
  • 1
    This should be ranked much higher, never try to roll your own CSV parser, [it will not work correctly](http://secretgeek.net/csv_trouble) - especially when using regexes. Papaparse is *awesome* -- use it! – cbley Aug 03 '17 at 11:10
4

Adding one more to the list, because I find all of the above not quite "KISS" enough.

This one uses regex to find either commas or newlines while skipping over quoted items. Hopefully this is something noobies can read through on their own. The splitFinder regexp has three things it does (split by a |):

  1. , - finds commas
  2. \r?\n - finds new lines, (potentially with carriage return if the exporter was nice)
  3. "(\\"|[^"])*?" - skips anynthing surrounded in quotes, because commas and newlines don't matter in there. If there is an escaped quote \\" in the quoted item, it will get captured before an end quote can be found.

const splitFinder = /,|\r?\n|"(\\"|[^"])*?"/g;

function csvTo2dArray(parseMe) {
  let currentRow = [];
  const rowsOut = [currentRow];
  let lastIndex = splitFinder.lastIndex = 0;
  
  // add text from lastIndex to before a found newline or comma
  const pushCell = (endIndex) => {
    endIndex = endIndex || parseMe.length;
    const addMe = parseMe.substring(lastIndex, endIndex);
    // remove quotes around the item
    currentRow.push(addMe.replace(/^"|"$/g, ""));
    lastIndex = splitFinder.lastIndex;
  }


  let regexResp;
  // for each regexp match (either comma, newline, or quoted item)
  while (regexResp = splitFinder.exec(parseMe)) {
    const split = regexResp[0];

    // if it's not a quote capture, add an item to the current row
    // (quote captures will be pushed by the newline or comma following)
    if (split.startsWith(`"`) === false) {
      const splitStartIndex = splitFinder.lastIndex - split.length;
      pushCell(splitStartIndex);

      // then start a new row if newline
      const isNewLine = /^\r?\n$/.test(split);
      if (isNewLine) { rowsOut.push(currentRow = []); }
    }
  }
  // make sure to add the trailing text (no commas or newlines after)
  pushCell();
  return rowsOut;
}

const rawCsv = `a,b,c\n"test\r\n","comma, test","\r\n",",",\nsecond,row,ends,with,empty\n"quote\"test"`
const rows = csvTo2dArray(rawCsv);
console.log(rows);
Seph Reed
  • 8,797
  • 11
  • 60
  • 125
  • If i read my file in via fileReader and my result: `Id, Name, Age 1, John Smith, 65 2, Jane Doe, 30` how can i parse based on columns i specify? – bluePearl Apr 17 '19 at 21:56
  • After you get the 2d array, remove the first index (it those are your prop names), then iterate over the rest of the array, creating objects with each of the values as a property. It'll look like this: `[{Id: 1, Name: "John Smith", Age: 65}, {Id: 2, Name: "Jane Doe", Age: 30}]` – Seph Reed Apr 18 '19 at 14:41
4

No regexp, readable, and according to https://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules:

function csv2arr(str: string) {
    let line = ["",];
    const ret = [line,];
    let quote = false;

    for (let i = 0; i < str.length; i++) {
        const cur = str[i];
        const next = str[i + 1];

        if (!quote) {
            const cellIsEmpty = line[line.length - 1].length === 0;
            if (cur === '"' && cellIsEmpty) quote = true;
            else if (cur === ",") line.push("");
            else if (cur === "\r" && next === "\n") { line = ["",]; ret.push(line); i++; }
            else if (cur === "\n" || cur === "\r") { line = ["",]; ret.push(line); }
            else line[line.length - 1] += cur;
        } else {
            if (cur === '"' && next === '"') { line[line.length - 1] += cur; i++; }
            else if (cur === '"') quote = false;
            else line[line.length - 1] += cur;
        }
    }
    return ret;
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Bachor
  • 828
  • 12
  • 17
  • 1
    This should be top answer imho. I tested with a large dataset and this function perfoms about twice as fast compared to the current top answer by @ridgerunner. Plus it returns a 2D array _and_ it doesn't rely on regex which makes it much easier to debug and less finicky about data input. – BdR Aug 23 '21 at 00:05
  • what about supporting a CSV where the first row are the names of the columns? – vsync Feb 28 '22 at 21:14
4

RFC 4180 Solution Using Regular Expression

Regular expressions to the rescue! These few lines of code properly handle quoted fields with embedded commas, quotes, and newlines based on the RFC 4180 standard.

function parseCsv(data, fieldSep, newLine) {
  fieldSep = fieldSep || ',';
  newLine = newLine || '\n';
  const nSep = '\x1D'; const nSepRe = new RegExp(nSep, 'g');
  const qSep = '\x1E'; const qSepRe = new RegExp(qSep, 'g');
  const cSep = '\x1F'; const cSepRe = new RegExp(cSep, 'g');
  const fieldRe = new RegExp('(^|[' + fieldSep + '\\n])"([^"]*(?:""[^"]*)*)"(?=($|[' + fieldSep + '\\n]))', 'g');
  return data
    .replace(/\r/g, '')
    .replace(/\n+$/, '')
    .replace(fieldRe, (match, p1, p2) => {
      return p1 + p2.replace(/\n/g, nSep).replace(/""/g, qSep).replace(/,/g, cSep)
    })
    .split(/\n/)
    .map(line => {
      return line
        .split(fieldSep)
        .map(cell => cell.replace(nSepRe, newLine).replace(qSepRe, '"').replace(cSepRe, ','))
    });
}

const csv = 'A1,B1,C1\n"A ""2""","B, 2","C\n2"';
const separator = ',';      // field separator, default: ','
const newline = ' <br /> '; // newline representation in case a field contains newlines, default: '\n' 
let grid = parseCsv(csv, separator, newline);
// expected: [ [ 'A1', 'B1', 'C1' ], [ 'A "2"', 'B, 2', 'C <br /> 2' ] ]

Notes:

  • The field separator can be configured, such as \t for TSV (tab-separated values)
  • Embedded newlines can be converted to something else, such as <br/> for HTML use
  • This parseCsv function avoids negative/positive lookbehinds, e.g. works also on Safari browsers.

Unless stated elsewhere, you don't need a finite state machine. The regular expression handles RFC 4180 properly thanks to a functional programming approach using temporary replacements/restores, capture groups, and positive lookahead.

Clone/download code at https://github.com/peterthoeny/parse-csv-js

Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

Old Answer Using Lookbehinds

(This does not work on Safari browsers)

function parseCsv(data, fieldSep, newLine) {
    fieldSep = fieldSep || ',';
    newLine = newLine || '\n';
    var nSep = '\x1D';
    var qSep = '\x1E';
    var cSep = '\x1F';
    var nSepRe = new RegExp(nSep, 'g');
    var qSepRe = new RegExp(qSep, 'g');
    var cSepRe = new RegExp(cSep, 'g');
    var fieldRe = new RegExp('(?<=(^|[' + fieldSep + '\\n]))"(|[\\s\\S]+?(?<![^"]"))"(?=($|[' + fieldSep + '\\n]))', 'g');
    var grid = [];
    data.replace(/\r/g, '').replace(/\n+$/, '').replace(fieldRe, function(match, p1, p2) {
        return p2.replace(/\n/g, nSep).replace(/""/g, qSep).replace(/,/g, cSep);
    }).split(/\n/).forEach(function(line) {
        var row = line.split(fieldSep).map(function(cell) {
            return cell.replace(nSepRe, newLine).replace(qSepRe, '"').replace(cSepRe, ',');
        });
        grid.push(row);
    });
    return grid;
}

const csv = 'A1,B1,C1\n"A ""2""","B, 2","C\n2"';
const separator = ',';      // field separator, default: ','
const newline = ' <br /> '; // newline representation in case a field contains newlines, default: '\n' 
var grid = parseCsv(csv, separator, newline);
// expected: [ [ 'A1', 'B1', 'C1' ], [ 'A "2"', 'B, 2', 'C <br /> 2' ] ]
Peter Thoeny
  • 7,379
  • 1
  • 10
  • 20
  • A word of warning: if you need your solution to work in Safari, that browser has a bug and lookbehinds are broken in JS regex. An upcoming release is supposed to fix it but it has been a problem for years. – tobybot Mar 20 '23 at 15:18
  • 1
    Added a new solution that does not use lookbehind, e.g. works on Safari – Peter Thoeny Mar 20 '23 at 21:34
4

I've used regex a number of times, but I always have to relearn it each time, which is frustrating :-)

So Here's a non-regex solution:

function csvRowToArray(row, delimiter = ',', quoteChar = '"'){
    let nStart = 0, nEnd = 0, a=[], nRowLen=row.length, bQuotedValue;
    while (nStart <= nRowLen) {
        bQuotedValue = (row.charAt(nStart) === quoteChar);
        if (bQuotedValue) {
            nStart++;
            nEnd = row.indexOf(quoteChar + delimiter, nStart)
        } else {
            nEnd = row.indexOf(delimiter, nStart)
        }
        if (nEnd < 0) nEnd = nRowLen;
        a.push(row.substring(nStart,nEnd));
        nStart = nEnd + delimiter.length + (bQuotedValue ? 1 : 0)
    }
    return a;
}

How it works:

  1. Pass in the csv string in row.
  2. While the start position of the next value is within the row, do the following:
    • If this value has been quoted, set nEnd to the closing quote.
    • Else if value has NOT been quoted, set nEnd to the next delimiter.
    • Add the value to an array.
    • Set nStart to nEnd plus the length of the delimeter.

Sometimes it's good to write your own small function, rather than use a library. Your own code is going to perform well and use only a small footprint. In addition, you can easily tweak it to suit your own needs.

  • 1
    Thanks @paul ! A small improvement. I replaced `if (nEnd < 0) nEnd = nRowLen` with ```if (nEnd < 0 ) { if (bQuotedValue == true) {nEnd = nRowLen - 1} else {nEnd = nRowLen} }``` to accommodate for rows ending in quotes. I hope that makes sense. – Blueraaga Jan 07 '23 at 13:51
3

I have also faced the same type of problem when I had to parse a CSV file.

The file contains a column address which contains the ',' .

After parsing that CSV file to JSON, I get mismatched mapping of the keys while converting it into a JSON file.

I used Node.js for parsing the file and libraries like baby parse and csvtojson.

Example of file -

address,pincode
foo,baar , 123456

While I was parsing directly without using baby parse in JSON, I was getting:

[{
 address: 'foo',
 pincode: 'baar',
 'field3': '123456'
}]

So I wrote code which removes the comma(,) with any other delimiter with every field:

/*
 csvString(input) = "address, pincode\\nfoo, bar, 123456\\n"
 output = "address, pincode\\nfoo {YOUR DELIMITER} bar, 123455\\n"
*/
const removeComma = function(csvString){
    let delimiter = '|'
    let Baby = require('babyparse')
    let arrRow = Baby.parse(csvString).data;
    /*
      arrRow = [
      [ 'address', 'pincode' ],
      [ 'foo, bar', '123456']
      ]
    */
    return arrRow.map((singleRow, index) => {
        //the data will include
        /*
        singleRow = [ 'address', 'pincode' ]
        */
        return singleRow.map(singleField => {
            //for removing the comma in the feild
            return singleField.split(',').join(delimiter)
        })
    }).reduce((acc, value, key) => {
        acc = acc +(Array.isArray(value) ?
         value.reduce((acc1, val)=> {
            acc1 = acc1+ val + ','
            return acc1
        }, '') : '') + '\n';
        return acc;
    },'')
}

The function returned can be passed into the csvtojson library and thus the result can be used.

const csv = require('csvtojson')

let csvString = "address, pincode\\nfoo, bar, 123456\\n"
let jsonArray = []
modifiedCsvString = removeComma(csvString)
csv()
  .fromString(modifiedCsvString)
  .on('json', json => jsonArray.push(json))
  .on('end', () => {
    /* do any thing with the json Array */
  })

Now you can get the output like:

[{
  address: 'foo, bar',
  pincode: 123456
}]
Glorfindel
  • 21,988
  • 13
  • 81
  • 109
Supermacy
  • 1,429
  • 15
  • 24
3

If you can have your quote delimiter be double quotes, then this is a duplicate of Example JavaScript code to parse CSV data.

You can either translate all single-quotes to double-quotes first:

string = string.replace( /'/g, '"' );

...or you can edit the regex in that question to recognize single-quotes instead of double-quotes:

// Quoted fields.
"(?:'([^']*(?:''[^']*)*)'|" +

However, this assumes certain markup that is not clear from your question. Please clarify what all the various possibilities of markup can be, per my comment on your question.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Phrogz
  • 296,393
  • 112
  • 651
  • 745
2

While reading the CSV file into a string, it contains null values in between strings, so try it with \0 line by line. It works for me.

stringLine = stringLine.replace(/\0/g, "" );
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Sharathi RB
  • 867
  • 2
  • 9
  • 24
2

To complement this answer

If you need to parse quotes escaped with another quote, example:

"some ""value"" that is on xlsx file",123

You can use

function parse(text) {
  const csvExp = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|"([^""]*(?:"[\S\s][^""]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;

  const values = [];

  text.replace(csvExp, (m0, m1, m2, m3, m4) => {
    if (m1 !== undefined) {
      values.push(m1.replace(/\\'/g, "'"));
    }
    else if (m2 !== undefined) {
      values.push(m2.replace(/\\"/g, '"'));
    }
    else if (m3 !== undefined) {
      values.push(m3.replace(/""/g, '"'));
    }
    else if (m4 !== undefined) {
      values.push(m4);
    }
    return '';
  });

  if (/,\s*$/.test(text)) {
    values.push('');
  }

  return values;
}
Community
  • 1
  • 1
BrunoLM
  • 97,872
  • 84
  • 296
  • 452
2

Try this one.

function parseCSV(csv) {
    let quotes = [];
    let token = /(?:(['"`])([\s\S]*?)\1)|([^\t,\r\n]+)\3?|([\r\n])/gm;
    let text = csv.replace(/\\?(['"`])\1?/gm, s => s.length != 2 ? s : `_r#${quotes.push(s) - 1}`);
    return [...text.matchAll(token)]
        .map(t => (t[2] || t[3] || t[4])
            .replace(/^_r#\d+$/, "")
            .replace(/_r#\d+/g, q => quotes[q.replace(/\D+/, '')][1]))
        .reduce((a, b) => /^[\r\n]$/g.test(b)
            ? a.push([]) && a
            : a[a.length - 1].push(b) && a, [[]])
        .filter(d => d.length);
}
user642122
  • 21
  • 2
2

My answer presumes your input is a reflection of code/content from web sources where single and double quote characters are fully interchangeable provided they occur as an non-escaped matching set.

You cannot use regex for this. You actually have to write a micro parser to analyze the string you wish to split. I will, for the sake of this answer, call the quoted parts of your strings as sub-strings. You need to specifically walk across the string. Consider the following case:

var a = "some sample string with \"double quotes\" and 'single quotes' and some craziness like this: \\\" or \\'",
    b = "sample of code from JavaScript with a regex containing a comma /\,/ that should probably be ignored.";

In this case you have absolutely no idea where a sub-string starts or ends by simply analyzing the input for a character pattern. Instead you have to write logic to make decisions on whether a quote character is used a quote character, is itself unquoted, and that the quote character is not following an escape.

I am not going to write that level of complexity of code for you, but you can look at something I recently wrote that has the pattern you need. This code has nothing to do with commas, but is otherwise a valid enough micro-parser for you to follow in writing your own code. Look into the asifix function of the following application:

https://github.com/austincheney/Pretty-Diff/blob/master/fulljsmin.js

austincheney
  • 1,097
  • 7
  • 8
1

Use the npm library csv-string to parse the strings instead of split: https://www.npmjs.com/package/csv-string

This will handle the comma in quotes and empty entries

  • 2
    Can you please elaborate on why this library can helps others and why this is a better answer then the others? The GitHub page for this library has a few open issues. – HardcoreGamer Aug 06 '21 at 03:45
1

This one is based on niry's answer but for semicolon:

'use strict';

function csvToArray(text) {
    let p = '', row = [''], ret = [row], i = 0, r = 0, s = !0, l;
    for (l of text) {
        if ('"' === l) {
            if (s && l === p) row[i] += l;
            s = !s;
        } else if (';' === l && s) l = row[++i] = '';
        else if ('\n' === l && s) {
            if ('\r' === p) row[i] = row[i].slice(0, -1);
            row = ret[++r] = [l = '']; i = 0;
        } else row[i] += l;
        p = l;
    }
    return ret;
};

let test = '"one";"two with escaped """" double quotes""";"three; with; commas";four with no quotes;"five with CRLF\r\n"\r\n"2nd line one";"two with escaped """" double quotes""";"three, with; commas and semicolons";four with no quotes;"five with CRLF\r\n"';

console.log(csvToArray(test));

Olivier de Jonge
  • 1,454
  • 2
  • 15
  • 31
0

Aside from the excellent and complete answer from ridgerunner, I thought of a very simple workaround for when your backend runs PHP.

Add this PHP file to your domain's backend (say: csv.php)

<?php
    session_start(); // Optional
    header("content-type: text/xml");
    header("charset=UTF-8");
    // Set the delimiter and the End of Line character of your CSV content:
    echo json_encode(array_map('str_getcsv', str_getcsv($_POST["csv"], "\n")));
?>

Now add this function to your JavaScript toolkit (should be revised a bit to make crossbrowser I believe).

function csvToArray(csv) {
    var oXhr = new XMLHttpRequest;
    oXhr.addEventListener("readystatechange",
        function () {
            if (this.readyState == 4 && this.status == 200) {
                console.log(this.responseText);
                console.log(JSON.parse(this.responseText));
            }
        }
    );
    oXhr.open("POST","path/to/csv.php",true);
    oXhr.setRequestHeader("Content-type", "application/x-www-form-urlencoded; charset=utf-8");
    oXhr.send("csv=" + encodeURIComponent(csv));
}

It will cost you one Ajax call, but at least you won't duplicate code nor include any external library.

Ref: http://php.net/manual/en/function.str-getcsv.php

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Sebas
  • 21,192
  • 9
  • 55
  • 109
0

You can use papaparse.js like the example below:

<!DOCTYPE html>
<html lang="en">

    <head>
        <title>CSV</title>
    </head>

    <body>
        <input type="file" id="files" multiple="">
        <button onclick="csvGetter()">CSV Getter</button>
        <h3>The Result will be in the Console.</h3>

        <script src="papaparse.min.js"></script>

        <script>
            function csvGetter() {

                var file = document.getElementById('files').files[0];
                Papa.parse(file, {
                    complete: function(results) {
                        console.log(results.data);
                    }
                });
            }
          </script>
    </body>

</html>

Don't forget to include papaparse.js in the same folder.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Tahseen Alaa
  • 342
  • 4
  • 9
-1

According to this blog post, this function should do it:

String.prototype.splitCSV = function(sep) {
  for (var foo = this.split(sep = sep || ","), x = foo.length - 1, tl; x >= 0; x--) {
    if (foo[x].replace(/'\s+$/, "'").charAt(foo[x].length - 1) == "'") {
      if ((tl = foo[x].replace(/^\s+'/, "'")).length > 1 && tl.charAt(0) == "'") {
        foo[x] = foo[x].replace(/^\s*'|'\s*$/g, '').replace(/''/g, "'");
      } else if (x) {
        foo.splice(x - 1, 2, [foo[x - 1], foo[x]].join(sep));
      } else foo = foo.shift().split(sep).concat(foo);
    } else foo[x].replace(/''/g, "'");
  } return foo;
};

You would call it like so:

var string = "'string, duppi, du', 23, lala";
var parsed = string.splitCSV();
alert(parsed.join("|"));

This jsfiddle kind of works, but it looks like some of the elements have spaces before them.

CanSpice
  • 34,814
  • 10
  • 72
  • 86
  • Imagine having to do all that in a regex. This is why regexes aren't really suitable for parsing sometimes. – CanSpice Dec 13 '11 at 18:08
  • This solution simply does not work. Given the original test string: `"'string, duppi, du', 23, lala"`, this function returns: `["'string"," duppi"," du'"," 23"," lala"]` – ridgerunner Dec 13 '11 at 23:09
  • @ridgerunner: Right you are. I've edited the answer and the jsfiddle to fix the function. Basically, I switched `"'"` to `'"'` and vice-versa. – CanSpice Dec 13 '11 at 23:16
  • That helped, but now the function incorrectly handles single quoted CSV strings having double quoted values. e.g. Reversing the quote types of the original test string like so: `'"string, duppi, du", 23, lala'` results in: `['"string',' duppi'.' du"',' 23',' lala']` – ridgerunner Dec 13 '11 at 23:37
  • @CanSpice, your comment inspired me to try with RegEx. It doesn't have quite as many features, but they could be easily added. (My answer is on this page, if you're interested.) – Brigand Dec 14 '11 at 01:38