3

I have database outputs like the following:

$row = '(one,"pika chu",,"")'

If I send this string as parameter to str_getcsv it will output ['one', 'pika chu', '', '']. The third element despite being absent has been turned into an empty string. This is very annoying since I must recognize empty values (no values) from empty strings. The output I would expect is ['one', 'pika chu', null, ''].

The inputs I get are from a PostgreSQL database and are represented as composite values.

By example, if a table is pokemon_id => int4, name => text then a query will output strings like '(1, "pika chu")'. A unique constraint on the name field by example will allow the following two records to exist: (100, '') and (101, null).

When fetched, they are formatted as raw values like: '98,whatever' '99,"pika chu"' '100,""' '101,' '102,","'

I need to read those strings and this example must output the following arrays: ['98', 'whatever'] ['99', 'pika chu'] ['100', ''] ['101', null] ['102', ',']

Is there a way to do that in PHP?

Update 1: @deceze kindly sent me this link stating there are no NULLs in CSV (TL;DR because there were no nulls in XML basically, this problem has been tackled since then.) How to parse CSV with NULLs then?

Update 2: I had propositions to create a dedicated parser in PHP using preg_match_* functions. I am a bit reluctant to go that way because 1) of the performance impact compared to str_getcsv and 2) the fact preg_match used to segfault if the string passed was over 8kb (which can happen in a CSV context).

Update 3: I looked at str_getcsvsource code to see if it was possible to propose a patch to add parsing options like it is in some other languages. I now understand PHP’s underlying philosophy better. @daniel-vérité raised the idea to implement a state machine to parse CSV strings. Even though input can have thousands of lines that weight dozens of kilobytes with embedded CSV structures, it might be the best way.

Thank you for your help.

Community
  • 1
  • 1
greg
  • 3,354
  • 1
  • 24
  • 35
  • 2
    CSV doesn't have any type information, there isn't even any distinction between a string and a number. No value and an empty string mean the same thing in CSV. If you want to differentiate those, you'll need to write your own CSV parser. – deceze Feb 20 '17 at 14:38
  • 1
    Or assign special values instead of nulls (-1 for example) before sending them as a parameter – Milan Markovic Feb 20 '17 at 14:41
  • It is not about types, it’s just about value and no value. CSV inputs are always text by definition. – greg Feb 20 '17 at 14:41
  • Thanks @MilanMarkovic I will keep this idea in mind. – greg Feb 20 '17 at 14:43
  • 1
    It *is* about types. "No value" and "empty string" are indistinguishable in CSV because the only type it has are strings. Since "no value" isn't a representable type in CSV you cannot differentiate it from "empty string". – deceze Feb 20 '17 at 14:46
  • `SELECT null::text; -- <-- returns NULL` no value typed as text. I did actually represent the absence of value in my CSV input in my question therefor it *is* possible to distinguish an empty string from a missing value in a CSV string. It may not be possible in PHP but do not tell me it is not possible because it is a string. A missing (or unknown) string is not an empty string. – greg Feb 20 '17 at 14:55
  • Addendum to my comment above: `select NULL::text = ''::text; -- returns false` Using a string as input for composite types uses the same rule `insert into pika (chu) values ('(1,,"")','(2,"",)')` when the field chu is a composite type. This is the kind of output I need to turn into PHP arrays. – greg Feb 20 '17 at 15:15
  • Comparing an SQL dialect to CSV isn't very productive. In CSV, `foo,bar` and `"foo","bar"` are exactly equivalent. The quotes don't mean anything, they're only used to disambiguate delimiters. `foo` is a string, and `"foo"` is a string too. Why then should `,` *not* be a string (two empty strings to be exact)? – deceze Feb 20 '17 at 15:21
  • Since quotes are used to disambiguate delimiters, `",","",` should output `[',', '', null]` (3 elements). I have updated my question based on your comments so it will hopefully make more sense. – greg Feb 20 '17 at 15:31
  • 1
    It "should" only if you redefine how CSV works and/or specify your own flavour of CSV. Standard CSV *has no `null`*. See http://www.garretwilson.com/blog/2009/04/23/csvnull.xhtml, http://stackoverflow.com/a/5968530/476 – deceze Feb 20 '17 at 15:38
  • Your answer is correct (thank you for the links) and since `str_getcsv('')` returns `[null]` there is hope… – greg Feb 20 '17 at 16:02
  • 2
    `str_getcsv` supposedly conforms to [RFC4180](https://www.ietf.org/rfc/rfc4180.txt), and the distinction you want does not conform to this RFC. Like @deceze, I think you need to write your own parser. It's about 100 lines of code for a hand-made finite state machine that implements the full spec. As for the performance in pure php, well...you'll see :) – Daniel Vérité Feb 21 '17 at 15:13
  • @DanielVérité thank you. RFC4180 quotes _"This memo provides information for the Internet community. It does not specify an Internet standard of any kind."_ It does describe how www application type "text/csv" works and make it compatible with Ms/Excel… that’s unfortunate. I think we have no choice but to rebuild the wheel with a performance (and maybe bugs) impact. – greg Feb 21 '17 at 15:22

0 Answers0