0

Let's assume I have the following data. This is exported from LibreOffice as CSV, so I assume this is a correct CSV-format. When I import this CSV into LibreOffice again, I will correctly see the multi-line text in the cell.

Data: MultilineLabels.csv

1,Simple,1.3
2,Single line,2.3
3,"Multiline
label",3.3
4,Simple again,4.3
5,Multiline\nlabel,5.3
6,Simple again,6.3

Now, however, if I want to plot this with the following gnuplot script:

Script:

### How to handle CSV and multi-line labels in gnuplot?
reset session

FILE = "MultilineLabels.csv"
set datafile separator comma

set format x "\n"

plot FILE u 1:3:xtic(2) w lp pt 7 lc "red"
### end of script

I get the following output:

Result:

enter image description here

So, the point and label at x=3, i.e. line 3 and line 4 of the CSV are not plotted for obvious reasons: gnuplot simply interprets this as text file and has no special CSV input filter.

In principle, I could use some external tools (or maybe even gnuplot itself) to replace all newlines within matching double quotes by \n.

Would this be the only solution or are there better solutions?

theozh
  • 22,244
  • 5
  • 28
  • 72

2 Answers2

1

There is no formal standard definition of a CSV file. An empirical formalization of CSV files found in the wild was presented in RFC 4180. RFC 4180 states "Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes." It also notes that "however some programs, such as Microsoft Excel, do not use double quotes at all".

Unfortunately there is no guarantee that the exporting program and the importing program agree on how to handle corner cases like embedded newlines. So I fear that confusion about what is or is not a correctly formatted CSV file exists already before you try to read one into gnuplot. Your gnuplot script deals with one possible variant but may well fail on others.

Staying strictly within gnuplot itself, export/import of embedded newlines can be achieved by variant 3 in your test file: "multiline\nlabel", which is at least consistent with RFC 4180.

I tend to agree with you that gnuplot could try harder to interpret an input file like your test case.

Ethan
  • 13,715
  • 2
  • 12
  • 21
  • Thank you for this background information. And again MS and Excel with something special ;-). – theozh Sep 13 '22 at 17:31
  • I quickly checked, MS Excel 2016 saves the CSV identical to the example in the question, i.e. _with_ double quotes. Maybe earlier versions were different. This RFC4180 is from 2005. Another topic will be double quotes in a cell: it puts the double quote part into double quotes. So, you need to search for double quote pairs. Or in another way: if you have an odd number of double quotes, connect as many lines until you get an even number again and replace all newlines with `\n`. But then gnuplot will have to take care about escaping the double quotes. So, it might get a bit more complicated. – theozh Sep 14 '22 at 06:42
0

Parsing a CSV-file can certainly get more complicated than for this simple example below. Linux users probably have some tools for this.

I prefer gnuplot-only solutions (hence platform-independent), although they maybe cannot compete in terms of speed and efficiency with specialized external tools.

Here is a very "simple" but not very robust gnuplot-only solution which simply adds two successive lines if the first line contains an odd number of double quotes (definitely room for improvements!). For this to work, you need to load the data 1:1 in a datablock, and furthermore, since you are indexing datablocks you need gnuplot>=5.2.0.

Data: SO73704046.csv

1,Simple,1.3
2,Single line,2.3
3,"Multiline
label",3.3
4,Simple again,4.3
5,Multiline\nlabel,5.3
6,Simple again,6.3

Script: (requires gnuplot>=5.2.0)

### How to handle CSV and multi-line labels in gnuplot?
reset session

FILE = 'SO73704046.csv'

FileToDatablock(f,d) = GPVAL_SYSNAME[1:7] eq "Windows" ? \
                       sprintf('< echo   %s ^<^<EOD  & type "%s"',d,f) : \
                       sprintf('< echo "\%s   <<EOD" & cat  "%s"',d,f)     # Linux/MacOS
load FileToDatablock(FILE,'$DataCSV')

oddDQ(s) = int(sum[j=1:strlen(s)] (s[j:j] eq '"'))%2    # returns 1 if string contains odd number of double quotes, otherwise 0

set print $Data
    c = 1
    while c<=|$DataCSV| {
        if (oddDQ($DataCSV[c])) {
            s = $DataCSV[c]
            print s[1:strlen(s)-1].'\n'.$DataCSV[c+1]
            c=c+2
        }
        else {
            print $DataCSV[c]
            c=c+1
        }
    }
set print

set datafile separator comma
set format x "\n"

plot $Data u 1:3:xtic(2) w lp pt 7 lc "red"
### end of script

Result:

Datablock $Data:

1,Simple,1.3
2,Single line,2.3
3,"Multiline\nlabel",3.3
4,Simple again,4.3
5,Multiline\nlabel,5.3
6,Simple again,6.3

enter image description here

theozh
  • 22,244
  • 5
  • 28
  • 72
  • To accommodate more than triple multiple lines, it may be necessary to repeat the joining of lines until the number of '"' is no longer odd. – binzo Sep 14 '22 at 09:20
  • @binzo Yes, that's also my understanding, see my comment to Ethan's answer. I haven't had this case and so far no double quotes in the cells. I will try to adapt the script. – theozh Sep 14 '22 at 09:46