4

I need to parse a CSV file which has this header:

Company;Registered office;Notifying party;Domicile or Registered office;Holdings of voting rights;;;;;;Publication

;;;;directly held;;additionally counted;;total;;in Germany;;in foreign countries

;;;;percentage;single rights;percentage;single rights;percentage;single rights;Official stock exchange

I was wondering whether this is a standard header format, because I expected to have all the fields listed one after another, like (in the first row) "Holdings of voting rights-directly held-percentage;Holdings of voting rights-directly held-single rights", while I see that information spread over three lines.

Currently my file has 6 header lines (the three shown and another three in another language). How can I detect it if more header lines are added one day? The file continues with the following line (the first data row) and so on. The first line of real data isn't always the same.

BBS Kraftfahrzeugtechnik AG;Schiltach;Baumgartner, Heinrich;Deutschland;62,5;;37,5;;100,0;;Börsenzeitung;04.04.2002

I'm also looking for java libraries which are able to parse CSV files.

cdarwin

7 Answers

3

I disagree with those who claim that only a comma is allowed. Wikipedia, for example, gives a case of German CSV which uses semicolons as the field separator (as commas are used as the decimal separator). I think MS Excel is also pretty flexible about which delimiters to use. It's just programmers' minds that tend to gravitate towards the most simplistic case.

For CSV parsing I recommend Ostermiller Utils.

Q> How can I detect it if they add some more header lines one day?
A> You can't. The only thing you can rely on is either a dynamic layout (where you know the column names in advance) or a static layout (where you assume that a given column is always the n-th).
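
To illustrate the "dynamic layout" option, here is a minimal sketch that looks columns up by name instead of by fixed position. The class and method names (`ColumnIndex`, `byName`, `require`) are made up for this example, and it assumes the names you care about appear in a single header line and that no field contains a quoted separator.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: maps header cell text to its column position.
class ColumnIndex {

    // Builds a name -> position map from one header line; empty cells are skipped.
    static Map<String, Integer> byName(String headerLine, String separator) {
        Map<String, Integer> index = new HashMap<>();
        // Limit -1 keeps trailing empty cells so positions stay aligned.
        String[] cells = headerLine.split(Pattern.quote(separator), -1);
        for (int i = 0; i < cells.length; i++) {
            String name = cells[i].trim();
            if (!name.isEmpty()) {
                index.putIfAbsent(name, i); // keep the first occurrence of a repeated name
            }
        }
        return index;
    }

    // Fails loudly when an expected column is missing, e.g. after a layout change.
    static int require(Map<String, Integer> index, String name) {
        Integer position = index.get(name);
        if (position == null) {
            throw new IllegalStateException("Column not found: " + name);
        }
        return position;
    }
}
```

With the first header line from the question, `require(byName(line, ";"), "Publication")` would give position 10 (zero-based). If the producer renames or drops that column, the lookup throws instead of silently reading the wrong field.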

mindas
3

Despite CSV (Comma Separated Value) files having the word comma in their name, I've seen some very weird stuff in the enterprise world.

I would suggest creating your own representation of the data. It sounds like you may be reading in multiple files all formatted a bit differently?

I would approach the problem in a modular fashion. Have importers for the different formats, and bring the data into a normalized representation that you then do what you want with.

This is all assuming that these files contain the same type of data and that you have no control over the files you are receiving.

Even if this is not the case, abstracting the data away from its representation and putting that in a separate project would be useful.
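
A rough sketch of what that modular setup could look like in Java. Everything here is an assumption for illustration: the `Holding` fields, the `HoldingImporter` interface, and the column positions (0, 2 and 8), which are read off the single sample row in the question. There is no error handling for short rows or empty percentage fields.

```java
import java.util.ArrayList;
import java.util.List;

// Normalized record that every importer produces (field choice is illustrative).
final class Holding {
    final String company;
    final String notifyingParty;
    final double totalPercent;

    Holding(String company, String notifyingParty, double totalPercent) {
        this.company = company;
        this.notifyingParty = notifyingParty;
        this.totalPercent = totalPercent;
    }
}

// One implementation per incoming file format.
interface HoldingImporter {
    List<Holding> importLines(List<String> lines);
}

// Importer for the semicolon-separated layout shown in the question.
final class SemicolonHoldingImporter implements HoldingImporter {
    private final int headerLineCount;

    SemicolonHoldingImporter(int headerLineCount) {
        this.headerLineCount = headerLineCount;
    }

    @Override
    public List<Holding> importLines(List<String> lines) {
        List<Holding> result = new ArrayList<>();
        int seen = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;     // ignore blank lines
            if (seen++ < headerLineCount) continue;  // skip the known header lines
            String[] f = line.split(";", -1);
            // Assumed positions: 0 = company, 2 = notifying party, 8 = total percentage.
            // "100,0" uses a decimal comma, so swap it before parsing.
            double total = Double.parseDouble(f[8].replace(',', '.'));
            result.add(new Holding(f[0], f[2], total));
        }
        return result;
    }
}
```

The rest of the program only ever sees `Holding` objects, so a second file format just means a second `HoldingImporter` implementation.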

I would also recommend the use of OpenCSV

Casey
2

This is not a CSV file. You need to get the specification for the file from whoever is generating it.

CSV files are Comma-Separated-Values, with one record per line. It's a loose specification with regards to how to escape commas and escape characters. Excel uses double quotes around values, and then doubled-up double-quotes.
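
For reference, the Excel-style quoting described above can be handled with a small hand-written splitter. This is only a sketch using the standard library: `CsvLineSplitter` and `splitLine` are made-up names, the separator is a parameter, and it does not handle quoted fields that span several lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-line splitter with Excel-style quoting.
class CsvLineSplitter {

    static List<String> splitLine(String line, char separator) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote inside quotes is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // separators inside quotes are plain text
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == separator) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing separator
        return fields;
    }
}
```

For real files a maintained library is the safer choice; the point here is just that the separator can be any character while the quoting rules stay the same.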

Lou Franco
  • the records are one per line... it seems that only the header is quite "strange" – cdarwin Jan 06 '11 at 14:03
  • CSV doesn't have "headers", only rows. A row can contain a header or data, or be used for any other purpose. – Peter Lawrey Jan 06 '11 at 14:08
  • Openoffice lets me choose the separator when opening the file, and the semicolon can be used too – cdarwin Jan 06 '11 at 14:08
  • Apart from the empty line, the file is CSV format, the separator is `;`. A comma (`,`) is not strictly required as a separator and would almost always conflict with decimal values. – Andreas Dolk Jan 06 '11 at 14:10
2

Yes, you have a legitimate CSV file. I read it successfully into Excel, and suspect I would have no problem with OpenOffice. For Excel, I saved it as a .txt file, but then had to tell Excel in the opening dialogue that it was delimited by semicolons.

This is "standard" in the sense that it is separating columns by a delimiter (semicolons are OK, as are tabs and of course commas) and rows by new lines.

You were given this format because the second and third header lines are sub-headings of the first line rather than independent columns. "Holdings of voting rights" spans 6 columns. Underneath it, on the second header line, "directly held" spans 2 columns, as do "additionally counted" and "total". The third header line breaks each entry of the second header line down into "percentage" and "single rights".
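
If you want the flattened names you expected ("Holdings of voting rights-directly held-percentage" and so on), the spanning can be undone by carrying each group label to the right until the next label appears. A minimal sketch, assuming the header lines are already known and contain no quoted separators; `HeaderFlattener` and `flatten` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper that joins stacked header lines into one name per column.
class HeaderFlattener {

    static List<String> flatten(List<String> headerLines, String separator) {
        List<String[]> rows = new ArrayList<>();
        int width = 0;
        for (String line : headerLines) {
            String[] cells = line.split(Pattern.quote(separator), -1);
            rows.add(cells);
            width = Math.max(width, cells.length);
        }
        // carried[r] is the most recent non-empty label seen on header row r.
        String[] carried = new String[rows.size()];
        Arrays.fill(carried, "");
        List<String> names = new ArrayList<>();
        for (int col = 0; col < width; col++) {
            List<String> parts = new ArrayList<>();
            for (int r = 0; r < rows.size(); r++) {
                String[] cells = rows.get(r);
                String cell = col < cells.length ? cells[col].trim() : "";
                boolean lastRow = (r == rows.size() - 1);
                if (!cell.isEmpty()) {
                    carried[r] = cell;
                    // A new group label ends the sub-groups nested under the old one.
                    for (int k = r + 1; k < carried.length; k++) carried[k] = "";
                } else if (!lastRow) {
                    cell = carried[r]; // empty cell: still inside the spanning label
                }
                if (!cell.isEmpty()) parts.add(cell);
            }
            names.add(String.join("-", parts));
        }
        return names;
    }
}
```

Only the upper rows are carried to the right; the last header row is taken as-is, since an empty cell there just means the column has no leaf label.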

I don't think you will easily be able to find when the headers stop and the data begins. This is a semantic problem -- one of meaning. It is easier for a human, though!
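
That said, if you are willing to make an assumption about the data, a heuristic can guess where it starts. The sketch below assumes every data row has at least one field that is a decimal-comma number such as "62,5", and that no header row does. This is a guess based on the one sample row in the question, not something the format guarantees. `DataStartGuesser` is a made-up name.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical heuristic: the first row containing a decimal-comma number is data.
class DataStartGuesser {

    // Whole-field match for values such as "62,5" or "100,0".
    private static final Pattern DECIMAL_COMMA = Pattern.compile("\\d+,\\d+");

    static boolean looksLikeData(String line) {
        for (String field : line.split(";", -1)) {
            if (DECIMAL_COMMA.matcher(field.trim()).matches()) {
                return true;
            }
        }
        return false;
    }

    // Returns the index of the first line that looks like data, or -1 if none does.
    static int firstDataLine(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (looksLikeData(lines.get(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

A row whose percentage fields are all empty would be misread as a header, so treat the result as a hint and compare it against the number of header lines you expect.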

rajah9
1

There is no standard header format. It can be seen as a convention that the first line is a delimiter-separated list of values representing the column headers.

In your case, your table has three header lines (my guess based on counting cells and comparing with the content of your data example).

It is still CSV, but you have to know in advance which line is the first line holding actual data. There is no clue given by the format itself.

Andreas Dolk
  • I downloaded the file I'm referring to myself, and I'm sure there are carriage returns (^M) at the end of the "header" lines... So this seems to be a weird header, or something like a header. Semicolons are used instead of commas, as commas are used inside the text – cdarwin Jan 06 '11 at 14:06
1

As far as CSV headers go, there is no standard format. In most cases, we assume that the first line is a header. If the header spans multiple lines (which I am seeing for the first time here), then you would need to know the number of header lines before you start parsing this file. At least that is a start.

The next assumption in CSV files is generally that one line is one row or record. So usually headers and data are separated by a newline. In your case, I am not sure how you are generating the file and how it is planned to be used.

Sachin Shanbhag
  • 2
    [It is CSV](http://en.wikipedia.org/wiki/Comma-separated_values), the separator is `;`. A comma (`,`) is not strictly required as a separator and would almost always conflict with decimal values. – Andreas Dolk Jan 06 '11 at 14:09
  • 1
    CSV separators can be different depending on the region. OP is from Germany, and comma is decimal separator there (almost all of Europe) (which is "." in UK/USA), semicolon is used as the separator in CSV. See http://en.wikipedia.org/wiki/Comma-separated_values – Nivas Jan 06 '11 at 14:21
  • @Andreas_D, @Nivas - This is something new for me. Thanks guys. Have edited my answer to remove that csv statement. – Sachin Shanbhag Jan 06 '11 at 14:22
1

With regards to CSV parsing libraries, I would highly recommend OpenCSV.

Also see: Can you recommend a Java library for reading (and possibly writing) CSV files?

Community
dogbane