19

I want to write a program for a school java project to parse some CSV I do not know. I do know the datatype of each column - although I do not know the delimiter.

The problem I do not even marginally know how to fix is to parse Date or even DateTime Columns. They can be in one of many formats.

I found many libraries but have no clue which is the best for my needs: http://opencsv.sourceforge.net/ http://www.csvreader.com/java_csv.php http://supercsv.sourceforge.net/ http://flatpack.sourceforge.net/

The problem is I am a total java beginner. I am afraid non of those libraries can do what I need or I can't convince them to do it.

I bet there are a lot of people here who have code sample that could get me started in no time for what I need:

  • automatically split in Columns (delimiter unknown, Columntypes are known)
  • cast to Columntype (should cope with $, %, etc.)
  • convert dates to Java Date or Calendar Objects

It would be nice to get as many code samples as possible by email.

Thanks a lot! AS

Andy Schmidt
  • 251
  • 1
  • 3
  • 9
  • In saying that the delimiter is unknown, does that mean that it is unknown when the input is being processed or when the code is being written (but could be captured before processing the input)? – jt. May 09 '09 at 20:44
  • 1
    Am I missing something here? If it is CSV data, the delimitter is a comma. If it isn't, then it's not in Comma Separated Values format. –  May 09 '09 at 21:25
  • 1
    @Neil: csv is often used with ; or tab as separator (try a german excel). – Arne Burmeister Feb 24 '10 at 23:08
  • csv does not have any type info. The currency or number format is in excel, but not exported to csv. – Arne Burmeister Feb 24 '10 at 23:09
  • @ArneBurmeister Technically, CSV uses a comma while DSV (Delimiter Separated Values) uses comma, tab, semicolon, colon or something else. – Agi Hammerthief May 25 '17 at 10:53

10 Answers10

19

You also have the Apache Commons CSV library, maybe it does what you need. See the guide. Updated to Release 1.1 in 2014-11.

Also, for the foolproof edition, I think you'll need to code it yourself...through SimpleDateFormat you can choose your formats, and specify various types, if the Date isn't like any of your pre-thought types, it isn't a Date.

Basil Bourque
  • 303,325
  • 100
  • 852
  • 1,154
Valentin Rocher
  • 11,667
  • 45
  • 59
  • 7
    Apache Commons "Last Published: 17 May 2008 | Version: 1.0-SNAPSHOT" "There are currently no official downloads, and will not be until CSV moves out of the Sandbox, but a nightly build is available from ..." Hmm, not too convincing... – pihentagy Jun 03 '11 at 12:01
  • @pihentaguy : I know, but all the other ones are older...Anyways, in a production, I'd advise coding a specific parser, as this is not too difficult. – Valentin Rocher Jun 08 '11 at 08:20
  • **[OpenCSV](http://sourceforge.net/projects/opencsv/files/opencsv/2.3/)** was updated at 2011-04-15 according to their SF page. – expert Dec 14 '11 at 01:44
  • 2
    **UPDATE** [Apache Commons CSV](https://commons.apache.org/proper/commons-csv/) did succeed as a project. Version 1.1 released this month (2014-11). – Basil Bourque Nov 25 '14 at 07:28
18

There is a serious problem with using

String[] strArr=line.split(",");

in order to parse CSV files, and that is because there can be commas within the data values, and in that case you must quote them, and ignore commas between quotes.

There is a very very simple way to parse this:

/**
* returns a row of values as a list
* returns null if you are past the end of the input stream
*/
public static List<String> parseLine(Reader r) throws Exception {
    int ch = r.read();
    while (ch == '\r') {
        //ignore linefeed chars wherever, particularly just before end of file
        ch = r.read();
    }
    if (ch<0) {
        return null;
    }
    Vector<String> store = new Vector<String>();
    StringBuffer curVal = new StringBuffer();
    boolean inquotes = false;
    boolean started = false;
    while (ch>=0) {
        if (inquotes) {
            started=true;
            if (ch == '\"') {
                inquotes = false;
            }
            else {
                curVal.append((char)ch);
            }
        }
        else {
            if (ch == '\"') {
                inquotes = true;
                if (started) {
                    // if this is the second quote in a value, add a quote
                    // this is for the double quote in the middle of a value
                    curVal.append('\"');
                }
            }
            else if (ch == ',') {
                store.add(curVal.toString());
                curVal = new StringBuffer();
                started = false;
            }
            else if (ch == '\r') {
                //ignore LF characters
            }
            else if (ch == '\n') {
                //end of a line, break out
                break;
            }
            else {
                curVal.append((char)ch);
            }
        }
        ch = r.read();
    }
    store.add(curVal.toString());
    return store;
}

There are many advantages to this approach. Note that each character is touched EXACTLY once. There is no reading ahead, pushing back in the buffer, etc. No searching ahead to the end of the line, and then copying the line before parsing. This parser works purely from the stream, and creates each string value once. It works on header lines, and data lines, you just deal with the returned list appropriate to that. You give it a reader, so the underlying stream has been converted to characters using any encoding you choose. The stream can come from any source: a file, a HTTP post, an HTTP get, and you parse the stream directly. This is a static method, so there is no object to create and configure, and when this returns, there is no memory being held.

You can find a full discussion of this code, and why this approach is preferred in my blog post on the subject: The Only Class You Need for CSV Files.

AgilePro
  • 5,588
  • 4
  • 33
  • 56
  • For all the concern about "overly complex solutions" this seems overly complex in and of itself for what it is. Using the `Pattern` class would be far easier than this approach. – searchengine27 Oct 09 '17 at 23:59
  • I lot of people think that, but actually just because you end up with fewer lines does not mean that you have a less complex solution. The provided code doesn't do anything complex or intricate. It deals with each character and it has only a single major mode: whether you are in a quoted string or not. It is easy to follow and debug if you had to, and there is no "hidden" processing. I have seen many WRONG examples that use regular expressions and yet fail to parse it properly. I would be glad to look at an actual running example, but I doubt it will be simpler than the above code. – AgilePro Oct 10 '17 at 02:24
  • Setting aside the fact that there is no CSV standard so technically there is no such thing as a valid CSV parser: `Pattern.compile("([^\",]*?|\"(\\\\\"|[^\"])*\"),(.*)");`. That is way simpler. – searchengine27 Oct 10 '17 at 21:27
  • 1
    You have only done part of the job. You then need to "unquote" the strings that have quotes around them, and of course you need the relatively trivial loop that reads from the Reader, loops for each of the found values, and puts the result into the list. Seriously, it would be interesting to see what the final actual working method would be in order to compare results. – AgilePro Oct 11 '17 at 22:10
  • OK seriously now...you're just splitting hairs. That is the working solution. I'm not going to include my javadocs, my log statements, and my mother's maiden name into a comment...let's get real here and how about you stop disagreeing just to disagree. If you're not going to provide a solid reason to disagree, then I am genuinely not interested in a response, if we're being completely honest. – searchengine27 Oct 11 '17 at 23:01
  • Your parser is "broken" for unquoted fields containing quotes which are just part of the text like e.g., `795851;Monin "Chai" Tee Sirup 0,7l;6`. This is a real example and your parser just drops the quotes. That's no real bug as the behavior is unspecified, but files from maybe five of twenty different sources I get suffer from this. – maaartinus Nov 18 '17 at 00:55
  • My understanding is that if you have a quote in the text, you MUST put quotes around the entire value. According to RFC 4180 "Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. " https://tools.ietf.org/html/rfc4180 I realize that there is no definitive definition, but this seems quite reasonable. – AgilePro Nov 21 '17 at 05:50
  • @AgilePro I agree with your approach of reading one character at a time. However I disagree that a parser should fail on bad data. I think a parser should skip over the bad data (generate a meaningful message) and keep going. – John Henckel Jan 19 '18 at 15:28
  • I created a class that is more permissive. https://gist.github.com/jdhenckel/311c73f199f628bf2106348f32405658 – John Henckel Jan 19 '18 at 15:38
  • For those claiming that a giant ass regex and all the baggage with the regex; you are wrong. Try timing the two different methods of parsing. You will very quickly find this is going to be much faster and more consistent. Though not a bug, maaatinus' note is worth dealing with. – Tatarize Oct 25 '18 at 03:54
  • @JohnHenckel your gist lacks a license. – Tatarize Oct 25 '18 at 03:59
  • I created a better function for it. Uses a case statement and more importantly wraps into an iterable so you can use and extended for statement to easily deal with the parsed material. https://gist.github.com/tatarize/79aea7442fd5a173e639378748e747c9 – Tatarize Oct 25 '18 at 05:00
  • @Tatarize well done. I've added CC0 to mine. thanks – John Henckel Oct 29 '18 at 14:13
7

My approach would not be to start by writing your own API. Life's too short, and there are more pressing problems to solve. In this situation, I typically:

  • Find a library that appears to do what I want. If one doesn't exist, then implement it.
  • If a library does exist, but I'm not sure it'll be suitable for my needs, write a thin adapter API around it, so I can control how it's called. The adapter API expresses the API I need, and it maps those calls to the underlying API.
  • If the library doesn't turn out to be suitable, I can swap another one in underneath the adapter API (whether it's another open source one or something I write myself) with a minimum of effort, without affecting the callers.

Start with something someone has already written. Odds are, it'll do what you want. You can always write your own later, if necessary. OpenCSV is as good a starting point as any.

Brian Clapper
  • 25,705
  • 7
  • 65
  • 65
3

i had to use a csv parser about 5 years ago. seems there are at least two csv standards: http://en.wikipedia.org/wiki/Comma-separated_values and what microsoft does in excel.

i found this libaray which eats both: http://ostermiller.org/utils/CSV.html, but afaik, it has no way of inferring what data type the columns were.

ekangas
  • 843
  • 1
  • 10
  • 17
Ray Tayek
  • 9,841
  • 8
  • 50
  • 90
  • 2
    The license for this is GPL. If you want to use this in a commercial product or with non-GPL code it is NOT a solution. – Brendan Mar 17 '14 at 04:34
2

You might want to have a look at this specification for CSV. Bear in mind that there is no official recognized specification.

If you do not now the delimiter it will not be possible to do this so you have to find out somehow. If you can do a manual inspection of the file you should quickly be able to see what it is and hard code it in your program. If the delimiter can vary your only hope is to be able to deduce if from the formatting of the known data. When Excel imports CSV files it lets the user choose the delimiter and this is a solution you could use as well.

willcodejavaforfood
  • 43,223
  • 17
  • 81
  • 111
1

I agree with @Brian Clapper. I have used SuperCSV as a parser though I've had mixed results. I enjoy the versatility of it, but there are some situations within my own csv files for which I have not been able to reconcile "yet". I have faith in this product and would recommend it overall--I'm just missing something simple, no doubt, that I'm doing in my own implementation.

SuperCSV can parse the columns into various formats, do edits on the columns, etc. It's worth taking a look-see. It has examples as well, and easy to follow.

The one/only limitation I'm having is catching an 'empty' column and parsing it into an Integer or maybe a blank, etc. I'm getting null-pointer errors, but javadocs suggest each cellProcessor checks for nulls first. So, I'm blaming myself first, for now. :-)

Anyway, take a look at SuperCSV. http://supercsv.sourceforge.net/

Davidson
  • 1,064
  • 3
  • 20
  • 35
0

Writing your own parser is fun, but likely you should have a look at Open CSV. It provides numerous ways of accessing the CSV and also allows to generate CSV. And it does handle escapes properly. As mentioned in another post, there is also a CSV-parsing lib in the Apache Commons, but that one isn't released yet.

Ichthyo
  • 8,038
  • 2
  • 26
  • 32
0

At a minimum you are going to need to know the column delimiter.

Richard West
  • 2,166
  • 4
  • 26
  • 40
  • 1
    Not necessarily. If he knows the datatype of the first column, he can just consume the first line while it conforms to the datatype. Then, the first character will be the delimiter. – fishlips May 09 '09 at 20:41
  • OK - let's say I know the delimiter. Can you provide me with a working that shows how I can do this thing? Especially bringing Dates into Java and converting Numbers that have things lik $, %, etc. ? – Andy Schmidt May 09 '09 at 21:03
  • Just so I understand, are you wanting to store "$9.99" as the value 9.99 in a numeric field? – Richard West May 09 '09 at 21:24
0

Basically you will need to read the file line by line.

Then you will need to split each line by the delimiter, say a comma (CSV stands for comma-separated values), with

String[] strArr=line.split(",");

This will turn it into an array of strings which you can then manipulate, for example with

String name=strArr[0];
int yearOfBirth = Integer.valueOf(strArr[1]);
int monthOfBirth = Integer.valueOf(strArr[2]);
int dayOfBirth = Integer.valueOf(strArr[3]);
GregorianCalendar dob=new GregorianCalendar(yearOfBirth, monthOfBirth, dayOfBirth);
Student student=new Student(name, dob); //lets pretend you are creating instances of Student

You will need to do this for every line so wrap this code into a while loop. (If you don't know the delimiter just open the file in a text editor.)

Leonard Ehrenfried
  • 1,573
  • 3
  • 20
  • 34
  • I was talking about more elaborate example using one of the example libraries listed above. The trivial things like string.split() I thought about myself ;) I was talking about foolproof parsing of VARIOUS Date formats and Numbers which contain $ and %. – Andy Schmidt May 09 '09 at 21:21
  • May I ask what the $s and %s stand for in your dates and numbers? – Leonard Ehrenfried May 09 '09 at 21:35
  • 19
    splitting on commas is not safe - CSVs can have strings that contain commas. The opencsv and Apaches libraries take care of all of this parsing - best to use them. – Kevin Day May 10 '09 at 06:31
0

I would recommend that you start by pulling your task apart into it's component parts.

  1. Read string data from a CSV
  2. Convert string data to appropriate format

Once you do that, it should be fairly trivial to use one of the libraries you link to (which most certainly will handle task #1). Then iterate through the returned values, and cast/convert each String value to the value you want.

If the question is how to convert strings to different objects, it's going to depend on what format you are starting with, and what format you want to wind up with.

DateFormat.parse(), for example, will parse dates from strings. See SimpleDateFormat for quickly constructing a DateFormat for a certain string representation. Integer.parseInt() will prase integers from strings.

Currency, you'll have to decide how you want to capture it. If you want to just capture as a float, then Float.parseFloat() will do the trick (just use String.replace() to remove all $ and commas before you parse it). Or you can parse into a BigDecimal (so you don't have rounding problems). There may be a better class for currency handling (I don't do much of that, so am not familiar with that area of the JDK).

Kevin Day
  • 16,067
  • 8
  • 44
  • 68