Tips for writing a file parser in Java?

Question

EDIT: I'm mostly parsing "comma-separated values"; fuzzy brought that term to my attention.

Interpreting the blocks of CSV is the main question here.

I know how to read the file into something like a String[] and some of the basic features of String, but I don't think using methods like contains() and analyzing everything character by character will work.

What are some ways I can do this in a smarter way?

Example of a line:

-barfoob: boobs, foob, "foo bar"

No, I'm not following any standards or going to use XML; it would just unnecessarily complicate things. — defectivehalt, Jan 27 '10 at 12:59
@Kavon, if your input files aren't XML, then you might want to post sample contents of your input files, because the best way to parse each file would depend on what it is you are parsing... — bguiz, Jan 27 '10 at 13:22
Wait a second: you believe that using a non-standard format that requires you to write your own parser does *not* unnecessarily complicate things?? — Michael Borgwardt, Jan 27 '10 at 13:57
@Michael, It would complicate things for a person who is manually editing/adding data to the file. I don't care if it gets complicated for me. — defectivehalt, Jan 27 '10 at 14:12
that example doesn't look formatted anything remotely like HTML. — , Jan 27 '10 at 14:17
@fuzzy, That is why I said it is similar to html only in regards to how it separates blocks of data like in the example. — defectivehalt, Jan 27 '10 at 14:31
no you state "formatted like HTML", and your example is NOT formatted like HTML, not even close. What you posted as an example is more like CSV with a : as the first separator. Your example is NOT self documenting like HTML or XML at all. There is no obvious tag, attributes or data, it is just a bunch of words colon and comma separated — , Jan 27 '10 at 14:36
@fuzzy, You're assuming that's a full example. It's a HTML/CSV hybrid. — defectivehalt, Jan 27 '10 at 15:24
you don't state that is a partial example, post a full example if you want an actual answer, I still think YAML is smarter than trying to create your own half baked solution, and nothing about that partial example even looks like HTML — , Jan 27 '10 at 15:27

score 7 · Answer 1 · answered Jan 27 '10 at 14:06

There's a reason that everyone assumes you're talking about XML: inventing a proprietary text-based file format requires very strong justification in the face of the maturity and easy availability of XML parsers.

And your question indicates that you have very little prior knowledge about parsers (otherwise you'd be writing an ANTLR or JavaCC grammar instead of asking this question) - which is another strong argument against rolling your own, except as a learning experience.

Well, yes, it is mostly a learning experience. The proprietary aspect is also very justified. — defectivehalt, Jan 27 '10 at 15:17

score 6 · Answer 2 · edited Jan 27 '10 at 02:06

Since the input is "formatted similarly to HTML", your data is probably best represented as a tree-like structure, and it is likely XML or something close to XML.

If this is the case, I propose the smartest way to parse your file is to use an XML parser.

Here are some resources you may find helpful:

A chapter on XML parsing from Sun: http://java.sun.com/developer/Books/xmljava/ch03.pdf
An article that might help you get started quickly: http://onjava.com/pub/a/onjava/2002/06/26/xml.html

HTH

edited Jan 27 '10 at 02:06

answered Jan 27 '10 at 02:01

bguiz

The data is not XML and if it were it would look horrendous and not be human friendly. – defectivehalt Jan 27 '10 at 13:23

score 2 · Answer 3 · edited May 23 '17 at 10:33

If the document is valid XML, then any of the other answers will work. If it's not, you'll have to lex.

edited May 23 '17 at 10:33

Community

answered Jan 27 '10 at 02:10

Dan Rosenstark

score 2 · Answer 4 · edited Jan 27 '10 at 14:51

You should look at ANTLR. Even if you want to write the parser yourself, ANTLR is a great alternative. Or at least look at YAML.

edited Jan 27 '10 at 14:51

answered Jan 27 '10 at 14:15

score 2 · Accepted Answer · edited May 23 '17 at 12:18

This and digging through Wikipedia for related articles will probably suffice.

edited May 23 '17 at 12:18

Community

answered Jan 27 '10 at 15:37

defectivehalt

score 2 · Answer 6 · answered Jan 27 '10 at 23:16

I think the java.util.Scanner will help you. Have a look at http://java.sun.com/javase/6/docs/api/java/util/Scanner.html
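
For illustration, here is a minimal sketch of how Scanner might be applied to the value part of the sample line. It assumes the key has already been cut off and that a comma with optional surrounding whitespace separates values. The class name ScannerSketch and the method tokens are made up for this example. Note that Scanner knows nothing about quotes: they stay in the token, and a comma inside quotes would still split the value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class ScannerSketch {
    // Hypothetical helper: splits the text after the key into raw tokens.
    static List<String> tokens(String values) {
        List<String> out = new ArrayList<>();
        // Assumption: a comma with optional whitespace around it separates tokens.
        Scanner scanner = new Scanner(values).useDelimiter("\\s*,\\s*");
        while (scanner.hasNext()) {
            out.add(scanner.next());
        }
        scanner.close();
        return out;
    }

    public static void main(String[] args) {
        // prints [boobs, foob, "foo bar"]
        System.out.println(tokens("boobs, foob, \"foo bar\""));
    }
}
```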

answered Jan 27 '10 at 23:16

Jonas

wow I never thought of using Scanner, thanks! – defectivehalt Jan 28 '10 at 05:48

score 1 · Answer 7 · answered Jan 27 '10 at 02:04

Depending on how complicated your "schema" is, a regular expression might be what you want. If there is a lot of nesting then it might be easiest to convert to XML or JSON and use a prebuilt parser.
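
As a rough sketch of the regular-expression route for a flat line such as the one in the question: one pattern cuts the key from the rest, and a second one picks out either a quoted value or a bare value. Both patterns and the names RegexSketch, LINE and VALUE are assumptions made for this example, not a definition of the asker's format. Empty values are silently dropped, and escaped quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexSketch {
    // Assumed line shape: key, a colon, whitespace, then the values.
    private static final Pattern LINE = Pattern.compile("^(\\S+):\\s+(.*)$");
    // Either a double-quoted run (group 1) or a run without commas or quotes (group 2).
    private static final Pattern VALUE = Pattern.compile("\"([^\"]*)\"|([^,\"]+)");

    // Returns the key followed by its values.
    static List<String> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected line: " + line);
        }
        List<String> out = new ArrayList<>();
        out.add(m.group(1));
        Matcher v = VALUE.matcher(m.group(2));
        while (v.find()) {
            // Quoted text is kept verbatim; bare text is trimmed.
            String token = v.group(1) != null ? v.group(1) : v.group(2).trim();
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [-barfoob, boobs, foob, foo bar]
        System.out.println(parse("-barfoob: boobs, foob, \"foo bar\""));
    }
}
```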

answered Jan 27 '10 at 02:04

mlathe

score 1 · Answer 8 · answered Jan 27 '10 at 15:47

People are right about standard formats being best practice, but let's set that aside.

Assuming that the example you give is representative, the task is pretty trivial.

You show a line with an initial token, demarked with a colon-space, then a list of comma-separated values. Separate at that first colon-space, and then use split() on the part to the right. Handling of the quotes is trivial, too.
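
A minimal sketch of those steps, assuming every line has exactly the shape of the example. One swap is made on purpose: a plain split(",") would cut a quoted value that itself contains a comma, so the sketch replaces split() with a small loop that tracks whether it is inside quotes. The names LineParser, parse and splitValues are invented for the example, and whitespace at the edges of a quoted value is trimmed away.

```java
import java.util.ArrayList;
import java.util.List;

final class LineParser {
    // Returns the key followed by its values.
    static List<String> parse(String line) {
        int sep = line.indexOf(": ");
        if (sep < 0) {
            throw new IllegalArgumentException("no ': ' separator in: " + line);
        }
        List<String> out = new ArrayList<>();
        out.add(line.substring(0, sep).trim());
        out.addAll(splitValues(line.substring(sep + 2)));
        return out;
    }

    // Splits on commas that are not inside double quotes and drops the quotes.
    static List<String> splitValues(String text) {
        List<String> values = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;          // entering or leaving a quoted value
            } else if (c == ',' && !inQuotes) {
                values.add(current.toString().trim());
                current.setLength(0);          // start the next value
            } else {
                current.append(c);
            }
        }
        values.add(current.toString().trim()); // the last value has no trailing comma
        return values;
    }

    public static void main(String[] args) {
        // prints [-barfoob, boobs, foob, foo bar]
        System.out.println(parse("-barfoob: boobs, foob, \"foo bar\""));
    }
}
```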

score 1 · Answer 9 · answered Jan 27 '10 at 22:40

After looking at your sample input, I fail to see any resemblance to HTML or XML:

-barfoob: boobs, foob, "foo bar"

If this is what you want to parse, I have an alternative suggestion, to use the Java properties parser (comes with standard Java), and then parse the remainder of each line using your own custom code. You will need to refactor your format somewhat in order for this to work, so it's up to you.

barfoob=boobs, foob, "foo bar"

Java properties will be able to return barfoob as the property name, and boobs, foob, "foo bar" as the property value. That's where you can use your custom code to split the property value into boobs, foob and foo bar.
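
A short sketch of the loading half, using java.util.Properties with an in-memory string standing in for the file (a real program would pass a FileReader or an InputStream instead). The class name PropertiesSketch is made up for the example. The properties format also documents ':' as a key/value separator, so the original -barfoob: ... line may well load without being rewritten; that is worth checking against your real files before relying on it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class PropertiesSketch {
    // Loads properties from text; the in-memory input is an assumption for the demo.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load("barfoob=boobs, foob, \"foo bar\"\n");
        // The value is the raw text after '='; the quotes are still in it.
        // prints boobs, foob, "foo bar"
        System.out.println(props.getProperty("barfoob"));
    }
}
```

Splitting the returned value into boobs, foob and foo bar is still up to your own code.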

score 1 · Answer 10 · answered Jan 27 '10 at 23:07

I'd strongly advise against reinventing the wheel: use an existing solution like Flatworm, Fixedformat4j or jFFP, all of which can parse positional or comma-separated value files (personally, I recommend Flatworm).

answered Jan 27 '10 at 23:07

Pascal Thivent

score 0 · Answer 11 · answered Jan 27 '10 at 02:04

You may be able to use the Neko HTML parser to some degree. It depends on how it handles the non-standard HTML.

answered Jan 27 '10 at 02:04

Damo

score 0 · Answer 12 · answered Jan 27 '10 at 02:06

If the XML is valid, I personally prefer using http://www.xom.nu simply because it features a nice DOM model. As pointed out, though, there are parsers in J2SE.

answered Jan 27 '10 at 02:06

What on earth is wrong with adding a preference for an XML library? – Jan 27 '10 at 18:24
