20

I'm looking for a parser in Java that can parse a document formatted in SGML.

For duplicate monitors: I'm aware of the two other threads that discuss this topic, "Parsing Java String with SGML" and "Java SGML to XML conversion?". Neither has a resolution, hence the new topic.

For people who confuse XML with SGML: Please read this: http://www.w3.org/TR/NOTE-sgml-xml-971215#null (in short, there are enough subtle differences that an XML parser in its vanilla form is unusable for SGML)

For people who are fond of asking posters to Google it: I already did and the closest I could come up with was the widely popular SAXParser: http://download.oracle.com/javase/1.4.2/docs/api/javax/xml/parsers/SAXParser.html But that of course is meant to be an XML parser. I'm looking around to see if anyone has implemented a modification of the SAX Parser to accommodate SGML.

Lastly, I cannot use SX as I'm looking for a Java solution.

Thanks! :)

user183037
  • People still use SGML? I'm genuinely curious - what's it used for in your case? – skaffman Feb 01 '11 at 21:26
  • I have around 2500 documents that are formatted in SGML - I need to use the data for some statistical analysis. I'm hashing together a quick program to determine the distribution of the tags, I looked through a few of them and they only seem to be using a select few tags. In which case, I could easily use the SAXParser. – user183037 Feb 01 '11 at 22:31
  • I have tens of thousands of SGML files, and more are made all the time. SGML is still quite widely used in the publishing industry, however untrendy! – Woody Feb 02 '11 at 19:21
  • Luckily my bunch of tags fit the description of XML tags, so I was able to use the SAXParser. (It was easier to use than the XMLReader - examples of how to implement the XMLReader were surprisingly sparse.) – user183037 Feb 04 '11 at 03:37
  • Wait -- that last comment says "I was able to use the SAX parser." So you've found the answer, no? Why not write an answer and mark this completed? – Charlie Martin Mar 05 '11 at 17:23
  • Yes and no. I was looking for a SGML parser - that was the initial goal. My using a SAX Parser is just a workaround since it fit my current set of documents, and not the solution. – user183037 Mar 10 '11 at 22:49

6 Answers

4

I have a few approaches to this problem.

The first is what you did: check whether the SGML document is close enough to XML for the standard SAX parser to work.

The second is to do the same with HTML parsers. The trick here is to find one that doesn't ignore non-HTML elements.

I did find some Java SGML parsers, mostly from academia, when searching for "sgml parser Java". I do not know how well they work.

The last approach is to take a standard (non-Java) SGML parser and transform the documents into something you can read in Java.

It looks like you were able to work with the first approach.
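The first approach can be tried with very little code. The sketch below assumes the document happens to be well-formed XML. It feeds the text to the JDK's SAX parser and counts start tags, which also gives the tag distribution mentioned in the comments. The class name TagDistribution and the method countTags are made up for illustration. A SAXException is the signal that the document relies on SGML features (omitted end tags, unquoted attributes and so on) and needs one of the other approaches.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of any library.
class TagDistribution {

    /**
     * Counts start tags per element name.
     * Throws SAXException if the text is not well-formed XML.
     */
    static Map<String, Integer> countTags(String document)
            throws ParserConfigurationException, SAXException, IOException {
        final Map<String, Integer> counts = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                counts.merge(qName, 1, Integer::sum);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(document)), handler);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // XML-like SGML parses fine.
        System.out.println(countTags("<DOC><P>one</P><P>two</P></DOC>"));
        try {
            // An omitted end tag is legal in many SGML DTDs but not in XML.
            countTags("<DOC><P>unclosed paragraph</DOC>");
        } catch (SAXException e) {
            System.out.println("Not well-formed XML: " + e.getMessage());
        }
    }
}
```

Running this over a sample of the files first shows quickly whether the whole collection is likely to get through a plain XML parser.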

Kathy Van Stone
2

I use OpenSP via JNI, as it seems there is no pure Java SGML parser. I've written an experimental SAX-like wrapper that is available at http://sourceforge.net/projects/sasgml (of course, it has all the drawbacks of JNI... but was enough for my requirements).

Another approach is to convert the document to XML using sx from OpenSP, and then run a traditional SAX parser on the result.
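As a rough sketch of that second approach, the converter can be started as an external process from Java and its standard output handed straight to a SAX parser. The assumptions here are that the converter is installed on the PATH under the name osx (the name used by OpenSP builds; older SP builds call it sx) and that it writes the XML to standard output. The class SgmlViaConverter and its method are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of OpenSP or the JDK.
class SgmlViaConverter {

    /** Runs the given command and returns the element names found in its XML output. */
    static List<String> elementNames(List<String> command)
            throws IOException, InterruptedException,
                   ParserConfigurationException, SAXException {
        final List<String> names = new ArrayList<>();
        Process process = new ProcessBuilder(command)
                // Let converter warnings go to our stderr so the pipe cannot fill up.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        try (InputStream xml = process.getInputStream()) {
            SAXParserFactory.newInstance().newSAXParser().parse(xml, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    names.add(qName);
                }
            });
            // A fuller version would also inspect the exit status.
            process.waitFor();
        } finally {
            process.destroy(); // harmless if the process has already exited
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: "osx" is on the PATH and args[0] names an SGML file.
        System.out.println(elementNames(Arrays.asList("osx", args[0])));
    }
}
```

This keeps the Java side free of JNI, at the cost of requiring the native tool on every machine that runs the program.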

Javier
1

Java SE includes an HTML parser in the javax.swing.text.html.parser package. It claims in its documentation to be a general SGML parser, but then counterclaims in the documentation that you should only use it with the provided HTML DTD class.

If you put it in lenient mode and your SGML documents don't have a lot of implied end tags, you may get reasonable results.

Read about the parser in its JavaDoc, here: http://docs.oracle.com/javase/6/docs/api/javax/swing/text/html/parser/DocumentParser.html

Create an instance like this:

new DocumentParser(DTD.getDTD("html32"))

Or you could ignore the warnings against using a custom DTD with DocumentParser, and create a subclass of DTD that matches the rules of your own SGML format.

This is clearly not an industrial strength SGML parser, but it should be a good starting point for a one-time data migration effort. I've found it useful in previous projects for parsing HTML.
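A minimal sketch of driving this parser is shown below. It goes through ParserDelegator, the public entry point in the same package that loads the bundled html32 DTD before delegating to DocumentParser, rather than constructing DocumentParser directly. The class name SwingTagCounter is made up for illustration. My understanding is that tags unknown to the DTD are reported through handleSimpleTag, with end tags marked by the HTML.Attribute.ENDTAG attribute, so the sketch counts there as well; treat that as something to verify against your own documents.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: not part of the JDK.
class SwingTagCounter {

    /** Counts the tags reported by the Swing HTML parser, keyed by tag name. */
    static Map<String, Integer> countTags(String markup) throws IOException {
        final Map<String, Integer> counts = new TreeMap<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                counts.merge(tag.toString(), 1, Integer::sum);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Skip simple tags flagged as end tags so each element is counted once.
                if (!attrs.isDefined(HTML.Attribute.ENDTAG)) {
                    counts.merge(tag.toString(), 1, Integer::sum);
                }
            }
        };
        // The last argument tells the parser to ignore any charset declared in the document.
        new ParserDelegator().parse(new StringReader(markup), callback, true);
        return counts;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countTags("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

The parser may also report tags it implies on its own (for example a head element), so the counts can include names that never appear in the input.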

Jonathan Fuerth
1

There is no API for parsing SGML using Java at this time. There also isn't any API or library for converting SGML to XML and then parsing it using Java. Since SGML has been supplanted by XML in all the projects I've worked on until now, I don't think there will ever be any work done in this area, but that is only a guess.

Here is some open source code from a university that does it; however, I haven't tried it, and you would have to search for the other classes it depends on. I believe the only viable solution in Java would require regular expressions.

Also, here is a link for public SGML/XML software.

James Drinkard
0

If it's HTML that you're parsing, this might do:

http://ccil.org/~cowan/XML/tagsoup/

ScootyPuff
0

Although this is a very old post, and I'm not claiming that my answer is perfect, it served my purpose. Here is the code I wrote, which uses a stack to collect the data in the form my case required. I hope it may be helpful for others.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;

// The class name, method signature, and initial values are assumptions;
// the code as posted showed only the body of the method.
public class SgmlHeaderParser {
    private final Stack<String> tagStack = new Stack<>();     // parent tags currently open
    private final Stack<String> historyStack = new Stack<>(); // parent tags already closed
    private final List<String> tagList = new ArrayList<>();   // collected output entries
    private String accessionNumber = "";
    private String dateFiled = "";
    private String formType = "";
    private String cik = "";

    public String getAccessionNumber() { return accessionNumber; }
    public String getDateFiled() { return dateFiled; }
    public String getFormType() { return formType; }

    public List<String> parse(String fileName) throws IOException {
        boolean countStart = true; // true until the first nested parent tag is seen
        int headerTagsCounter = 0; // tag lines read while countStart is true
        int i = 1;                 // 1 until the root tag has been pushed
        String line;
        try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
            while ((line = br.readLine()) != null) {
                line = line.trim();
                int startOfTag = line.indexOf("<");
                int endOfTag = line.indexOf(">");

                if (startOfTag > -1 && endOfTag > startOfTag) {
                    if (countStart)
                        headerTagsCounter++;
                    String currentTag = line.substring(startOfTag + 1, endOfTag);
                    String currentData = line.substring(endOfTag + 1);
                    if (i == 1) {
                        tagStack.push(currentTag);
                        i++;
                    }
                    if (currentData.isEmpty()) { // no data: it's a parent tag
                        if (!currentTag.contains("/")) { // opening tag
                            switch (currentTag) { // these tags are useless in my case, so skip them
                                case "CORRECTION":
                                case "PAPER":
                                case "PRIVATE-TO-PUBLIC":
                                case "DELETION":
                                case "CONFIRMING-COPY":
                                case "CAPTION":
                                case "STUB":
                                case "COLUMN":
                                case "TABLE-FOOTNOTES-SECTION":
                                case "FOOTNOTES":
                                case "PAGE":
                                    break;
                                default: {
                                    countStart = false;
                                    int tagCounterNumber = 0;
                                    String historyTagToRemove = "";
                                    for (String historyTag : historyStack) {
                                        // A repeating tag: append a counter and update the history entry.
                                        if (historyTag.contains(currentTag)) {
                                            historyTagToRemove = historyTag;
                                            if (historyTag.equalsIgnoreCase(currentTag)) {
                                                tagCounterNumber = 1;
                                            } else if (historyTag.length() > currentTag.length()) {
                                                String tagCounter = historyTag.substring(currentTag.length());
                                                if (!tagCounter.isEmpty()) {
                                                    tagCounterNumber = Integer.parseInt(tagCounter) + 1;
                                                }
                                            }
                                        }
                                    }
                                    if (tagCounterNumber > 0)
                                        currentTag += tagCounterNumber;
                                    if (!historyTagToRemove.isEmpty()) {
                                        historyStack.remove(historyTagToRemove);
                                        historyStack.push(currentTag);
                                    }
                                    tagStack.push(currentTag);
                                    break;
                                }
                            }
                        } else { // closing tag: pop the stack if it matches the top
                            currentTag = currentTag.substring(1);
                            if (!tagStack.isEmpty() && tagStack.peek().contains(currentTag)) {
                                historyStack.push(tagStack.pop());
                            }
                            if (tagStack.size() < 2)
                                cik = "";
                            if (tagStack.size() == 2 && !cik.isEmpty()) {
                                // Tag every entry of this filer with its CIK.
                                for (int j = headerTagsCounter - 1; j < tagList.size(); j++) {
                                    String item = tagList.get(j);
                                    if (!item.contains("@@")) {
                                        tagList.set(j, item + "@@" + cik);
                                    }
                                }
                            }
                        }
                    } else { // the current tag has some data
                        currentData = currentData.trim();
                        String stackValue = String.join("||", tagStack);
                        switch (currentTag) {
                            case "ACCESSION-NUMBER":
                                accessionNumber = currentData;
                                break;
                            case "FILING-DATE":
                                dateFiled = currentData;
                                break;
                            case "TYPE":
                                formType = currentData;
                                break;
                            case "CIK":
                                cik = currentData;
                                break;
                        }
                        tagList.add(stackValue + "$$" + currentTag + "::" + currentData);
                    }
                }
            }
        }
        // All the data is now in tagList: the tag stack is separated by ||,
        // the key follows $$, and the value follows ::
        return tagList;
    }
}

Output:

Source of file: http://10k-staging.s3.amazonaws.com/edgar0105/2016/12/20/935015/000119312516799070/0001193125-16-799070.hdr.sgml

Output of code:

SEC-HEADER$$SEC-HEADER::0001193125-16-799070.hdr.sgml : 20161220
SEC-HEADER$$ACCEPTANCE-DATETIME::20161220172458
SEC-HEADER$$ACCESSION-NUMBER::0001193125-16-799070
SEC-HEADER$$TYPE::485APOS
SEC-HEADER$$PUBLIC-DOCUMENT-COUNT::9
SEC-HEADER$$FILING-DATE::20161220
SEC-HEADER$$DATE-OF-FILING-DATE-CHANGE::20161220
SEC-HEADER||FILER||COMPANY-DATA$$CONFORMED-NAME::ARTISAN PARTNERS FUNDS INC@@0000935015
SEC-HEADER||FILER||COMPANY-DATA$$CIK::0000935015@@0000935015
SEC-HEADER||FILER||COMPANY-DATA$$IRS-NUMBER::391811840@@0000935015
SEC-HEADER||FILER||COMPANY-DATA$$STATE-OF-INCORPORATION::WI@@0000935015
SEC-HEADER||FILER||COMPANY-DATA$$FISCAL-YEAR-END::0930@@0000935015
SEC-HEADER||FILER||FILING-VALUES$$FORM-TYPE::485APOS@@0000935015
SEC-HEADER||FILER||FILING-VALUES$$ACT::33@@0000935015
SEC-HEADER||FILER||FILING-VALUES$$FILE-NUMBER::033-88316@@0000935015
SEC-HEADER||FILER||FILING-VALUES$$FILM-NUMBER::162062197@@0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$STREET1::875 EAST WISCONSIN AVE STE 800@@0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$CITY::MILWAUKEE@@0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$STATE::WI@@0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$ZIP::53202@@0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$PHONE::414-390-6100@@0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$STREET1::875 EAST WISCONSIN AVE STE 800@@0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$CITY::MILWAUKEE@@0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$STATE::WI@@0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$ZIP::53202@@0000935015
SEC-HEADER||FILER||FORMER-COMPANY$$FORMER-CONFORMED-NAME::ARTISAN FUNDS INC@@0000935015
SEC-HEADER||FILER||FORMER-COMPANY$$DATE-CHANGED::19950310@@0000935015
SEC-HEADER||FILER||FORMER-COMPANY1$$FORMER-CONFORMED-NAME::ZIEGLER FUNDS INC@@0000935015
SEC-HEADER||FILER||FORMER-COMPANY1$$DATE-CHANGED::19950109@@0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$CONFORMED-NAME::ARTISAN PARTNERS FUNDS INC@@0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$CIK::0000935015@@0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$IRS-NUMBER::391811840@@0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$STATE-OF-INCORPORATION::WI@@0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$FISCAL-YEAR-END::0930@@0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$FORM-TYPE::485APOS@@0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$ACT::40@@0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$FILE-NUMBER::811-08932@@0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$FILM-NUMBER::162062198@@0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$STREET1::875 EAST WISCONSIN AVE STE 800@@0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$CITY::MILWAUKEE@@0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$STATE::WI@@0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$ZIP::53202@@0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$PHONE::414-390-6100@@0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$STREET1::875 EAST WISCONSIN AVE STE 800@@0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$CITY::MILWAUKEE@@0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$STATE::WI@@0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$ZIP::53202@@0000935015
SEC-HEADER||FILER1||FORMER-COMPANY2$$FORMER-CONFORMED-NAME::ARTISAN FUNDS INC@@0000935015
SEC-HEADER||FILER1||FORMER-COMPANY2$$DATE-CHANGED::19950310@@0000935015
SEC-HEADER||FILER1||FORMER-COMPANY3$$FORMER-CONFORMED-NAME::ZIEGLER FUNDS INC@@0000935015
SEC-HEADER||FILER1||FORMER-COMPANY3$$DATE-CHANGED::19950109@@0000935015
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS$$OWNER-CIK::0000935015
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES$$SERIES-ID::S000056665
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES$$SERIES-NAME::Artisan Thematic Fund
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES||CLASS-CONTRACT$$CLASS-CONTRACT-ID::C000179292
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES||CLASS-CONTRACT$$CLASS-CONTRACT-NAME::Investor Shares
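To use these entries downstream, each line has to be split back into its parts. The sketch below is one way to do that, assuming the layout seen in the output above: the tag stack joined with ||, then $$, the key, ::, the value, and an optional trailing @@ followed by the CIK. The class HeaderEntry and its field names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value class for one output entry.
class HeaderEntry {
    final List<String> path; // parent tags, outermost first
    final String key;
    final String value;
    final String cik;        // null when the entry has no @@ suffix

    private HeaderEntry(List<String> path, String key, String value, String cik) {
        this.path = path;
        this.key = key;
        this.value = value;
        this.cik = cik;
    }

    static HeaderEntry parse(String entry) {
        int keyMarker = entry.indexOf("$$");
        if (keyMarker < 0) {
            throw new IllegalArgumentException("missing $$ in: " + entry);
        }
        int valueMarker = entry.indexOf("::", keyMarker + 2);
        if (valueMarker < 0) {
            throw new IllegalArgumentException("missing :: in: " + entry);
        }
        List<String> path = Arrays.asList(entry.substring(0, keyMarker).split("\\|\\|"));
        String key = entry.substring(keyMarker + 2, valueMarker);
        String rest = entry.substring(valueMarker + 2);

        // The CIK suffix, when present, follows the last @@ in the value part.
        int cikMarker = rest.lastIndexOf("@@");
        if (cikMarker < 0) {
            return new HeaderEntry(path, key, rest, null);
        }
        return new HeaderEntry(path, key,
                rest.substring(0, cikMarker), rest.substring(cikMarker + 2));
    }

    public static void main(String[] args) {
        HeaderEntry e = parse("SEC-HEADER||FILER||BUSINESS-ADDRESS$$CITY::MILWAUKEE@@0000935015");
        System.out.println(e.path + " " + e.key + " = " + e.value + " (CIK " + e.cik + ")");
    }
}
```

A value that itself contains @@ would be misread by this sketch, so a structured result type in the parser would be more robust than string separators.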
Shailesh Saxena