The Chunker Training section of the official OpenNLP manual points to the raw data used to train the English (EN) model files:
The training data can be converted to the OpenNLP chunker training format, that is based on CoNLL2000.
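To make the format concrete: CoNLL2000-style chunker data has one token per line with three whitespace-separated columns (word, POS tag, chunk tag), and an empty line between sentences. The following is only a minimal sketch for inspecting such a file. It does not use OpenNLP itself, and the sample text and the `read_sentences` helper are made up for illustration.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, chunk tag)

# Illustrative sample in CoNLL2000 style, not taken from the real corpus files.
SAMPLE = """\
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
deficit NN I-NP
. . O

Rockwell NNP B-NP
said VBD B-VP
. . O
"""


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group token lines into sentences; an empty line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        line = line.strip()
        if not line:
            # Sentence boundary: store the tokens collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk = line.split()
        current.append((word, pos, chunk))
    if current:
        # The last sentence may not be followed by an empty line.
        sentences.append(current)
    return sentences


sentences = read_sentences(SAMPLE.splitlines())
for sentence in sentences:
    print(" ".join(f"{w}/{p}/{c}" for w, p, c in sentence))
```

For a real file you would pass an open file handle instead of `SAMPLE.splitlines()`.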
The manual also references other external resources used in or for OpenNLP, e.g., in Chapter 12. Corpora.
Additionally, the CoNLL2003 corpus might be of interest:
The English data is the Reuters Corpus, which is a collection of news wire articles. The Reuters Corpus can be obtained free of charges from the NIST for research purposes: http://trec.nist.gov/data/reuters/reuters.html
Hope it helps.