I want to use postgres' full text search on the largest public natural-language corpus available. For this example I downloaded this wikimedia dump stub of a few MB; the eventual target is to work with dumps of around 70 GB uncompressed. Here is the xsd. I know there are other open parallel corpora that are easier to work with, but I want to focus on wikimedia here.
This might look like a duplicate, but I would like to investigate a simpler approach than the other proposals I found: postgres mailing list and lo, postgres mailing list and js, here with pg_read_file, here with nodejs, here with splitting, here with splitting + csv...
I would like to preprocess the xml before it enters postgres and stream it in with the COPY command. BaseX can serialize xml to csv/text from the command line with an xpath expression. I already have a stub xpath command working within postgres.
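To illustrate what I mean by command-line serialization, something like this already gives me newline-separated titles from the stub (stub.xml is a placeholder for the downloaded file; the *: wildcard sidesteps the export namespace declared in the xsd):

    basex -i stub.xml -q 'string-join(//*:page/*:title, "&#10;")' -s method=text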
The text element in the XML holds huge text blobs (the wikipedia article contents in wikitext), and those are tricky to put into csv (quotes, double quotes, newlines, plus all the weird wikitext syntax), so I wonder about the output format. I would ideally like a stream, currently thinking of:
basex [-xpath command] | psql -c 'COPY foo FROM stdin (format ??)'
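To be concrete about what I imagine behind that pseudo command line, here is a rough sketch, only tried on the stub: an XQuery file that emits one tab-separated line per page, escaped the way COPY's default text format expects (backslash, tab, newline, carriage return), piped straight into psql. The namespace URI, the file name and the table/column names are placeholders, and I am aware doc() may well load the whole document in memory, which is exactly the point of my question below.

    (: extract.xq : one tab-separated line per page, escaped for COPY (FORMAT text) :)
    declare namespace mw = "http://www.mediawiki.org/xml/export-0.10/";  (: match the xsd version :)
    declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
    declare option output:method "text";
    declare option output:item-separator "&#10;";

    (: escape backslash first, then tab, newline, carriage return :)
    declare function local:esc($s as xs:string) as xs:string {
      replace(replace(replace(replace($s,
        '\\', '\\\\'), '&#9;', '\\t'), '&#10;', '\\n'), '&#13;', '\\r')
    };

    for $page in doc('dump.xml')//mw:page
    return string-join((
      local:esc(string($page/mw:title)),
      local:esc(string($page/mw:revision[1]/mw:text))
    ), '&#9;')

and then:

    basex extract.xq | psql mydb -c 'COPY pages(title, body) FROM stdin (format text)'

The attraction of (format text) over csv is that tabs, newlines and backslashes inside the wikitext get backslash-escaped instead of needing CSV quoting.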
Here is my question: can BaseX process the xml input and emit the transformed output as a stream rather than in one batch? If so, which output format could I use to load it into postgres?
I eventually intend to store the data in the mediawiki postgresql schema (at the bottom of the link), but I will first fiddle with a toy schema with no index, no trigger... The problem of wikitext remains, but that's another story.
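For reference, the toy table behind the COPY sketch above would be nothing more than this (names are placeholders):

    CREATE TABLE pages (
        title text,
        body  text   -- raw wikitext for now, to be indexed with to_tsvector later
    );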