
I have an XML file of about 10 GB. I don't know the content of the file, but I would like to import it into my database to make it easier to view.

How can I import an XML file into my PostgreSQL database? (Is this even possible with such a large file?)

I hope you guys can help me out :)

nick bijmoer
  • I'm just wondering about your objectives here. If you don't know what's in the file, then you're presumably just dumping it in the database unchanged. Is it really going to be easier to make any sense of it once it's in the database? What I have done in the past to try and make sense of such files is first to generate a schema, and from that to write a stylesheet to extract parts of the file that I can then study to gradually build an understanding of the content. – Michael Kay Jan 09 '19 at 14:54
  • I thought that maybe there is a tool that imports the XML file into the database, so that all the columns and data would be automatically inserted into one table. The file is too big to read in Excel, so I thought that if I can import it into my database, I can query a subset of the data. – nick bijmoer Jan 09 '19 at 22:06
  • I would start by generating a schema. There are lots of tools for generating a schema from an instance. I wrote one myself (DTDGenerator at saxon.sf.net) which is now ancient and probably not the best, but I do know that it's fully streamable and can handle very large inputs. (Also, peeking at the start of the file using "more" on the command line isn't a bad idea.) – Michael Kay Jan 09 '19 at 22:39

1 Answer

  1. Convert the XML file into a CSV file. While converting, split it into 100 MB to 1 GB parts for easier batching.

  2. Create a table with the columns you defined in the CSV file.

  3. Load the file(s) into Postgres with the COPY command. It is the fastest way I know of to import a large amount of data. BTW, it can also be done from Java, with the CopyManager class. A sketch of all three steps follows this list.
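Here is a minimal sketch of the three steps in Python, assuming (hypothetically) that the file consists of repeating `<record>` elements with flat child fields; the column names, file paths, table name, and psycopg2 connection string are all placeholders, and the splitting into 100 MB to 1 GB parts is omitted for brevity:

```python
import csv
import xml.etree.ElementTree as ET

import psycopg2  # assumed driver; any client that can issue COPY works

COLUMNS = ["id", "name", "value"]  # hypothetical column names

# Step 1: stream the 10 GB file with iterparse so it never sits in memory.
with open("data.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for event, elem in ET.iterparse("big.xml", events=("end",)):
        if elem.tag == "record":  # hypothetical repeating element
            writer.writerow([elem.findtext(col, default="") for col in COLUMNS])
            elem.clear()  # release the parsed element to keep memory flat

# Steps 2 and 3: create the table, then bulk-load it with COPY.
conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS records (id text, name text, value text)")
    with open("data.csv") as f:
        cur.copy_expert("COPY records FROM STDIN WITH (FORMAT csv, HEADER)", f)
conn.close()
```

Declaring every column as text is deliberate here: since the content of the file is unknown, loading everything as text first and refining the types later avoids COPY failing halfway through a 10 GB load.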

Depending on the kind of queries you will run, you will likely want to create indexes:

  1. This will be the most time-consuming part. However, you can use CREATE INDEX CONCURRENTLY, which lets you keep working with the table while the index is built in the background.

  2. If you repeat the import process and already have the table and indexes created, drop the indexes before issuing the COPY command and recreate them afterwards. That will save you a lot of time (see the sketch after this list).

  3. If you are still unhappy with the speed of your queries or of index creation, it may be a good idea to use ClickHouse instead. However, that depends on what kind of queries you run.
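For points 1 and 2, here is a sketch of the drop/reload/recreate pattern, reusing the hypothetical records table and data.csv from the sketch above (the index name is also made up). Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, hence the autocommit connection:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
conn.autocommit = True  # CONCURRENTLY must run outside a transaction
cur = conn.cursor()

# Drop the index first so COPY doesn't pay for index maintenance per row.
cur.execute("DROP INDEX IF EXISTS records_name_idx")

with open("data.csv") as f:
    cur.copy_expert("COPY records FROM STDIN WITH (FORMAT csv, HEADER)", f)

# Rebuild in the background; queries can keep using the table meanwhile.
cur.execute("CREATE INDEX CONCURRENTLY records_name_idx ON records (name)")

cur.close()
conn.close()
```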

Nikita Tukkel