
Is there an easy way to import/export full-text fields as files?

  • one that solves the "load as multiple lines" problem.
    With SQL's COPY I can only transform a full file into a full table, not into a single text field, because COPY treats each input line as a row.

  • one that solves the save-back problem: saving the full XML file to the filesystem without any change in its binary representation (preserving its SHA1), and without other external procedures (such as Unix sed).

The main problem is the export, hence the title of this page.

PS: the "proof of same file" in the round trip — import, export back, and compare with the original — can be obtained with a sha1sum demonstration; see the examples below. A natural demand, then, is also to check the same SHA1 by SQL, to avoid exporting for simple check tasks.
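For reference, the round-trip check is just a comparison of SHA1 digests of raw bytes; a minimal Python sketch of what `sha1sum` computes (the file content and paths here are hypothetical sample data, not from the question):

```python
import hashlib
import os
import tempfile

def sha1_of_file(path):
    """Hex SHA1 of a file's raw bytes, equivalent to sha1sum."""
    with open(path, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest()

content = b'<root>\n  <a>1</a>\n</root>\n'
d = tempfile.mkdtemp()
orig = os.path.join(d, 'test.xml')
back = os.path.join(d, 'test_back.xml')
for p in (orig, back):
    with open(p, 'wb') as f:
        f.write(content)

# A faithful export must reproduce the digest byte for byte.
assert sha1_of_file(orig) == sha1_of_file(back)
```

The whole question reduces to making this assertion hold after a database round trip.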


All examples

  1. Import a full text file into a full table (not what I need),
    and test that it can be exported as the same text.
    PS: I need to import one file into one field of one row.

  2. Transform a full table into one file (not what I need)
    and test that it can be exported as the same text.
    PS: I need one row (of one field) exported into one file.

  3. Calculate the hash by SQL: the SHA1 of the field.
    It must be the same as the file's hash when compared... Otherwise it is not a solution for me.

The following examples show each problem and an inelegant workaround.

1. Import

CREATE TABLE ttmp (x text);
COPY ttmp FROM '/tmp/test.xml' ( FORMAT text ); -- splits the file into one row per line
COPY (SELECT x FROM ttmp) TO '/tmp/test_back.xml' (format TEXT);

Checking that original and "back" have exactly the same content:

sha1sum /tmp/test*.*
  570b13fb01d38e04ebf7ac1f73dfad0e1d02b027  /tmp/test_back.xml
  570b13fb01d38e04ebf7ac1f73dfad0e1d02b027  /tmp/test.xml

PS: this seems perfect, but the problem here is the use of many rows. A real import solution would load the file into one row (and one field). A real export solution would be a SQL function that produces test_back.xml from a single row (of a single field).
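Why the hashes match in example 1 despite the many rows: `COPY TO (FORMAT text)` writes each row followed by a newline, so concatenating the rows reconstructs the original bytes, provided the file contains none of the characters COPY escapes. A sketch of that reasoning with hypothetical sample content:

```python
content = '<root>\n  <a>1</a>\n</root>\n'

# COPY FROM (FORMAT text) makes one row per input line:
rows = content.split('\n')[:-1]   # drop the '' produced by the final newline

# COPY TO writes each row followed by a newline, so the
# concatenation reproduces the original bytes...
exported = ''.join(r + '\n' for r in rows)
assert exported == content

# ...but only because this sample has no tabs or backslashes,
# which COPY would escape as \t and \\ and break the round trip.
```

So example 1 only works by luck of the input, and it still leaves the data scattered over many rows.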

2. Transform full table into one file

Use a second table to store the XML:

CREATE TABLE xtmp (x xml);
INSERT INTO  xtmp (x) 
  SELECT array_to_string(array_agg(x),E'\n')::xml FROM ttmp
;
COPY (select x::text from xtmp) TO '/tmp/test_back2-bad.xml' ( FORMAT text );

... But it does not work, as we can check with sha1sum /tmp/test*.xml: it does not produce the same result for test_back2-bad.xml.

So we also need a translation from the literal \n back to chr(10), using an external tool (perl, sed or any other):
perl -p -e 's/\\n/\n/g' /tmp/test_back2-bad.xml > /tmp/test_back2-good.xml

Ok, now test_back2-good.xml has the same hash ("570b13fb..." in my example) as the original. But the use of Perl is a workaround: how to do it without one?
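The Perl one-liner only turns the two-character sequence backslash-n (COPY's text-format escape for a newline) back into a real newline. The same fix-up sketched in Python, with hypothetical sample data:

```python
import hashlib

original = '<root>\n  <a>1</a>\n</root>'

# What COPY (FORMAT text) wrote: real newlines escaped as literal \n
escaped = original.replace('\\', '\\\\').replace('\n', '\\n')

# The perl -p -e 's/\\n/\n/g' step:
restored = escaped.replace('\\n', '\n')

assert hashlib.sha1(restored.encode()).hexdigest() == \
       hashlib.sha1(original.encode()).hexdigest()

# Caveat: like the perl workaround, this naive unescape is only safe
# when the data contains no other backslash escapes (\t, \\, ...).
```

This makes the fragility explicit: the workaround restores the hash only for data without other escaped characters.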

3. The SHA1 of the field

SELECT encode(digest(x::text::bytea, 'sha1'), 'hex') FROM xtmp;

Not solved: it is not the same hash as the original file (the "570b13fb..." in my example)... Perhaps the ::text cast enforces an internal representation with \n escape symbols, so a solution would be a direct cast to bytea, but that is an invalid cast. The other workaround is not a solution either:

SELECT encode(digest( replace(x::text,'\n',E'\n')::bytea, 'sha1' ), 'hex') 
FROM xtmp

... I also tried CREATE TABLE btmp (x bytea) with COPY btmp FROM '/tmp/test.xml' ( FORMAT binary ), but it fails with an error ("unknown COPY file signature").
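The `FORMAT binary` failure is expected: COPY's binary format is not raw file content but a framed format that must begin with an 11-byte signature (documented in the COPY manual), which an arbitrary XML file will not have. A quick check of that signature:

```python
# The COPY binary-format file header begins with this fixed signature:
PGCOPY_SIGNATURE = b'PGCOPY\n\xff\r\n\x00'
assert len(PGCOPY_SIGNATURE) == 11

# An XML file starts with its own bytes, e.g. b'<?xml' or b'<root',
# so COPY ... FROM ... (FORMAT binary) rejects it with
# "unknown COPY file signature".
xml_head = b'<?xml version="1.0"?>'
assert not xml_head.startswith(PGCOPY_SIGNATURE)
```

So `FORMAT binary` can never ingest a plain file as-is; it only reads files that COPY itself produced.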

Peter Krauss
  • what's the problem with text fields in p.1 - sha1sum matches, all seems fine, doesn't it? – Oleg Kuralenko Jan 24 '18 at 20:36
  • Hi @ffeast, I edited example 1. As I tried to describe (sorry for my English), I need a pure `psql` or SQL per-field solution... And example 3 shows that there is also a problem with the internal representation, when calculating by SQL a SHA1 equivalent to that of the external file. – Peter Krauss Jan 25 '18 at 10:05
  • okay, then why not just use COPY ttmp (x) from '/tmp/test.xml' (FORMAT text); to copy only a single field? – Oleg Kuralenko Jan 25 '18 at 18:49
  • Hi @ffeast, sorry, editing again with the word "row" to avoid confusion. PS: if you have an obvious solution, it is a solution (!); I am not working with this right now, so I may have missed something obvious... About your suggestion, it seems you did not test it: the COPY command does not work, because each `\n` of the external file is interpreted as a new row, and many rows in one field is not a solution (it is only my illustration of a workaround). – Peter Krauss Jan 25 '18 at 19:30
  • okay, now I got the problem. Interesting question! upvoted – Oleg Kuralenko Jan 25 '18 at 20:11

2 Answers


COPY isn't designed for this. It's meant to deal with table-structured data, so it can't work without some way of dividing rows and columns; there will always be some characters which COPY FROM interprets as separators, and for which COPY TO will insert some escape sequence if it finds one in your data. This isn't great if you're looking for a general file I/O facility.

In fact, database servers aren't designed for general file I/O. For one thing, anything which interacts directly with the server's file system will require a superuser role. If at all possible, you should just query the table as usual, and deal with the file I/O on the client side.

That said, there are a few alternatives:

  • The built-in pg_read_file() function, and pg_file_write() from the adminpack module, provide the most direct interface to the file system, but they're both restricted to the cluster's data directory (and I wouldn't recommend storing random user-created files in there).
  • lo_import() and lo_export() are the only built-in functions I know of which deal directly with file I/O and which have unrestricted access to the server's file system (within the constraints imposed by the host OS), but the Large Object interface is not particularly user-friendly....
  • If you install the untrusted variant of a procedural language like Perl (plperlu) or Python (plpythonu), you can write wrapper functions for that language's native I/O routines.
  • There isn't much you can't accomplish via COPY TO PROGRAM if you're determined enough - for one, you could COPY (SELECT 1) TO PROGRAM 'mv <source_file> <target_file>' to work around the limitations of pg_file_write() - though this blurs the line between SQL and external tools somewhat (and whoever inherits your codebase will likely not be impressed...).
Nick Barnes
  • Hi Nick, the I/O with the `pg_*_file()` functions seems perfect (!). How to use them? After `CREATE EXTENSION pgcrypto`, I tried `select pg_catalog.pg_file_write('/tmp/xtmp.xml', x::text, false) FROM xtmp` (see my examples), but got the error "invalid path". If `/tmp` (the free folder for any application!) is invalid, no use is possible, and the solution is not useful. – Peter Krauss Jan 26 '18 at 20:45
  • @Peter: Like I said, that function can only write to your server's `data/` directory, but it looks pretty straightforward to write such a function yourself in PL/Python: https://stackoverflow.com/a/39875621 – Nick Barnes Jan 26 '18 at 23:00
  • Thanks Nick, the bounty goes to you (!), but only the first two items are in the scope of the question, and **they are not working**... And perhaps the answer does not exist, for such a rigid question. It is easy to use `sed`, Perl, Python, etc., but as I explained in the question, the solution must avoid them. A less rigid question could perhaps go in the direction of a *migration package* like [pgloader](https://pgloader.io/). In a more flexible way, opening the question to "any external language", "any external package", etc., there are infinitely many other solutions. – Peter Krauss Jan 27 '18 at 21:48
  • @Peter: Sorry, I thought that by "external tools" you meant tools run outside of the SQL script (like your `perl` example). I figured anything in the standard [contrib modules](https://www.postgresql.org/docs/current/static/contrib.html) was fair game. Note that `pg_file_write()` also requires one of these extensions. – Nick Barnes Jan 27 '18 at 23:40

You can use Python's open(), f.write() and f.close() via plpythonu within a Postgres function to write to a file.

The language extension needs to be installed.

https://www.postgresql.org/docs/8.3/static/plpython.html

Working example from the mailing list. https://www.postgresql.org/message-id/flat/20041106125209.55697.qmail%40web51806.mail.yahoo.com#20041106125209.55697.qmail@web51806.mail.yahoo.com

For example, with plpythonu:

CREATE FUNCTION makefile(p_file text, p_content text) RETURNS text AS $$
  # args[0] = p_file, args[1] = p_content (PL/Python passes parameters in args)
  o = open(args[0], "w")
  o.write(args[1])
  o.close()
  return "ok"
$$ LANGUAGE PLpythonU;

PS: for a safe implementation see this example.
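The function body is ordinary Python file I/O, so it writes the text verbatim, with no COPY escaping; that is why it preserves the hash. A sketch of the same logic in plain Python outside the database (paths and content are hypothetical; assumes a Unix server, where "w" mode does not translate newlines):

```python
import hashlib
import os
import tempfile

def makefile(p_file, p_content):
    """Plain-Python equivalent of the PL/Python function body."""
    o = open(p_file, "w")
    o.write(p_content)
    o.close()
    return "ok"

content = '<root>\n  <a>1</a>\n</root>\n'
path = os.path.join(tempfile.mkdtemp(), 'test3-good.xml')
assert makefile(path, content) == "ok"

# Written verbatim: the file's SHA1 equals the SHA1 of the field value.
with open(path, 'rb') as f:
    assert hashlib.sha1(f.read()).hexdigest() == \
           hashlib.sha1(content.encode()).hexdigest()
```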


Preparing

There is a not-so-obvious procedure for using the PL/Python extension. Supposing an Ubuntu server:

  1. In SQL, check SELECT version().
  2. In the terminal, check the versions listed by sudo apt install postgresql-plpython.
  3. Install the correct version, e.g. sudo apt install postgresql-plpython-9.6.
  4. Back in SQL, do CREATE EXTENSION plpythonu.

Testing

The /tmp folder is the default; to create or use another folder, e.g. /tmp/sandbox,
use sudo chown postgres:postgres /tmp/sandbox.

Suppose the tables of the question's examples. The SQL script, repeating some lines:

  DROP TABLE IF EXISTS ttmp; 
  DROP TABLE IF EXISTS xtmp; 

  CREATE TABLE ttmp (x text);
  COPY ttmp FROM '/tmp/sandbox/original.xml' ( FORMAT text );
  COPY (SELECT x FROM ttmp) TO '/tmp/sandbox/test1-good.xml' (format TEXT);

  CREATE TABLE xtmp (x xml);
  INSERT INTO  xtmp (x) 
     SELECT array_to_string(array_agg(x),E'\n')::xml FROM ttmp
  ;

  COPY (select x::text from xtmp) 
  TO '/tmp/sandbox/test2-bad.xml' ( FORMAT text );

  SELECT makefile('/tmp/sandbox/test3-good.xml',x::text) FROM xtmp;

The sha1sum *.xml output for my original XML file:

4947..  original.xml
4947..  test1-good.xml
949f..  test2-bad.xml
4947..  test3-good.xml
VynlJunkie
  • Thanks Mokadillion (!)... I relaxed the bounty of the question to a "not so strict" interpretation (see my comments at [Nick's answer](https://stackoverflow.com/a/48456116/287948))... But sorry, I can't use your answer for the bounty; it **ended 16 hours ago**. Thanks for the links: since 2004 or before this was a "solution" for PostgreSQL users (and nowadays it is easy to `CREATE EXTENSION plpythonu`). If you allow, I will edit your answer to complement it with an example (compatible with my question's examples). – Peter Krauss Jan 28 '18 at 11:32
  • @Peter: The first `COPY` will fail if your file contains any tabs, and the checksums won't match if it contains any backslashes. You might also mangle the line breaks, depending on your platform. You really need a Python function to read the file as well. – Nick Barnes Feb 08 '18 at 08:36
  • Hi @NickBarnes, all the workarounds exist because, as I explained, **PostgreSQL's COPY is an ugly command**: there is no option to save a string as it is, without backslashes (or to read a full file into only one row). The Python, Perl, `sed` or any other external workaround is there only for that, to do the trivial task of saving without backslashes. All the tests are only a way to demonstrate how ugly it is, and to compare, as proof that Python (or another tool) is a **reliable** process. – Peter Krauss Feb 08 '18 at 17:19