
Overview:

1) I have a .TSV [tab-separated values, flat file] dump that I need to shunt into an existing Postgres table [not a table of my design, but I do have direct db access]. I intend to use psql \copy to accomplish this;

2) The table in question has one column defined as a one-dimensional array [historical reasons over which I have no control; I know that denormalized columns are far from optimal in most cases];

3) In the TSV, the fifth column contains the values that are destined for the Postgres array column; these values are separated by commas

4) In order to get these values into the Postgres array column, I believe I need to wrap them with the following pattern: '{}' -- such that the values foo,bar become '{foo,bar}'.

My guess is that the best way to go about this is with a regex, but my skills in that area are currently very weak [working on it -- I have the owl book and the regex cookbook!]. I’ve read several related SO questions/answers and skimmed RexEgg and regex101, but I can’t find anything there or elsewhere that describes a similar-enough situation for me to reuse the solutions/approaches therein.

There are a few particular aspects of this problem that I am stuck on:

a) Since the data in each column is of variable length [i.e., each value could be any length], I can’t figure out how to identify the fifth column [in order to act on it];

b) Similarly, I need to prepend ‘{ to the beginning of the column and append }’ to the end of it, but since the data within the column is variable in content and length, I can’t figure out how to handle that -- e.g., I can’t tell the regex to look for any particular character or length to trigger the addition of the braces/quotes in the correct position [the rough sed sketch just below shows the shape of what I’m after].
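
This is as far as I’ve been able to sketch it in GNU sed -- untested guesswork on my part, with dump.tsv and processed.tsv as placeholder names; the {4} is meant to skip over the first four tab-delimited fields:

$ sed -E 's/^(([^\t]*\t){4})([^\t]*)/\1{\3}/' dump.tsv > processed.tsv

The idea is that the first capture group swallows fields one through four [each being “anything but a tab, then a tab”], so the third group is the fifth field, which gets re-emitted wrapped in braces -- but I have no idea whether this is sound or how it behaves on edge cases.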

Below is an example row from the .tsv file; the fifth column with the comma separating the two values is the one I’m trying to act on.

1234    e@mail.addy 43210   0123456789  foo_value,bar_value 107.00 0.00 timestamp_1 timestamp_2 54321   string_2    string_3    timestamp_3 98765   12345   US  Hawaii  string_4    string_5    string_6    string_7    string_8    false true  false

When the regex [or whatever transformation works] is done with it, it should [I think] look like this:

1234    e@mail.addy 43210   0123456789  '{foo_value,bar_value}' 107.00 0.00 timestamp_1 timestamp_2 54321   string_2    string_3    timestamp_3 98765   12345   US  Hawaii  string_4    string_5    string_6    string_7    string_8    false true  false

so that the Postgres array column will recognize the input as a valid array.
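
For reference, a quick check at the SQL level [mydb is a placeholder name; this only reflects my current understanding of the array-literal syntax] shows that Postgres parses that pattern as a one-dimensional text array:

$ psql -d mydb -c "SELECT '{foo_value,bar_value}'::text[];"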

Here’s the definition of the Postgres table:

CREATE TABLE postgres_table (
  col1 SERIAL PRIMARY KEY,
  col2 TEXT,
  col3 TEXT, 
  col4 TEXT,
  col5 TEXT[], /*this is the array column*/
  col6 NUMERIC(19,2) NOT NULL,
  col7 NUMERIC(19,2),
  col8 TIMESTAMP WITHOUT TIME ZONE,
  col9 TIMESTAMP WITHOUT TIME ZONE,
  col10 TEXT,
  col11 TEXT,
  col12 TEXT,
  col13 TIMESTAMP WITHOUT TIME ZONE DEFAULT now(),
  col14 TEXT,
  col15 TEXT,
  col16 TEXT,
  col17 TEXT,
  col18 TEXT,
  col19 TEXT,
  col20 TEXT,
  col21 TEXT,
  col22 TEXT,
  col23 BOOLEAN,
  col24 BOOLEAN,
  col25 BOOLEAN
);

Additional notes: the db is Postgres 11; all systems involved are RHEL-based; I know a bit of JavaScript and SQL, but my attempts thus far have been in bash; and I’ve successfully used \copy for many .csv-type dumps in the past [but never to a denormalized table]. I thought the following SO question would get me most of the way there, but I get the error “substitution failed” when I try to adapt the solution to my situation: How can i capture all data from a certain column?

To clarify: I’m eager to use whatever approach works, so if there is a better way to go about this than regex, I’m all ears! All guidance is much appreciated.


EDITS:

Much thanks to @James Brown and @jjanes -- I don't have enough 'reputation points' yet to upvote your answers, but will do for sure as soon as I can.

The awk solution from @James Brown below worked for me -- the processed file contained a trailing newline that I had to zap with the awk script from this unix.SE question: https://unix.stackexchange.com/questions/140727/how-can-i-delete-a-trailing-newline-in-bash
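
For the record -- paraphrasing rather than quoting that thread exactly -- an awk one-liner along these lines does the job: it prints every line but the last with its newline, then prints the last line without one [trimmed_data is a placeholder name]:

$ awk 'NR > 1 { print prev } { prev = $0 } END { printf "%s", prev }' processed_data > trimmed_data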

I'm stoked to have an impetus to learn some more awk; what a great tool.

UPDATE: I tried @jjanes's solution below, but ran into the following problemo at the \copy step:

ERROR: permission denied for schema pg_temp_5

I'm still working on getting it going for my own edification -- still running into permissions issues after the GRANT TEMP approach -- but I suspect this has something to do with the way RDS [where the db is hosted] handles remote psql requests. Will update again when I have it working.

here’s the output of \z pg_temp_5.*:

                 Access privileges
  Schema   |  Name  | Type  |  Access privileges  | Column access privileges
-----------+--------+-------+---------------------+--------------------------
 pg_temp_5 | foobar | table | user1=arwdDxt/user1 |

I’ve tried:

GRANT USAGE ON SCHEMA pg_temp_5 TO user1;
GRANT ALL ON SCHEMA pg_temp_5 TO user1;
ALTER TABLE pg_temp_5.foobar OWNER TO user1;

as suggested in the following SF question, but no dice: https://serverfault.com/questions/488669/postgres-insert-error-permission-denied-for-schema-public

I can’t see any reason why this shouldn’t work, a real head-scratcher.
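
One hunch I still need to rule out [my own speculation, not something from the answers]: Postgres temp schemas are per-session, so the pg_temp_5 I inspected may not be the temp schema the failing \copy session actually owns. Running this inside the same psql session that created the temp table should show which pg_temp_N that session is really using:

database=# SELECT pg_my_temp_schema()::regnamespace;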

Mister October

2 Answers


Load into a temp table as text, then convert it.

create temp table foobar (like postgres_table);
alter table foobar alter col5 type text;
\copy foobar from whatever.txt
insert into postgres_table
  select col1, col2, col3, col4, ('{'||col5||'}')::text[], col6, col7, col8, col9, col10, col11, col12,
         col13, col14, col15, col16, col17, col18, col19, col20, col21, col22, col23, col24, col25
  from foobar;

This won't work if some kinds of punctuation occur in "col5".
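
A possible alternative [not from the original answer]: string_to_array(col5, ',') splits on the commas directly and never touches the array-literal parser, so braces and quotes pass through safely -- assuming the individual values never contain commas themselves. In the insert above, the cast expression would simply become string_to_array(col5, ','). A quick demonstration with a brace in the data:

database=# SELECT string_to_array('foo_value,bar}value', ',');
     string_to_array     
-------------------------
 {foo_value,"bar}value"}
(1 row)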

jjanes
  • This looks like an elegant solution, but for some reason Postgres is denying access to the temp schema on \copy -- see the edit to the original question above. – Mister October Oct 08 '19 at 17:12

Using awk for this. Your data, with a twist added in the fifth field (I left the timestamp_[123]s there for you but fixed the missing tabs):

$ cat data
1234    e@mail.addy 43210   0123456789  foo_value,bar_value 107.00  0.00    timestamp_1 timestamp_2 54321   string_2    string_3    timestamp_3 98765   12345   US  Hawaii  string_4    string_5    string_6    string_7    string_8    false   true    false
12345   e@mail.addy 43210   0123456789  foo_value}bar_value 107.00  0.00    timestamp_1 timestamp_2 54321   string_2    string_3    timestamp_3 98765   12345   US  Hawaii  string_4    string_5    string_6    string_7    string_8    false   true    false

Add curly brackets with awk, escaping any pre-existing ones (see the second record's fifth field above):

$ awk '
BEGIN {
    FS=OFS="\t"              # set input and output delimiters to a tab
}
NR==1 {                      # first record in file
    nf=NF                    # store field count
}
NF==nf {                     # process only records with the same field count as the first record
    gsub(/\{/,"\\{",$5)      # escape left curly brackets with a \
    gsub(/\}/,"\\}",$5)      # escape right curly brackets with a \
    $5="{" $5 "}"            # surround the fifth with curly brackets
    print                    # output
}' data > processed_data     # redirect output to another file

If you don't escape them, you'll get:

psql:bar.sql:1: ERROR:  malformed array literal: "{foo_value,bar}value}"
DETAIL:  Junk after closing right brace.
CONTEXT:  COPY postgres_table, line 2, column col5: "{foo_value,bar}value}"

Output:

$ cat processed_data
1234    e@mail.addy 43210   0123456789  {foo_value,bar_value}...
12345   e@mail.addy 43210   0123456789  {foo_value,bar\}value}...

\COPY script:

$ cat copy.sql
\COPY postgres_table(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25) FROM 'processed_data' CSV DELIMITER E'\t';

Execute:

$ psql -h host -U user -f copy.sql database
Timing is on.
COPY 2
Time: 5.966 ms

Querying col5:

database=# select col5 from postgres_table;
          col5           
-------------------------
 {foo_value,bar_value}
 {foo_value,"bar}value"}
(2 rows)

Time: 3.388 ms
James Brown
  • If you need to do more data processing before loading into Postgres, look into awk (GNU awk, especially). Awk is the language for preprocessing text, and GNU awk has an extension for connecting to Postgres. It changed my way of doing ETL. – James Brown Oct 08 '19 at 13:12
  • This worked a treat, big thanks! I had actually bought "The Awk Programming Language" last year but hadn't got to it yet -- clearly, it's an essential tool! p.s. the file [whether due to awk's processing or some other factor, I don't know] ended up with a trailing newline that had me beating my head against the wall for a while; however, the awk script from this unix.SE question took care of that and all was well: https://unix.stackexchange.com/questions/140727/how-can-i-delete-a-trailing-newline-in-bash – Mister October Oct 08 '19 at 23:54
  • @MisterOctober Hmm, are you saying my script produced a trailing newline (meaning an empty line) which was not in the original data? Updated the awk to remove lines which have differing field count from the first line. – James Brown Oct 09 '19 at 08:28