
I want to extract information from a text file (almost 1 GB) and store it in a PostgreSQL database. The text file is in the following format:

DEBUG, 2017-03-23T10:02:27+00:00, ghtorrent-40 -- ghtorrent.rb:Repo EFForg/https-everywhere exists
DEBUG, 2017-03-24T12:06:23+00:00, ghtorrent-49 -- ghtorrent.rb:Repo Shikanime/print exists
...

and I want to extract 'DEBUG', the timestamp, 'ghtorrent-40', 'ghtorrent' and "Repo EFForg/https-everywhere exists" from each line and store them in the database.

I have done this using other languages like Python (psycopg2) and C++ (libpqxx), but is it possible to write a function in PostgreSQL itself to import the whole file?

I am currently using the pgAdmin4 tool for PostgreSQL. I am thinking of using something like pg_read_file in a function to read the file one line at a time and insert each line into the table.
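Something like this sketch is what I have in mind, assuming a one-column staging table tmp(raw text) (the path is a placeholder; pg_read_file needs the file on the database server plus superuser or pg_read_server_files rights, and it reads whole chunks by byte offset rather than line by line, so the file has to fit in a single text value, which PostgreSQL caps at 1 GB):

-- read the whole file at once, then split it into one row per line
INSERT INTO tmp (raw)
SELECT unnest(string_to_array(pg_read_file('/path/to/file.txt'), E'\n'));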

  • please check https://www.pgadmin.org/docs/pgadmin4/development/import_export_data.html – dassum Jan 04 '20 at 05:26
  • @dassum I don't think this will work, as it only allows me to split the data fields using a single delimiter, while here I have to split using ",", "--", ".rb: " and also remove trailing spaces and reformat the timestamp string for the 'timestamp with time zone' format. – Sagar Mahour Jan 04 '20 at 05:42
  • I would first load the file into a staging table with a single column that stores the entire line using `copy` - but that requires that you can upload the input file to the database server. Otherwise you can use `\copy` in `psql` (I don't know if pgAdmin offers something similar). Once it's in the database, you can use e.g. `string_to_array()` or `split_part()` to break up the line into individual fields and insert the result of that into the final table. –  Jan 04 '20 at 08:39

1 Answer


An approach I use with my large XML files - 130GB or bigger - is to upload the whole file into a temporary unlogged table and extract the content I want from there. Unlogged tables are not crash-safe, but they are much faster than logged ones, which totally suits the purpose of a temporary table ;-)

Considering the following table ..

CREATE UNLOGGED TABLE tmp (raw TEXT);

.. you can import this 1GB file using a single psql line from your console (unix)..

$ cat 1gb_file.txt | psql -d db -c "COPY tmp FROM STDIN" 
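The same upload can also be done with psql's client-side \copy (mentioned in the comments), which reads the file from your machine without the cat pipe:

$ psql -d db -c "\copy tmp FROM '1gb_file.txt'"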

After that all you need is to apply your logic to query and extract the information you want. Depending on the size of your table, you can create a second table from a SELECT, e.g.:

CREATE TABLE t AS
SELECT 
  trim((string_to_array(raw,','))[1]) AS operation,
  trim((string_to_array(raw,','))[2])::timestamp AS tmst,
  trim((string_to_array(raw,','))[3]) AS txt
FROM tmp
WHERE raw LIKE '%DEBUG%' AND
      raw LIKE '%ghtorrent-40%' AND 
      raw LIKE '%Repo EFForg/https-everywhere exists%';

Adjust the string_to_array function and the WHERE clause to your logic! Optionally you can replace these multiple LIKE operations with a single SIMILAR TO.
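For instance, the three LIKE filters above could collapse into raw SIMILAR TO '%DEBUG%ghtorrent-40%Repo EFForg/https-everywhere exists%'.

For the exact format in the question, a fuller split with split_part() could look like the sketch below (table and column names are placeholders; it assumes '--' and '.rb:' each occur exactly once per line and that the part before '--' contains exactly two commas):

CREATE TABLE log_entries AS
SELECT
  split_part(raw, ',', 1)                               AS level,    -- 'DEBUG'
  trim(split_part(raw, ',', 2))::timestamptz            AS tmst,     -- '2017-03-23T10:02:27+00:00'
  trim(split_part(split_part(raw, '--', 1), ',', 3))    AS source,   -- 'ghtorrent-40'
  trim(split_part(split_part(raw, '--', 2), '.rb:', 1)) AS program,  -- 'ghtorrent'
  split_part(split_part(raw, '--', 2), '.rb:', 2)       AS message   -- 'Repo EFForg/https-everywhere exists'
FROM tmp;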

.. and your data would be ready to be played with:

SELECT * FROM t;

 operation |        tmst         |                               txt                                
-----------+---------------------+------------------------------------------------------------------
 DEBUG     | 2017-03-23 10:02:27 | ghtorrent-40 -- ghtorrent.rb:Repo EFForg/https-everywhere exists
(1 row)

Once your data is extracted, you can DROP TABLE tmp; to free some disk space ;)

Further reading: COPY, PostgreSQL array functions and pattern matching

Jim Jones