
I have millions of rows of GPS data to copy into a PostgreSQL table. The rows look like this:

20110228,235958,QS,15287,121.533272,31.180235,42.3,163,0,1,2011-03-01 00:00:00;
20110228,235958,QS,14715,121.430813,31.199277,36.2,54,1,1,2011-03-01 00:00:00;
20110228,235958,QS,15728,121.400303,31.259683,33,222,0,1,2011-03-01 00:00:00;
20110228,235958,QS,17857,121.525747,31.20636,6.1,152,0,1,2011-03-01 00:00:00;
20110228,235959,QS,10308,121.376895,31.116863,0,325,0,1,2011-03-01 00:00:00;
...

I first create a table as follows (inside a PL/pgSQL function, hence the EXECUTE):

EXECUTE 'CREATE TABLE ' ||taxiDataTableName|| ' (
            serial_id SERIAL NOT NULL PRIMARY KEY,
            rec_date INT,           -- record date
            rec_time INT,           -- record time 
            company CHAR(5),        -- taxi company name
            taxi_id SMALLINT,       --  taxi ID
            long double precision,  -- longitude, E
            lat double precision,   -- latitude, N
            speed REAL,             -- taxi speed, km/h
            bearing SMALLINT,       -- direction of speed
            status SMALLINT,        -- 0: empty, 1:loaded
            infoNo SMALLINT,        -- infoNo before time stamp information
            info CHAR(25)           -- time stamp information
        )';

I then run the following command to import the data:

EXECUTE 'COPY '||taxiDataTableName||' (rec_date,rec_time,company,taxi_id,long,lat,speed,bearing,status,infoNo,info) 
            FROM '''||datadir||datafilename||''' (DELIMITER('',''))';

Everything works fine until row 730745:

ERROR: value "00155301001555" is out of range for type integer
SQL state: 22003
Context: COPY _20110301_rawdata, line 730745, column rec_time: "00155301001555"
SQL statement "COPY _20110301_rawdata (rec_date,rec_time,company,taxi_id,long,lat,speed,bearing,status,infoNo,info) 
            FROM 'G:/PostgreSQL_Data_Repo/Taxi_GPS_Data/201103/20110301.txt' (DELIMITER(','))"
PL/pgSQL function load_data_to_table(integer[]) line 68 at EXECUTE

Having checked the data, row 730745 is:

20110301,00155301001555,QS,14834,121.684833,31.040167,0,177,0,1,2011-03-01 00:15:55;

It's obvious that the record time 00155301001555 (2nd column) is erroneous. I could change the type of rec_time to BIGINT, import the data regardless of the erroneous values, and filter the outliers afterwards, but that would make the database file larger. So, can PostgreSQL set an out-of-range value (one that violates the data type specified at table creation) to NULL instead?

Guoyang Qin
  • @VaoTsun row 730744 is `20110301,001553,QS,18674,121.528057,31.271927,0,232,0,1,2011-03-01`; its second column indicates the time 00:15:53. – Guoyang Qin May 02 '17 at 13:06
  • I'm not saying it's impossible, but I don't think you can do this in pure PostgreSQL. See this answer: http://stackoverflow.com/a/36635312/2232127 A more viable option would be to run it through a script first which filters out the offending data. – JensV May 02 '17 at 13:07
  • @VaoTsun as for `info CHAR(25)`, that is a field I do not want, so I take it just as a string and will drop the column after the `copy` finishes. – Guoyang Qin May 02 '17 at 13:07
  • You can't manipulate rows while copying. Moving int to bigint for one column won't dramatically change the size (you might need to VACUUM FULL or rebuild it, though). Don't mind my previous questions - I missed that you were listing columns on `COPY`. – Vao Tsun May 02 '17 at 13:26
  • @JensV: thanks for your link; now I know someone proposed a new feature, 'Error logging in COPY' (https://wiki.postgresql.org/wiki/Error_logging_in_COPY), but unfortunately 'it was submitted and reviewed but not accepted into the core product for that or any other version so far.' I think I will write a cleaning script first, then. – Guoyang Qin May 02 '17 at 13:41
  • @VaoTsun I just tried to import the data using MATLAB with the format specified as `'%d %d %s %d %f %f %f %d %d %d %s'`. Unsurprisingly, it reset the out-of-range `00155301001555` to `2147483647`. How I wish PostgreSQL would do the same, but so far it will not. I will follow your suggestion and do the data cleaning. – Guoyang Qin May 02 '17 at 13:45
  • You can mock up what MATLAB did! Just copy everything into a temp table with text fields and then `insert to real one select [logic] from temp` (a sketch of this approach follows after these comments). – Vao Tsun May 02 '17 at 13:54
  • @VaoTsun Yep, creating a temp table is always the final solution. I am doing so as you suggest; I hope it will not lengthen the run time too much. – Guoyang Qin May 02 '17 at 14:01

1 Answer


Use file_fdw.

Example (the file contains just two rows with two columns):

create extension if not exists file_fdw;

create server csv_server foreign data wrapper file_fdw;

create foreign table my_file (
    rec_date int,
    rec_time bigint    -- wide enough to hold the erroneous 14-digit values
) server csv_server
options (filename '/data/my_file.csv', format 'csv');

select *
from my_file;

 rec_date |   rec_time   
----------+--------------
 20110228 |       235958
 20110301 | 155301001555
(2 rows)

Convert the column to int:

select 
    rec_date, 
    case when rec_time > x'7fffffff'::int then null else rec_time::int end
from my_file;

 rec_date | rec_time 
----------+----------
 20110228 |   235958
 20110301 |         
(2 rows)
klin
  • Looks cool. Is this foreign-table solution equivalent to creating a temp table and selecting with conditions? – Guoyang Qin May 02 '17 at 16:12
  • Roughly speaking, the foreign table uses a temporary table and COPY. There might be some optimisations; I don't know the implementation details. – klin May 02 '17 at 16:24