
I have a 30 GB tab-separated text file with more than 100 million rows. When I import this text file into a PostgreSQL table using the \copy command, some rows cause errors. How can I skip those rows while importing, and also keep a record of the rows that were skipped?

I connect to the machine over SSH, so I cannot use pgAdmin.

It is very hard to edit the text file before importing, because so many different rows have different problems. If there were a way to check the rows one by one before importing and then run the \copy command for individual rows, that would be helpful.

Below is the code that creates the table:

CREATE TABLE Papers(
    Paper_ID CHARACTER(8) PRIMARY KEY,
    Original_paper_title TEXT,
    Normalized_paper_title TEXT,
    Paper_publish_year INTEGER, 
    Paper_publish_date DATE,
    Paper_Document_Object_Identifier TEXT,
    Original_venue_name TEXT,
    Normalized_venue_name TEXT,
    Journal_ID_mapped_to_venue_name CHARACTER(8),
    Conference_ID_mapped_to_venue_name CHARACTER(8),
    Paper_rank BIGINT,
    FOREIGN KEY(Journal_ID_mapped_to_venue_name) REFERENCES Journals(Journal_ID),
    FOREIGN KEY(Conference_ID_mapped_to_venue_name) REFERENCES Conferences(Conference_ID));
Ramin Zahedi
  • Possible duplicate of [How to ignore errors with psql \copy meta-command](http://stackoverflow.com/questions/36634360/how-to-ignore-errors-with-psql-copy-meta-command) – dmfay Dec 26 '16 at 18:27
  • @dmfay It's very hard to edit the text file before importing because so many different rows have different problems. If there were a way to check the rows one by one before importing and then run the \copy command for individual rows, it would be helpful. – Ramin Zahedi Dec 28 '16 at 05:49

4 Answers


Don't load directly into your destination table; load into a single-column staging table instead.

create table Papers_stg (rec text);

Once you have all the data loaded you can then run verifications on the data using SQL.

Find records with the wrong number of fields:

select  rec
from    Papers_stg
where   cardinality(string_to_array(rec, E'\t')) <> 11;

Create a table with all text fields:

create table Papers_fields_text
as
select  fields[1]  as Paper_ID                          
       ,fields[2]  as Original_paper_title              
       ,fields[3]  as Normalized_paper_title            
       ,fields[4]  as Paper_publish_year                
       ,fields[5]  as Paper_publish_date                
       ,fields[6]  as Paper_Document_Object_Identifier  
       ,fields[7]  as Original_venue_name               
       ,fields[8]  as Normalized_venue_name             
       ,fields[9]  as Journal_ID_mapped_to_venue_name   
       ,fields[10] as Conference_ID_mapped_to_venue_name
       ,fields[11] as Paper_rank                        

from   (select  string_to_array(rec, E'\t')  as fields
        from    Papers_stg
        ) t
where   cardinality(fields) = 11;

For field conversion checks you might want to use the concept described here.
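
One way to apply that idea, as a minimal sketch (the validity rules below are my assumptions, not part of the answer): insert only the rows whose numeric and date fields parse cleanly, so whatever remains in Papers_fields_text but not in Papers is your record of rejected rows. Rows violating the primary key or the foreign keys to Journals/Conferences would still abort the statement, so those conditions have to be checked as well.

-- move cleanly-parsing rows into the final table; empty strings become NULLs
insert into Papers
select  Paper_ID
       ,Original_paper_title
       ,Normalized_paper_title
       ,nullif(Paper_publish_year,'')::integer
       ,nullif(Paper_publish_date,'')::date
       ,Paper_Document_Object_Identifier
       ,Original_venue_name
       ,Normalized_venue_name
       ,nullif(Journal_ID_mapped_to_venue_name,'')
       ,nullif(Conference_ID_mapped_to_venue_name,'')
       ,nullif(Paper_rank,'')::bigint
from    Papers_fields_text
where   (Paper_publish_year = '' or Paper_publish_year ~ '^\d+$')
and     (Paper_publish_date = '' or Paper_publish_date ~ '^\d{4}[-/]\d{1,2}[-/]\d{1,2}$')
and     (Paper_rank = '' or Paper_rank ~ '^\d+$');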

David דודו Markovitz
  • How can I import data from my 11 column tab separated text file to a single column table? – Ramin Zahedi Dec 29 '16 at 06:19
  • Use a character that does not exist in your text file as a delimiter – David דודו Markovitz Dec 29 '16 at 06:35
  • It's a 30 GB text file consisting of many languages; any character can occur. Isn't there a more reliable way to import from an 11-column file into a 1-column table? – Ramin Zahedi Dec 29 '16 at 08:25
  • can '\n' be the delimiter? – Ramin Zahedi Dec 29 '16 at 08:29
  • You cannot use '\n' but you can use control characters such as SUBSTITUTE (hex = 1A), e.g. `\copy t from c:\Temp\t.txt with delimiter E'\x1A'` – David דודו Markovitz Dec 29 '16 at 10:43
  • You are right about the delimiter. Unfortunately, I ran into another problem, which I asked about here: http://stackoverflow.com/questions/41379067/how-to-import-from-a-mixed-encoding-file-to-a-postgresql-table Would you mind taking a look? – Ramin Zahedi Dec 29 '16 at 11:58
  • I ran into a problem: where cardinality(string_to_array(rec,' ')) <> 11 is true for all the rows, so no row makes it into Papers_fields_text! Are you sure about string_to_array(rec,' ')? Note that to import all data into the one-column staging table I used exactly your answer to this question: http://stackoverflow.com/questions/41379067/how-to-import-from-a-mixed-encoding-file-to-a-postgresql-table – Ramin Zahedi Dec 30 '16 at 08:56
  • Just look at one row and see how it gets split. Verify that you are using a tab. – David דודו Markovitz Dec 30 '16 at 08:58
  • what about string_to_array(rec,'\t') instead? – Ramin Zahedi Dec 30 '16 at 09:01
  • string_to_array(rec,' \t') also didn't work! And yes, it's a tab-separated file. – Ramin Zahedi Dec 30 '16 at 09:16
  • I meant just use a TAB character as the delimiter. You can copy it from Notepad or something like that. – David דודו Markovitz Dec 30 '16 at 09:20
  • I pressed Tab on my keyboard in gedit and copy-pasted it into the command in the shell. The shell didn't show any change; it looked as if I were passing two quotes as the delimiter. Anyway, I ran the command and it still doesn't work! – Ramin Zahedi Dec 30 '16 at 09:38
  • Please take a look here: http://stackoverflow.com/questions/41394050/what-delimiter-should-be-used-for-string-to-array-command-at-postgresql-for-a-ta – Ramin Zahedi Dec 30 '16 at 09:56
  • Your editor might replace tab with space. Try `chr(9)` instead – David דודו Markovitz Dec 30 '16 at 10:40
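
Putting the comment thread above together, a minimal sketch of the staging load (the file path is a placeholder; the control-character delimiter and the chr(9) split are taken from the comments above):

-- load each raw line into the single-column staging table, using a
-- control character that does not occur in the data as the delimiter
\copy Papers_stg from '/path/to/papers.txt' with delimiter E'\x1A'

-- split on an explicit tab; chr(9) avoids editors silently turning
-- a literal tab into spaces
select  rec
from    Papers_stg
where   cardinality(string_to_array(rec, chr(9))) <> 11;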

Your only option is row-by-row processing. Write a shell script (for example) that loops through the input file, sends each row to \copy, checks the execution result, and writes the failed rows to some err_input.txt.

More complicated logic can increase processing speed: load the file in batches instead of row by row, and fall back to row-by-row handling only for the batches that fail.
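
A minimal sketch of that loop (the file names, database name and psql options are my assumptions; it will be slow for 100 million rows, which is why the batched variant is worth the extra logic):

#!/bin/sh
# try to \copy each line on its own; keep every line that fails in err_input.txt
while IFS= read -r line; do
    # pstdin = psql's own standard input, i.e. the pipe feeding it this one line
    if ! printf '%s\n' "$line" | \
         psql -d mydb -q -v ON_ERROR_STOP=1 \
              -c "\copy Papers from pstdin" 2>>copy_errors.log
    then
        printf '%s\n' "$line" >> err_input.txt
    fi
done < papers.txt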

Ghost

Consider using pgloader

Check the BATCHES AND RETRY BEHAVIOUR section of its documentation.

David דודו Markovitz

You could use a BEFORE INSERT trigger and check your criteria there. If a record fails the check, write a log entry (or a row into a separate table) and return NULL. You could even correct some values, where that is possible and feasible.

Of course, if checking the criteria requires other queries (like looking up duplicate keys), you might run into performance problems. But I'm not sure which kind of "different problems in different rows" you mean...

See also an answer on Database Administrators Stack Exchange, and the following example taken from Bartosz Dmytrak on the PostgreSQL forum:

CREATE OR REPLACE FUNCTION "myschema"."checkTriggerFunction" ()
RETURNS TRIGGER
AS
$BODY$
BEGIN
IF EXISTS (SELECT 1 FROM "myschema".mytable WHERE "MyKey" = NEW."MyKey")
THEN
 RETURN NULL;
ELSE
 RETURN NEW;
END IF;
END;
$BODY$
LANGUAGE plpgsql;

and the trigger:

CREATE TRIGGER "checkTrigger"
  BEFORE INSERT
  ON "myschema".mytable
  FOR EACH ROW
  EXECUTE PROCEDURE "myschema"."checkTriggerFunction"();
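
For the "entry into a separate table" variant, a minimal sketch (the rejects table and its name are my assumptions, not part of the original example):

-- same columns as mytable, but without its primary key, so rejects always fit
CREATE TABLE "myschema".mytable_rejects (LIKE "myschema".mytable);

CREATE OR REPLACE FUNCTION "myschema"."checkTriggerFunction" ()
RETURNS TRIGGER
AS
$BODY$
BEGIN
IF EXISTS (SELECT 1 FROM "myschema".mytable WHERE "MyKey" = NEW."MyKey")
THEN
 -- keep a record of the rejected row, then skip the insert
 INSERT INTO "myschema".mytable_rejects SELECT NEW.*;
 RETURN NULL;
ELSE
 RETURN NEW;
END IF;
END;
$BODY$
LANGUAGE plpgsql;
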
Stephan Lechner