0

i'm trying to determine the amount of line in a CSV-File which i'm about to import via external table. i need to know the total amount of lines to determine within my procedure to react accordingly.

question is, what is the fastest way to do that?

RootOfProblem
  • 357
  • 1
  • 3
  • 11
  • 1
    The fastest way is to not use oracle. Can you use command line tools? – Bohemian Jun 02 '16 at 11:12
  • yes i can, but for complience reasons was thinking about a way to solve it within oracle – RootOfProblem Jun 03 '16 at 11:28
  • "compliance"? With what? Oracle is less secure than anything else you'd be using. In contrast, the command line is very stable, very well understood, very secure and very well administered. – Bohemian Jun 03 '16 at 11:59
  • sure, you are right but part of the current architecture is work with oracle scheduler and therefor the use of shell scripts is limited, but my answer below kinda reflect a possibility to make use of shell scripts. Is that somehow what you had in mind or do you have another opinion on that matter? i would like to find the most efficient way at least copy some party from it... – RootOfProblem Jun 03 '16 at 20:57
  • Don't use the database for non database stuff, *especially* scheduling! It's all explained in [this answer](http://stackoverflow.com/a/14805785/256196). The same argument holds for your task of counting lines. It's an operational concern, not a database concern, so don't use the database. – Bohemian Jun 04 '16 at 01:06

2 Answers2

1

The two obvious options are to just count the rows from the external table, or to read the file line by line and count those. With small files there isn't much difference, and the line-by-line approach can be slight faster; but with bigger files the native nature of the external table query makes it significantly faster.

But you may be able to use the file size as an approximation for the number of lines - it depends how accurate you need to be. You can do that directly, or via a BFILE, and from a BFILE you can load a CLOB more quickly that UTL_FILE reads the file - and can then count lines by looking for newline characters.

Test block:

declare
  l_time pls_integer;
  l_count pls_integer;
  l_file utl_file.file_type;
  l_line varchar2(32767);
  l_bfile bfile;
  l_clob clob;
  l_dest_offset  pls_integer := 1;
  l_src_offset   pls_integer := 1;
  l_lang_context pls_integer := dbms_lob.default_lang_ctx;
  l_warning      pls_integer;
  l_exists boolean;
  l_file_length pls_integer;
  l_block_size pls_integer;
begin
  l_time := dbms_utility.get_cpu_time;
  select count(*) into l_count from t42_ext;
  dbms_output.put_line('external table:     count ' || l_count
    || ' took ' || (dbms_utility.get_cpu_time - l_time));

  l_time := dbms_utility.get_cpu_time;
  l_count := 0;
  l_file := utl_file.fopen('D42', 'root.dat', 'r', 32767);
  loop
    begin
      utl_file.get_line(l_file, l_line, 32767);
    exception
      when no_data_found then
        exit;
    end;
    l_count := l_count + 1;
  end loop;
  utl_file.fclose(l_file);
  dbms_output.put_line('utl_file read loop: count ' || l_count
    || ' took ' || (dbms_utility.get_cpu_time - l_time));

  l_time := dbms_utility.get_cpu_time;
  utl_file.fgetattr('D42', 'root.dat', l_exists, l_file_length, l_block_size);
  dbms_output.put_line('utl_file fgetattr:  size  ' || l_file_length
    || ' took ' || (dbms_utility.get_cpu_time - l_time));

  l_time := dbms_utility.get_cpu_time;
  l_bfile := bfilename('D42', 'root.dat');
  dbms_output.put_line('bfile getlength:    size  ' || dbms_lob.getLength(l_bfile)
    || ' took ' || (dbms_utility.get_cpu_time - l_time));

  l_time := dbms_utility.get_cpu_time;
  dbms_lob.open(l_bfile, dbms_lob.lob_readonly);
  dbms_lob.createtemporary(l_clob, false);
  dbms_lob.loadclobfromfile(l_clob, l_bfile, dbms_lob.getLength(l_bfile),
    l_dest_offset, l_src_offset, dbms_lob.default_csid, l_lang_context, l_warning );
  dbms_lob.close(l_bfile);
  dbms_output.put_line('clob getlength:     size  ' || dbms_lob.getLength(l_bfile)
    || ' took ' || (dbms_utility.get_cpu_time - l_time));

  l_time := dbms_utility.get_cpu_time;
  dbms_output.put_line('clob length diff:   count '
    || (length(l_clob) - length(replace(l_clob, chr(10))))
    || ' took ' || (dbms_utility.get_cpu_time - l_time));

  l_time := dbms_utility.get_cpu_time;
  dbms_output.put_line('clob regexp_count:  count '
    || regexp_count(l_clob, chr(10))
    || ' took ' || (dbms_utility.get_cpu_time - l_time));
  dbms_lob.freetemporary(l_clob);
end;
/

Running against two sample files (11gR2 on Linux):

PL/SQL procedure successfully completed.

external table:     count 1024 took 1
utl_file read loop: count 1024 took 1
utl_file fgetattr:  size  189440 took 0
bfile getlength:    size  189440 took 0
clob getlength:     size  189440 took 2
clob length diff:   count 1024 took 0
clob regexp_count:  count 1024 took 22

PL/SQL procedure successfully completed.

external table:     count 1048576 took 85
utl_file read loop: count 1048576 took 1403
utl_file fgetattr:  size  193986560 took 0
bfile getlength:    size  193986560 took 0
clob getlength:     size  193986560 took 742
clob length diff:   count 1048576 took 374
clob regexp_count:  count 1048576 took 21808

There may be some effects from operating system and/or storage device file caches too. But based on this test, checking the file size is quickest, followed fairly closely by doing a native count of the rows in the external table. Counting the line breaks in a loaded CLOB is significantly slower, and a line-by-line count via UTL_FILE is significantly slower still (not surprisingly). Using a regular expression count of a CLOB is too slow to consider, even for smaller files really.

You should test on your environment and with realistic data though.

Alex Poole
  • 183,384
  • 11
  • 179
  • 318
0

i solved it the following way

  1. i implemented a shell script (rowcount.sh) that resides on the db server with the following content

    /usr/bin/cat $1 | /usr/bin/wc -l
    
  2. created a directory inside oracle

    CREATE DIRECTORY exec_dir AS '/home/oracle/';
    GRANT READ, EXECUTE ON DIRECTORY exec_dir TO stc;
    
  3. implemented the following function to make use of external tables and the functionality of the preprocesser.

    CREATE OR REPLACE PROCEDURE  file_row_count(filename IN VARCHAR2, line_count OUT NUMBER)
    IS
        l_random_table_name VARCHAR2(30):= 'import_'||SUBSTR(SYS_GUID(),1,15);
        l_count NUMBER;
    BEGIN
        EXECUTE IMMEDIATE 'CREATE TABLE '||l_random_table_name||'(
                                                line_count NUMBER
                                            )
                                            ORGANIZATION EXTERNAL (
                                                     TYPE ORACLE_LOADER
                                                     DEFAULT DIRECTORY EXT_CSV_DIR
                                                     ACCESS PARAMETERS ( 
                                                             RECORDS DELIMITED BY NEWLINE
                                                             PREPROCESSOR exec_dir:''rowcount.sh''
                                                             NOBADFILE
                                                             NODISCARDFILE
                                                             NOLOGFILE
                                                             FIELDS
                                                        )
                                                     LOCATION (EXT_CSV_DIR:'''||filename||''')
                                                )
                                            REJECT LIMIT UNLIMITED
                                            NOPARALLEL
                                            NOMONITORING'
        ;
    
        EXECUTE IMMEDIATE 'SELECT LINE_COUNT FROM '||l_random_table_name INTO line_count; 
        EXECUTE IMMEDIATE 'DROP TABLE '||l_random_table_name;
    EXCEPTION
        WHEN OTHERS THEN 
            dbms_output.put_line(to_char(SQLERRM));
    
            dbms_output.put_line('try to drop temp table'||l_random_table_name);
            SELECT COUNT(*) INTO l_count
            FROM SYS.USER_TABLES
            WHERE table_name = UPPER(l_random_table_name)
            ;
    
            IF l_count > 0 THEN 
                EXECUTE IMMEDIATE 'DROP TABLE '||l_random_table_name;
            END IF;
    END file_row_count;
    /
    
  4. make use of the function for files you know you have to import

    DECLARE
        L_FILENAME VARCHAR2(200);
        L_LINE_COUNT NUMBER;
    BEGIN
        L_FILENAME := 'data0.csv';
    
        FILE_ROW_COUNT(
            FILENAME => L_FILENAME,
            LINE_COUNT => L_LINE_COUNT
        );
    
        DBMS_OUTPUT.PUT_LINE('LINE_COUNT = ' || L_LINE_COUNT);
    
    --rollback; l
    END;
    
RootOfProblem
  • 357
  • 1
  • 3
  • 11