
I am working on a daily automated job which checks the records in a text file against an Oracle database. We receive the text file every day from an external team; it contains around 100,000 records. The text file is in Unix format and has 6 columns separated by the | symbol.

e.g.,

HDR 1

home/sample/file|testvalue1|testvalue2|testval3|testval4|testval5

TRL

I need to check whether the values in testval3 and testval5 exist in a table in the Oracle database. The table has around 10 million records. I am currently processing the file through a shell script: I read the text file and iterate through each line in a loop, and for each line I pass its values into a query that I run against the DB. If the record doesn't exist in the DB, I have to output it to a CSV file. The query below is used:

select 'testval3', 'testval5' from dual
where not exists (select primarykeycolumn
                  from mytable
                  where mycolumn = testval3 and mycolumn2 = testval5)

Since the input file has 100,000 entries, my loop runs the query 100,000 times, and every run checks the table with 10 million records. This makes my batch job run for many hours and I have to terminate it. Is there a better way to handle this situation? I can also use Java if there is no better way to do this via shell script.

Busybee
  • Are you using UTL_FILE? Which version of Oracle are you using? –  May 11 '18 at 04:35
  • It seems like you have chosen the least efficient way of doing it. Have you heard of SQL*Loader, external tables, etc.? My recommendation is to first create a table in the database with the 2 columns and load your file into it using SQL*Loader. Then, using a single query, you could extract all you need. External tables and UTL_FILE require directory permissions, unlike the loader (an illustrative external-table sketch appears after these comments). Go for it if you can get it easily. – Kaushik Nayak May 11 '18 at 04:50
  • @KaushikNayak He is perfectly aware that his way is inefficient; that's why he is asking a question here in the first place. You could give the same value by just omitting the first two sentences. – Vinko Vrsalovic May 11 '18 at 04:54
  • @Busybee: I have posted an answer; maybe it is what you are looking for. –  May 11 '18 at 05:00
  • @VinkoVrsalovic: How do you know it's "he"? Furthermore, I don't know if such questions are rare under the java tag, but under Oracle, things like SQL*Loader are basics one would expect the OP to be aware of. A design that requires connecting to the database a million times doesn't impress me. Hence the question: why is it being used at all? – Kaushik Nayak May 11 '18 at 05:07
  • @KaushikNayak I don't know their gender; I used "he" as the default. Anyway, we have all been newbies once; there's actually no value in rubbing in newbies' faces how new they are. They likely know it, and it's good that they ask for help. We need to be more welcoming, especially regarding new people on the site. That's all I'm saying. If you are offended by people not knowing the basics, then skip it! You don't know their circumstances; this person might have been tasked with this without being a DBA or a programmer at all, for example. – Vinko Vrsalovic May 11 '18 at 05:19
  • @VinkoVrsalovic: I generally want to end futile arguments, and that's it. But before I finish, all I can say is that if I weren't welcoming to newbies, I wouldn't even have commented or provided all the solutions for free. – Kaushik Nayak May 11 '18 at 05:30
  • @kaushiknayak doing things for free has nothing to do with being welcoming. – Vinko Vrsalovic May 15 '18 at 08:26
  • @KaushikNayak Sorry, I didn't mention more details in my original question. The reason I went for a shell script job is that the original requirement was for around 1,000 records, and after it went from dev to production the records started coming in at 100,000. Creating a staging table was not an option given to me in the beginning, as it has to go through an additional set of approvals, which will take time. – Busybee May 17 '18 at 03:00
  • Seems like the best way to do this is through creating the staging table, so I will try this approach. If I do not get the approval, I can go for Vinko Vrsalovic's approach. Thanks for the suggestions! – Busybee May 17 '18 at 03:02
  • @Busybee: I think you should really try to persuade the approvers/management to get it working for you, because one should not compromise on efficiency and performance due to non-cooperation between module owners. You might end up with workaround solutions that are bound to cause you trouble in the future. All the best! – Kaushik Nayak May 17 '18 at 03:12
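
As mentioned in the comments, an external table is an alternative to SQL*Loader when a directory object is available. The definition below is only an illustrative sketch: the table name daily_feed_ext, the directory object feed_dir and the file name are assumptions, and the HDR/TRL header and trailer lines would still need to be filtered out (in the query, or with a LOAD WHEN clause).

-- Illustrative only; requires a directory object pointing at the folder where the daily file lands
create table daily_feed_ext (
  filepath   varchar2(255),
  testvalue1 varchar2(255),
  testvalue2 varchar2(255),
  testval3   varchar2(255),
  testval4   varchar2(255),
  testval5   varchar2(255)
)
organization external (
  type oracle_loader
  default directory feed_dir
  access parameters (
    records delimited by newline
    fields terminated by '|'
    missing field values are null
  )
  location ('daily_feed.txt')
)
reject limit unlimited;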

2 Answers


Below is a simple solution that avoids the timeout and means you don't have to scan the 10-million-row table 100K times.

One-time setup: create a staging table:

create table a_staging_table(
filepath   varchar2(255),  -- first field of each line (the file path); column name is illustrative
testvalue1 varchar2(255),
testvalue2 varchar2(255),
testval3   varchar2(255),
testval4   varchar2(255),
testval5   varchar2(255)
);

----Recurring process

Load your "CSV/TEXT" data into staging table:

some_file_name.ctl: this file contains the LOAD DATA command below.

load data
INFILE 'home/sample/file.csv'
APPEND
INTO TABLE a_staging_table
FIELDS TERMINATED BY '|'
(filepath, testvalue1, testvalue2, testval3, testval4, testval5)

Now run SQL*Loader to load your data into the staging table:

sqlldr userid=dbUserName/dbUserPassword control=some_file_name.ctl log=some_file_name.log

Your data is now loaded into the staging table. Next, join the staging table with your_original_table to identify the records that do not exist there.

First Way: spool the output of the SQL below using SQL*Plus:

select s.testval3,testval5
        from (select distinct testval3,testval5 
            from a_staging_table) s 
        where not exists
        (select 1
          from your_original_table
           where mycolumn1=s.testval3
             and mycolumn2=s.testval5);
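
To produce the CSV, a minimal SQL*Plus spool sketch is shown below; the script name missing_rows.sql, the spool file name and the formatting settings are assumptions to adapt to your environment.

-- missing_rows.sql (illustrative; run from the shell as: sqlplus -s dbUserName/dbUserPassword @missing_rows.sql)
set heading off feedback off pagesize 0 trimspool on linesize 32767
spool missing_rows.csv

select s.testval3 || ',' || s.testval5
  from (select distinct testval3, testval5
          from a_staging_table) s
 where not exists
       (select 1
          from your_original_table
         where mycolumn1 = s.testval3
           and mycolumn2 = s.testval5);

spool off
exit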

Second Way:

Begin
  for x in (
      select s.testval3, s.testval5
      from (select distinct testval3, testval5
              from a_staging_table) s
      where not exists
            (select 1
               from your_original_table
              where mycolumn1 = s.testval3
                and mycolumn2 = s.testval5)
      ) loop

    DBMS_OUTPUT.put_line('testval3: ' || x.testval3 || '      ------     ' || 'testval5: ' || x.testval5);
    -- Write all these non-matching values into another file, saying that these
    -- are the values that do not match, using UTL_FILE (see the sketch below).

  end loop;

  -- Finally, truncate the table a_staging_table so that this data is not
  -- available next time, when the process runs again with a different file.
End;
/
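
If you prefer the file to be written by the database itself rather than via DBMS_OUTPUT, here is a hedged sketch of the same loop using UTL_FILE; the directory object OUT_DIR and the output file name are assumptions, and the directory must already exist and be writable by your schema.

DECLARE
  fh UTL_FILE.FILE_TYPE;
BEGIN
  -- OUT_DIR is an assumed, pre-created Oracle directory object
  fh := UTL_FILE.FOPEN('OUT_DIR', 'missing_rows.csv', 'w');

  FOR x IN (select s.testval3, s.testval5
              from (select distinct testval3, testval5
                      from a_staging_table) s
             where not exists
                   (select 1
                      from your_original_table
                     where mycolumn1 = s.testval3
                       and mycolumn2 = s.testval5))
  LOOP
    UTL_FILE.PUT_LINE(fh, x.testval3 || ',' || x.testval5);
  END LOOP;

  UTL_FILE.FCLOSE(fh);

  -- Clear the staging table so the next day's file starts from scratch
  EXECUTE IMMEDIATE 'truncate table a_staging_table';
END;
/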
  • There is no need to suggest a loop again. The OP can spool the output to a csv. Also, the `group by` can be replaced with `distinct` – Kaushik Nayak May 11 '18 at 05:09
  • @KaushikNayak: thanks for pointing out, I have replaced the GROUP BY with DISTINCT –  May 11 '18 at 05:16
  • You're welcome. But my main concern was with the `loop`, which wouldn't be required. If the OP can use SQL*Loader, then they could also use the sqlplus command line from the shell script to simply spool the output. – Kaushik Nayak May 11 '18 at 05:20
  • @KKK, thanks for the detailed explanation. Sorry I didn't mention this first: the testval5 value will be some values separated by commas. For example, home/sample/file|testvalue1|testvalue2|testval3|testval4|test1,test2 – Busybee May 17 '18 at 03:07
  • For example, home/sample/file|testvalue1|testvalue2|testval3|testval4|test1,test2,test3. The test1, test2, test3 values are varchar2 and I need to check them against the 'mycolumn2' you mentioned in the query. The mycolumn2 in the original_table always has a single value like test1 or test2. Is there a way to pass the values to the staging table like this: testvalue1,testvalue2,testval3,testval4,test1 as one row, testvalue1,testvalue2,testval3,testval4,test2 as another, and so on? – Busybee May 17 '18 at 03:30
  • @KKK in some_file_name.ctl, is there a way to pass the file name separately from the Unix job? There will be a different file with a different name every day. For example, the content below will be in a file named ABCD_xx_xxdatetimestamp: HDR 1 home/sample/file|testvalue1|testvalue2|testval3|testval4|test1,test2 TRL. The next day we get a different file, ABCD_x1x2_x3x4datetimestamp. – Busybee May 17 '18 at 03:38
  • For Dynamic file name: https://stackoverflow.com/questions/27250002/insert-timestanp-of-infile-into-a-column-from-sqlloader/27280827#27280827 –  May 17 '18 at 04:26
  • For your first case, as you mentioned in the comment, replace "and mycolumn2=s.testval5" in both ways with "and mycolumn2 in (select regexp_substr(s.testval5, '[^,]+', 1, level) from dual connect by regexp_substr(s.testval5, '[^,]+', 1, level) is not null)" (a full version of this query is sketched after these comments). –  May 17 '18 at 04:38
  • If the information is useful to you, you may accept the answer and upvote :) –  May 17 '18 at 04:48
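
Putting the comment above together with the answer's query, a hedged sketch of the modified NOT EXISTS for comma-separated testval5 values might look like this; the table and column names are the ones assumed earlier in this answer.

select s.testval3, s.testval5
  from (select distinct testval3, testval5
          from a_staging_table) s
 where not exists
       (select 1
          from your_original_table
         where mycolumn1 = s.testval3
           and mycolumn2 in
               (select regexp_substr(s.testval5, '[^,]+', 1, level)
                  from dual
               connect by regexp_substr(s.testval5, '[^,]+', 1, level) is not null));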

A fast way to do this would be to gather all available testval3 and testval5 combinations from the table at the beginning of the script and store them in a hash table or similar structure, so that when you read each line you can easily query a local in-memory data structure.

It will use more memory, for sure, but it will run a single validation query and speed the program up many times.

The query to run would be select distinct mycolumn,mycolumn2 from mytable or an equivalent.


So, in summary, the mechanism I'm proposing is:

  1. Run a query to select all the distinct pairs of testval3 and testval5 from the table

  2. Create a hash table and a specific Pair structure to store in it. For example, in Java you could use the approach from "A Java collection of value pairs? (tuples?)" and then, if the types of your columns are strings, use something like

HashMap<Pair<String,String>, Boolean> pairMap

Make sure to implement the hashCode and equals methods as they are in that other answer, so you can use the pair as a key in the map properly (if using Java or similar).

  3. Store the result of the query in your hash table, having the pairs of testval3 and testval5 as keys and true as values (how to iterate over the result set is left as an exercise for the reader):

pairMap.put(new Pair<String,String>(testval3, testval5), true);

  4. Read the file line by line.
  5. For each line, check whether the pair testval3 and testval5 exists in your hash table; if it doesn't, output the line to the CSV. To do that, you can simply query your map and check for null (Key existence check in HashMap).

For example:

 if (pairMap.get(new Pair<String,String>(testval3,testval5)) == null) {
 //output to CSV
 }

Finally, another option, as @Kaushik Nayak says and @Vivek implies in the comments, is to load the file into Oracle with its data-loading tools and then run a single query for the values that don't exist.

Vinko Vrsalovic