1

I'm using Oracle 11g, and I would like to split a column (JobDescription) from the Persons table into separate words.
I.e. If Person A's Job Description is "Professional StackOverflow Contributor", I would like to populate another table with 3 rows containing the 3 words from the Job Description.

From another post here, I managed to get the following which is working for smaller sets of data. But my table contains just less that 500 000 records and the statement has now been running for 2 days and it's still going.

INSERT INTO WORDS (PersonID, Department, Word)
SELECT distinct PersonID, Department, trim(regexp_substr(str, '[^,]+', 1, level)) 
FROM (SELECT PersonID, Department, trim(Replace(JobDescription, ' ', ',')) str 
      FROM Persons) t
CONNECT BY instr(  str  , ',', 1, level - 1) > 0;

Are there another option that might result in quicker results?

Community
  • 1
  • 1
Cameron Castillo
  • 2,712
  • 10
  • 47
  • 77

1 Answers1

0

For a one-off job, I see no reason not to go procedural. This should be quick enough (250 seconds for a 2.5 million row table on my system). Change the size of the varchar2 variables if your words can be longer than 40 characters.

create or replace procedure tmp_split_job as
  TYPE wtype IS TABLE OF NUMBER INDEX BY VARCHAR2(40);
  uwords wtype;
  w varchar2(40);
  i pls_integer;
  n pls_integer;
  p pls_integer;
  cursor c_fetch is select PersonID, Department, JobDescription from Persons where JobDescription is not null;
begin
  for v_row in c_fetch loop
    n := length(v_row.JobDescription);
    i := 1;
    while i <= n loop
      p := instr(v_row.JobDescription, ' ', i);
      if p > 1 then
        w := substr(v_row.JobDescription, i, p-i);
        i := p + 1;
      else
        w := substr(v_row.JobDescription, i);
        i := n + 1;
      end if;
      uwords(w) := 1;
    end loop;
    w := uwords.FIRST;
    while w is not null loop
      insert into words (PersonID, Department, Word) values (v_row.PersonID, v_row.Department, w);
      w := uwords.next(w);
    end loop;
    uwords.DELETE;
  end loop;
end;
/

exec tmp_split_job;
drop procedure tmp_split_job;
Chris
  • 4,133
  • 30
  • 38
  • Thanks. Sadly it is not a once off job, but the procedure might still work even for that purpose. Will give it a try. Tx. – Cameron Castillo May 21 '14 at 12:49
  • Your idea might work but the logic not quite there yet. If I pass the string 'This is the description', I get the result: "This" + "is the" + "the descrip" + "description" – Cameron Castillo May 21 '14 at 13:04
  • Fixed. I admit I did not test this - used java substring semantics (start,end) instead of Oracle's (start,length). – Chris May 21 '14 at 13:41
  • Well, it's not lightning fast, but quicker than the regular expression statement. The proc ran for about 18 hours (1 million words). The regexp is still running after 4 days. Any other improvements will still be welcomed. – Cameron Castillo May 23 '14 at 06:35
  • That seems way too slow. I did a quick check: 2.5M rows with 3 words per jobdescription, resulting in 7.5 million words: 252 seconds. 10g, on a medium fast virtual machine. Do you have constraints on `words`? If so, try disabling or dropping them prior to running the procedure. – Chris May 23 '14 at 08:11
  • I had a primary key constraint. I think the other thing that will cause the biggish time difference is that I amended your script a little to prevent duplicate entries. I added a WHERE NOT EXIST clause. There might be quicker ways of preventing the duplicates. – Cameron Castillo May 23 '14 at 12:05
  • 1
    The easy way: use the unmodified version to insert into a temporary table, followed by an `insert into words select distinct * from tmp_words`. – Chris May 23 '14 at 12:29
  • 1
    I've updated the answer to do the unique filtering in memory using an associative array. – Chris May 23 '14 at 13:27