
We have some customer data which started in a separate data-store. I have a consolidation script to standardize and migrate it into our core DB. There are somewhere around 60,000-70,000 records being migrated.

Naturally, there was a little bug, and it failed around row 9k.
My next trick is to make the script able to pick up where it left off when it is run again.


FYI:
The source records are pretty icky, and split over 5 tables by which brand they purchased ... i.e.:

CREATE TABLE `brand1_custs` (`id` int(9), `company_name` varchar(112), etc...)
CREATE TABLE `brand2_custs` (`id` int(9), `company_name` varchar(112), etc...)

Of course, a given company name can (and does) exist in multiple source tables.


Anyhow ... I used the ParseCSV lib for logging, and each row gets logged if successfully migrated (some rows get skipped if they are just too ugly to parse programmatically). When opening the log back up with ParseCSV, it comes in looking like:

array(
  0 => array( 'row_id'       => '1',
              'company_name' => 'Cust A',
              'blah'         => 'blah',
              'source_tbl'   => 'brand1_cust'
      ),
  1 => array( 'row_id'       => '2',
              'company_name' => 'customer B',
              'blah'         => 'blah',
              'source_tbl'   => 'brand1_cust'
      ),
  2 => array( 'row_id'       => '1',
              'company_name' => 'Cust A',
              'blah'         => 'blah',
              'source_tbl'   => 'brand2_cust'
      ),
  etc...
)
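
For reference, pulling the log back in looks roughly like this (the filename is made up):

require_once 'parsecsv.lib.php';

// parseCSV parses the file on construction and exposes the rows
// as an array of associative arrays in ->data
$csv = new parseCSV('migration_log.csv');
$log = $csv->data;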


My current workflow is along the lines of:

foreach ($source_tables AS $src) {
    $results = // get all rows from $src
    foreach ($results AS $row) {
        // heavy lifting
    }
}


My plan is to check the $row->id and $src->tbl combination for a match against the $log[?x?]['row_id'] and $log[?x?]['source_tbl'] combination.

In order to achieve that, I would have to run a foreach($log AS $xyz) loop inside the foreach($results AS $row) loop, and skip any rows which have already been migrated (otherwise they would get duplicated).
That seems like a LOT of looping to me.
What about when we get up around record # 40 or 50 thousand?
That would be 50k x 50k (2.5 billion) comparisons!!
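
Spelled out, the check I have in mind would be something like this inside the inner loop (treating $src as the table-name string):

// Naive version: scan the whole log for every source row
$already_migrated = false;
foreach ($log as $logged) {
    if ($logged['row_id'] === (string) $row->id
            && $logged['source_tbl'] === $src) {
        $already_migrated = true;
        break;
    }
}
if ($already_migrated) {
    continue; // skip rows that were logged on the previous run
}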

Question:
Is there a better way for me to check if a sub-array has a "row_id" and "source_tbl" match other than looping each time?


NOTE: as always, if there's a completely different way I should be thinking about this, I'm open to any and all suggestions :)

mOrloff
  • You might wanna check out http://stackoverflow.com/questions/163336/slicing-a-multi-dimensional-php-array-across-one-of-its-elements – Ian Mar 27 '13 at 16:23
  • That link put me on the scent of some VERY useful info. Thanks-a-bunch. – mOrloff Mar 27 '13 at 16:46

1 Answer


I think you should preprocess the log: build a hash (or composite key) from row_id and source_tbl and store it in a hashmap. Then, for each row, construct the same key and check whether it is already defined in the hashmap.

I am telling you to use a hashed set because you can search it in effectively constant time; otherwise it would be the same as what you are proposing, only with cleaner code.
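
A rough sketch of the idea, reusing the structures from your question ($source_tables, the fetch helper, and the '|' separator are just placeholders):

// Preprocess the log once: one composite key per already-migrated row
$migrated = array();
foreach ($log as $entry) {
    $migrated[$entry['source_tbl'] . '|' . $entry['row_id']] = true;
}

// Migration loop: the inner log scan becomes a constant-time isset()
foreach ($source_tables as $src) {
    $results = get_rows_from($src); // hypothetical fetch helper
    foreach ($results as $row) {
        if (isset($migrated[$src . '|' . $row->id])) {
            continue; // already migrated on a previous run
        }
        // heavy lifting
    }
}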

frisco
  • Bingo! This is a perfect example of _"a completely different way I should be thinking about this"_. Thanks. – mOrloff Mar 27 '13 at 16:47