
I'm currently attempting to come up with a solution to the following problem:

I have been tasked with parsing large (roughly 3,500 lines, ~300 KB) pipe-delimited text files and comparing them line by line against corresponding codes in our database. An example of a file would be:

File name: 015_A.txt

File content (example shows only 4 lines):

 015|6999|Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.|1|1|0|0|2016/01/01
 015|3715|It has roots in a piece of classical Latin literature from 45 BC|1|1|213.5|213.5|2016/01/01
 015|3724|Making it over 2000 years old.|1|1|617.4|617.4|2016/01/01
 015|4028|Words will go here.|1|1|74|74|2016/01/01

I have already built a web interface that allows a file to be selected in the browser and then uploaded to the server.

From each line of the example pipe file above I will only be using these two fields:

Code (using above line 1 as an example: 6999)

Price (using above line 1 as an example: 0)
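
To make the extraction concrete, here is a minimal parsing sketch (Python, purely for illustration since I haven't settled on a server-side approach; I am assuming the price is the sixth pipe-separated field, since the sample shows two identical price columns):

    from decimal import Decimal

    def parse_line(line):
        # Field layout assumed from the sample:
        # branch|code|description|?|?|price|price|date
        fields = line.strip().split("|")
        code = fields[1]            # e.g. '6999'
        price = Decimal(fields[5])  # e.g. Decimal('0'); Decimal avoids float rounding on money
        return code, price

    with open("015_A.txt", encoding="utf-8") as f:
        rows = [parse_line(line) for line in f if line.strip()]

    print(rows[0])  # ('6999', Decimal('0'))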

I would then (to my mind, though I'm not sure it's the best method) need to run a query against our database (MSSQL) for each line, for example:

SELECT t.Price
FROM table t
WHERE t.code = '6999'

If t.Price equals 0, then line 1 has passed, as it is equal to the source file.
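
A rough sketch of that per-line check (Python with pyodbc, again purely for illustration; the connection string, table, and column names are placeholders):

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;UID=user;PWD=secret"  # placeholder
    )
    cursor = conn.cursor()

    failures = []
    for code, price in rows:  # (code, price) pairs parsed from the file as above
        cursor.execute("SELECT Price FROM PriceTable WHERE Code = ?", code)
        row = cursor.fetchone()
        if row is None or row.Price != price:  # missing code or mismatched price
            failures.append((code, price, row.Price if row else None))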

This is where I felt I needed to ask for some advice, as I am sure there are many ways to tackle this problem and I would like to be pointed toward an efficient approach. What is the best method of parsing the file? Do I run a query per code, or rather a single SQL statement using an IN clause, and then compare every code and price? Or should I scrap this idea and use some form of pure SQL import tool, bearing in mind I have a pipe-delimited file to deal with / import?

Any advice would be greatly appreciated.

BernardV

1 Answer


Your story appears to end somewhat prematurely. Is checking that the values in the database match the values in the file the only thing this script should do? If so, it would be simpler just to extract the data from the database and overwrite the file. If not, then this implies you need to retain some record of the variations.

This has some bearing on the approach taken to the reconciliation: running 3,500 queries against the database is going to take some time, mostly spent on the network and in query parsing (i.e. wasted). OTOH, comparing 3,500 records in a single SELECT to find mismatches will take no time at all.

The problem is that your data is out at the client, and uploading via a browser only gets it halfway to the database. If you create another table on the database (not a temporary table; add a column to represent the file), it is possible to INSERT multiple rows in a single DML statement. Really, though, you should batch them up in lots of 100 or so records, meaning you only need to execute around 35 queries to complete the operation, and you've got a record of the data in the database, which simplifies how you report the mismatches.
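
As a rough sketch of the idea (Python with pyodbc, purely for illustration; upload_staging, PriceTable, and their columns are assumed names to adapt to your schema, and executemany with fast_executemany stands in for a hand-built multi-row INSERT):

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;UID=user;PWD=secret"  # placeholder
    )
    cursor = conn.cursor()
    cursor.fast_executemany = True  # send each batch in one round trip

    # Stage the parsed (code, price) pairs, tagged with the file name,
    # in batches of ~100 rows.
    BATCH = 100
    params = [("015_A.txt", code, price) for code, price in rows]
    for i in range(0, len(params), BATCH):
        cursor.executemany(
            "INSERT INTO upload_staging (FileName, Code, Price) VALUES (?, ?, ?)",
            params[i:i + BATCH],
        )
    conn.commit()

    # One SELECT now reports every mismatch, including codes missing
    # from the price table entirely.
    cursor.execute("""
        SELECT s.Code, s.Price AS FilePrice, t.Price AS DbPrice
        FROM upload_staging s
        LEFT JOIN PriceTable t ON t.Code = s.Code
        WHERE s.FileName = ?
          AND (t.Price IS NULL OR t.Price <> s.Price)
    """, "015_A.txt")
    for mismatch in cursor.fetchall():
        print(mismatch.Code, mismatch.FilePrice, mismatch.DbPrice)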

You probably should not use the DBMS-supplied utilities for direct import unless you ABSOLUTELY trust the source data.

symcbean
  • Agree, same as my answer. – plonknimbuzz Nov 15 '16 at 13:01
  • Thanks for the response; I will try to elaborate further. These .txt pipe files are provided to us by another business. We then use existing methods to import many of these files into our DB, which updates pricing on our front ends. To that end, I have been tasked with helping our QA effort by verifying that, after an import is completed, all codes and their corresponding prices are in the DB and match what was imported. – BernardV Nov 15 '16 at 15:16
  • That's a different question to the one you asked; you also need to ensure that the files contain all the records they should (which requires loading the data into the database) and don't contain records they shouldn't. And it also implies a very wide feedback loop (i.e. a bad QA process). – symcbean Nov 15 '16 at 15:19