
I have to download a ~250 MB password-protected ZIP file over FTP. Once downloaded, I have to unzip it with a known password; the ZIP contains a 1.5 GB MS Access database that I have to read, join with some tables in my Oracle DB, and transform and load that data into Oracle.

I'm looking for the best way to do this process. I'm a C# developer, so my first thought was to use C#: download the file via FtpClient or FtpWebRequest, unzip it with a library like DotNetZip, open the MS Access database via ODBC, and load the records into Oracle with ODP.NET. That's my "easy way", because I know how to do it.

But since this is a big file, I know the process could take a long time, so I'm concerned about efficiency and how to reduce that time.

So I'm thinking that doing everything directly in Oracle (downloading the file from the FTP there, unzipping it there, and processing the information there) should reduce the total time, since it avoids passing the data record by record from C# to Oracle, but I'm not sure this is the correct way of doing it.

So I started to look into Oracle libraries that could do what I'm trying to achieve, and I found the PLSQL-utils, which seem to do everything I need except reading the MS Access database. Looking into that, I found Heterogeneous Services, but I have never used them, so I'm a little lost there.

I also heard once that I could use Java directly from Oracle, and I know Java can connect to MS Access via JDBC. So I searched and found something about calling Java methods in the Oracle Database.

That's what I have so far, but I don't know which method I should use. As far as I know, an RDBMS is meant for processing data, not for programming things like downloading files; that's why we have OOP languages.

As additional information, this process is going to run once or twice a month, so I have to schedule it: if it's in Oracle, that can easily be done with a scheduled job; in C#, with a Scheduled Task or a Windows Service (those are the tools that I know).

Some restrictions that I have:

  • My client doesn't have MS SQL Server and can't buy a license for it (so I cannot use DTSX packages for this process)
  • On the Oracle production server I may not have enough permissions to do everything, but I can ask for whatever is best for the process
  • If a backend server (Java, or C# hosted on IIS, WebLogic, JBoss, or any kind) is required, that server and the Oracle server would be different machines
  • The Oracle database is hosted on a Unix server

Having said all of this: how can I do this process efficiently? Should I use .NET and load record by record into my Oracle database? Should I do everything in Oracle? Or neither? Is there a better way to do this?

Hector Sanchez
  • Have you searched about Oracle database links? There is a way to connect Oracle to another database; look at this post http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:4406709207206 (there is a link pointing to the documentation, but it is broken, so you will have to find where it is). – Jorge Campos Oct 18 '13 at 16:20
  • Yes I did; I think it's in my comment about Heterogeneous Services, referring to dblink – Hector Sanchez Oct 18 '13 at 16:48

7 Answers

2

I think you're on the right track with a C# console application to make it a repeatable process. Here is a great free library I've used for ZIP handling on many projects: DotNetZip (used below).

// Download the ZIP over FTP (WebClient supports ftp:// URIs).
using (var client = new WebClient())
using (var stream = client.OpenRead(@"ftp://mysite.com/mydb.zip"))
using (var file = File.Create(@"c:\temp\mydb.zip"))
{
    stream.CopyTo(file, 32000);   // copy the response stream to the local file
}

// Extract the password-protected Access db with DotNetZip.
using (ZipFile zip = ZipFile.Read(@"c:\temp\mydb.zip"))
{
    ZipEntry e = zip["bigdb.mdb"];
    e.ExtractWithPassword(@"c:\temp", "yourpassword");   // target directory, then password
}

Once unpacked, you can create a data connection to the Access DB and a data reader. Then use the reader to stream rows out to a flat file (this avoids out-of-memory exceptions with large data sets).

string constr = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=yourdbfile.mdb;Jet OLEDB:Database Password=yourpassword;";
string query = "SELECT * FROM [YourTable]";

using (OleDbConnection conn = new OleDbConnection(constr))
using (OleDbCommand cmd = new OleDbCommand(query, conn))
{
    conn.Open();
    using (OleDbDataReader reader = cmd.ExecuteReader())
    {
        int rowNum = 0;
        StringBuilder sb = new StringBuilder();
        while (reader.Read())
        {
            // write rows to the flat file in chunks of 10K rows.
            sb.Append(reader["FieldA"].ToString() + "|");
            sb.Append(reader["FieldB"].ToString() + "|");
            sb.Append(reader["FieldC"].ToString() + Environment.NewLine);

            rowNum++;
            if (rowNum % 10000 == 0)
            {
                File.AppendAllText(@"c:\temp\data.psv", sb.ToString());
                sb = new StringBuilder();
            }
        }
        File.AppendAllText(@"c:\temp\data.psv", sb.ToString());
    }
}

After the flat file is written, you can bulk load it into Oracle. I would not suggest inserting data row by row; that will be incredibly slow and it will bloat your Oracle redo logs. I don't believe Oracle 10g has a .NET driver that supports bulk loading, so you'll probably need to bulk load via a flat file.

Next, import into Oracle via the command line; you can invoke this from your C# console app (a sketch follows the command below). Before you do this you'll need to have created a control file, ctl.ldr, which SQL*Loader uses for bulk load operations.

OPTIONS (SKIP=1)
LOAD DATA
INFILE 'c:\temp\data.psv'
INTO TABLE tblTest
APPEND
FIELDS TERMINATED BY "|" OPTIONALLY ENCLOSED BY '"'
(fielda, fieldb, etc...)

Then run it from the command line as follows:

sqlldr username/pswd@oracle_sid control=ctl.ldr
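
If you want the whole job to stay inside the C# console app, you can shell out to sqlldr with System.Diagnostics.Process and check its exit code. A minimal sketch, assuming sqlldr is on the PATH; the log file name is just an illustration:

// Sketch: invoking SQL*Loader from the console app.
using System;
using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "sqlldr",
    Arguments = "username/pswd@oracle_sid control=ctl.ldr log=load.log",
    UseShellExecute = false,
    RedirectStandardOutput = true
};

using (var proc = Process.Start(psi))
{
    string output = proc.StandardOutput.ReadToEnd();
    proc.WaitForExit();
    if (proc.ExitCode != 0)   // non-zero signals errors or rejected rows
        throw new Exception("sqlldr failed:" + Environment.NewLine + output);
}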

Hopefully this helps, good luck!

[Edit]

You might also have a look at the .NET OracleBulkCopy class, which shipped with the Oracle 11g client drivers; perhaps it will still work against your 10g server. A potential problem there is that all the other apps on that same application server would need to work with the newer 11g client drivers too. Another option is to build a Java application that uses the Jena framework, which supports bulk loading.
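
For illustration, a minimal sketch of the OracleBulkCopy route, assuming the 11g ODP.NET client (Oracle.DataAccess) is installed; the Oracle connection string and destination table name are placeholders, and constr is the Access connection string from the code above:

// Sketch: streaming the Access rows straight into Oracle with
// OracleBulkCopy (ships with the 11g ODP.NET client). This skips
// the flat-file step entirely. Names below are placeholders.
using System.Data.OleDb;
using Oracle.DataAccess.Client;

using (var conn = new OleDbConnection(constr))
using (var cmd = new OleDbCommand("SELECT * FROM [YourTable]", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    using (var bulk = new OracleBulkCopy("User Id=...;Password=...;Data Source=oracle_sid"))
    {
        bulk.DestinationTableName = "TBLTEST";
        bulk.BatchSize = 10000;        // rows per batch
        bulk.WriteToServer(reader);    // source column order must match the target table
    }
}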

James
  • I think `a.Fill(t1)` would throw an `OutOfMemoryException`; I have used it before with a fill of 250k records and it failed. This Access file weighs 1.5 GB, which could be millions of records. The other, Oracle-side way seems very reasonable; let me take a look at that. Thanks for your effort – Hector Sanchez Oct 18 '13 at 16:21
  • Good point, you don't need to use a datatable, you can use a datareader and read the rows in a streaming manner, it just takes more code. – James Oct 18 '13 at 16:31
  • Thanks! Remember it's an FTP, but thanks for your recommendation of the WebClient; I think I'll be using FtpWebRequest – Hector Sanchez Oct 18 '13 at 16:33
  • Check my last edit; I re-worked the flat file export to use a datareader with a chunked write pattern. Increase the chunk size for better performance, taking into consideration the available RAM for the StringBuilder. – James Oct 18 '13 at 16:48
  • Thanks again, and sorry about this: I forgot to mention in the question (I just edited it) that my back-end server and database server would be in different locations. I like your approach, but for the flat file, should I copy it over or generate it directly on the Oracle server? Which one would be faster? – Hector Sanchez Oct 18 '13 at 16:54
  • Could you change the output location of the .psv file, to a file share on your db server and have a cron job on your Oracle server that imports new flat files in that directory on a recurring basis? You can also invoke remote commands via process object in C# too. – James Oct 18 '13 at 16:59
  • For my project I can't; per client rules I can't have a shared location. But for this question let's say that I can: would this be faster than processing everything on the Oracle server? I thank you very much for your effort, but I think that would impact two areas: I would have another file of considerable size (the psv file [I think you are referring to csv instead of psv]), and I would pay for the time spent consuming the bandwidth, not to mention that opening the Access file in C# and creating the flat file takes time (maybe a long time), and I still have to join the tables later – Hector Sanchez Oct 18 '13 at 17:54
  • .psv is for `pipe`-separated values; comma-separated files are almost always problematic for import. It really should not take long at all to read out hundreds of thousands of records and write them to a flat file once the Access db is downloaded; I would guess less than 10 minutes. The 1.5 GB file download is a bigger problem IMO. Perhaps you should take a step back and forget about large Access files. Where is this data coming from? Can they expose it in a better way, or just give you flat files to begin with? – James Oct 18 '13 at 18:00
  • Really, I'm downloading a 250 MB zip file that has a 1.5 GB mdb file inside. This is a public FTP server with password access. Can it be exposed in a better way? My process has to be automatic; the government puts that zip file up for everyone, not just my client, but the people that have access to it have their own credentials. I think the download step is unavoidable, and so is the format of the file. The thing I see is that we are just converting it to psv, copying it to somewhere the Oracle server can see it, and loading it again (repeating work, for me), when maybe it could be accomplished with a dblink – Hector Sanchez Oct 18 '13 at 18:17
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/39517/discussion-between-james-and-mr) – James Oct 18 '13 at 18:34
2

Check out Pentaho. It is basically an open-source, Java-based, SSIS-like product. After you have your Access db downloaded and extracted, this tool should do the rest to get your data loaded into Oracle. I've used it before with MySQL and it works very well, and it now has a bulk Oracle load component. Pentaho does not need to run on your db server either, so this may be the silver bullet you're looking for. The tradeoff is that there will be some ramp-up time.

Another potential option, assuming your Oracle server is running Windows Server and you can land the Access file somewhere local on that server, or in a place where the server can read it, is to do the following.

  1. Create a C# app to download, extract, and copy the Access db to an Oracle-accessible location.
  2. Create a local System DSN (ODBC entry) on the db server pointing to the Access db.
  3. Configure and start the Oracle listener, HS, and connectivity settings.
  4. Add a database link.
  5. Join with the linked db and query (see the sketch below).

This article outlines the entire process of configuring an Oracle db link to Access.
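
Once the link is in place, the join and load can run as a single set-based statement inside Oracle, so nothing is shipped row by row through the application. A sketch with ODP.NET; the link name access_link and all table/column names here are hypothetical:

// Sketch: one INSERT..SELECT across the db link; all names are placeholders.
using Oracle.DataAccess.Client;

const string sql =
    @"INSERT INTO local_target (id, name, amount)
      SELECT a.id, a.name, o.amount
        FROM remote_table@access_link a
        JOIN local_table o ON o.id = a.id";

using (var conn = new OracleConnection("User Id=...;Password=...;Data Source=oracle_sid"))
using (var cmd = new OracleCommand(sql, conn))
{
    conn.Open();
    int rows = cmd.ExecuteNonQuery();   // the join runs entirely inside Oracle
}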

James
1

I would create an application, a web page, or a WCF service that can be called to download the file, unzip it, and insert the records directly into Oracle row by row. I would try to keep all the data in Oracle as much as possible: get it and save it. I wouldn't mess around with the zip file much once it is downloaded.

Basically:

Data -> Processor application -> Oracle

This way things can be automated. I would do the processor in C#; a row-by-row sketch follows below.
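
A minimal sketch of that row-by-row processor core, assuming ODP.NET (Oracle.DataAccess) and the Jet OLEDB provider; connection strings, table, and column names are placeholders:

// Sketch: read the Access db and insert into Oracle row by row,
// as this answer suggests. All names below are placeholders.
using System.Data.OleDb;
using Oracle.DataAccess.Client;

using (var src = new OleDbConnection(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\temp\bigdb.mdb;"))
using (var dst = new OracleConnection("User Id=...;Password=...;Data Source=oracle_sid"))
using (var select = new OleDbCommand("SELECT id, name FROM [YourTable]", src))
using (var insert = new OracleCommand("INSERT INTO tbltest (id, name) VALUES (:id, :name)", dst))
{
    src.Open();
    dst.Open();
    insert.BindByName = true;
    insert.Parameters.Add("id", OracleDbType.Int32);
    insert.Parameters.Add("name", OracleDbType.Varchar2);

    using (var reader = select.ExecuteReader())
    {
        while (reader.Read())
        {
            insert.Parameters["id"].Value = reader["id"];
            insert.Parameters["name"].Value = reader["name"];
            insert.ExecuteNonQuery();   // one round trip per row: simple, but slow for millions of rows
        }
    }
}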

iefpw
1

As I asked in the comments, it is possible to create a DBLink from Oracle to another database. So I would create an application to download and unzip the file to a specific area (an area that your Oracle server can read). Then I would set up a DBLink to this file following these directions https://forums.oracle.com/thread/407779 or as in AskTom http://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:4406709207206

and then import the data directly from Oracle, reading the MS Access db.

Jorge Campos
1

I have two PowerShell scripts; the first one helps me download FTP files from China:

Download FTP script

# Parameters
param([string]$url, [string]$user, [string]$pass, [string]$folder, [string]$fecpro)


if (!$fecpro)
{
    $date = Get-Date
    $date = $date.AddDays(-1)
    $fecpro = '{0:yyyyMMdd}' -f $date
}

$FtpServer     = "ftp://$url/$fecpro"
$Username      = $user
$Password      = $pass
$RemoteFolder  = $FtpServer
if (!$folder)
{
    $LocalFolder   = $pwd.path + "\"
} 
else 
{
    $LocalFolder   = $folder + "\"
    if (-not(Test-Path -Path $LocalFolder))
    {
        New-Item -ItemType Directory -Force -Path $LocalFolder
    }
}
Write-Host "Searching *.zip files from $RemoteFolder..."

$existfiles = gci $LocalFolder | Where-Object { $_.name -like "*.zip" }

$cntFile = 0
$totFiles = 1
# allocate $totFiles + 1 slots so the 1-based indexing below works
$listFiles = @(0) * ($totFiles + 1)
$listFiles[1] = "cs2-" + $fecpro + "-6004-0000-1.zip"

do
{
    try
    {
        $f_exist = $false
        $cntFile = $cntFile + 1
        if ($existfiles) {
            foreach($file in $existfiles) {
                if ($file.name -eq $listFiles[$cntFile]) {
                    $f_exist = $true
                }
            }
        }

        if (!$f_exist) {
            $ftpfile = $RemoteFolder + "/" + $listFiles[$cntFile]
            $FTPRequest1 = [System.Net.FtpWebRequest]::Create($ftpfile)
            $FTPRequest1.Credentials = New-Object System.Net.NetworkCredential($Username,$Password)
            $FTPRequest1.Method = [System.Net.WebRequestMethods+Ftp]::DownloadFile
            $FTPRequest1.UsePassive = $true
            $FTPRequest1.UseBinary = $true
            $FTPRequest1.KeepAlive = $false
            $destfile = $LocalFolder + $listFiles[$cntFile]
            $file = New-Object IO.FileStream ($destfile, [IO.FileMode]::Create)
            $FTPResponse = $FTPRequest1.GetResponse()
            $ResponseStream = $FTPResponse.GetResponseStream()
            [byte[]] $buffer = New-Object byte[] 32768   # 32 KB buffer
            $read = $ResponseStream.Read($buffer, 0, $buffer.Length)
            while ($read -ne 0) {
                $file.Write($buffer, 0, $read)           # write before reading again, so no chunk is lost
                $read = $ResponseStream.Read($buffer, 0, $buffer.Length)
            }
            Write-Host "File $ftpfile is downloaded in local folder."
            $file.Close()
            $ResponseStream.Close()
            $FTPResponse.Close()

        } 
        else {
                Write-Host "The file $name already exists in local folder."
        }
    }
    catch
    {
        Write-Host "Error: $($_.Exception.Message)"
        if ($file) {
            $file.Close()
        }
    }
}
while ($cntFile -lt $totFiles)

My second script unzips the files using the unzip.exe that comes with the Oracle client (%ORACLE_HOME%\product\10.2.0\client_1\BIN).

Unzip files script

param([string]$folder_origen, [string]$folder_destino, [string]$fecpro)

$P_FOLDER = $folder_origen + "\"
$P_DESTINO = $folder_destino + "\"

if (!$fecpro)
{
    $date = Get-Date
    $date = $date.AddDays(-1)
    $fecpro = '{0:yyyyMMdd}' -f $date
}

Write-Host "Searching *.zip files in $P_FOLDER"

# Main program
$LocalFolder = $P_FOLDER + "*$fecpro*.zip"
$filelocation = gci $LocalFolder
foreach ($file in $filelocation){
    if (($file) -and ($file.length -gt 0kb))
    {
        $filename = $file.name.ToString()
        $split = $filename.Split(".")
        $dest = $P_DESTINO + $split[0]
        If (!(Test-Path $dest))
        { 
            New-Item $dest -Type Directory
        }
        $orifile = $P_FOLDER + $filename
        $time = "{0:hh_mm_ss.ffff}" -f (get-date)
        $prefile = "d." + (get-date).ToShortDateString() + ".t." + $time
        $prefile = $prefile -replace "/", "_"
        $filename = $prefile + ".f." + $filename
        $P_DESTZIP = $P_FOLDER + "proc" + "\"
        If (!(Test-Path $P_DESTZIP))
        { 
            New-Item $P_DESTZIP -Type Directory
        }
        $zipname = $orifile

        unzip -o -qq $zipname "*$fecpro*afsbjdtl.*" "*$fecpro*afgnled.bin" -d $dest

        Write-Host "$orifile was renamed and unzipped."
    } else
    {
        Write-Host "Zip files not found in the folder."
    }
}

I search for the files to extract with a pattern; I could also use filters like file size, name, or extension, and it's very fast.

Finally, about the load: I also suggest the LOAD DATA command (SQL*Loader); it has the best performance.

randiel
1

I'd say you had better use the tools that you know best. It is possible to do the same in Java and then load it into the database, or run it as a separate application, but why bother? I believe you know an efficient way to download/unzip/open an Access database in C#.

So only one thing is missing: how to efficiently upload your data from the Access database into the Oracle database.

The short answer is: bulk inserts, possibly in conjunction with temporary tables and some help from stored procedures for further bulk processing (a C# sketch is shown below). More speedup can be achieved with custom database/table tuning to fit your data better.

You can do the same in Java too, but I doubt it would be faster. Getting the buffering and bulk loading right matters much more than any other factor. After that, grab a DBA and bug them to help you with database tuning. The details are out of the scope of this question.
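
Since the question is C#-centric: ODP.NET does support bulk inserts via array binding, which sends one network round trip for many rows. A sketch, assuming Oracle.DataAccess; the connection string, table, and column names are placeholders:

// Sketch: ODP.NET array binding; the whole arrays go in one execute.
using Oracle.DataAccess.Client;

int[] ids = { 1, 2, 3 };
string[] names = { "a", "b", "c" };

using (var conn = new OracleConnection("User Id=...;Password=...;Data Source=oracle_sid"))
using (var cmd = new OracleCommand("INSERT INTO tbltest (id, name) VALUES (:id, :name)", conn))
{
    conn.Open();
    cmd.ArrayBindCount = ids.Length;   // number of rows inserted in one round trip
    cmd.Parameters.Add("id", OracleDbType.Int32, ids, System.Data.ParameterDirection.Input);
    cmd.Parameters.Add("name", OracleDbType.Varchar2, names, System.Data.ParameterDirection.Input);
    cmd.ExecuteNonQuery();
}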

Nagh
  • I will look into making bulk inserts; I don't know if it can be done directly from C#, or if I have to make a flat file, copy it to Oracle, and run SQL*Loader. And I know I won't gain much time with Java or .NET, but I was thinking of using Java programs directly in Oracle; that's why I thought about that. If I use a web service, Java or .NET would be the same thing, but using Java directly in Oracle through Java stored procedures or SQLJ might be faster. – Hector Sanchez Oct 25 '13 at 15:42
  • @Mr Actually it won't be much faster; only if you access the database through the network and the network is the bottleneck. There was a C++ library from Oracle that was able to do bulk inserts. Inside or outside of the database won't make much difference, but using BULK INSERTs will make your queries run several times faster, so dig in that direction. Fall back to C++ or Java if you find out that C# can't do BULK INSERTs. – Nagh Oct 25 '13 at 17:41
  • It can do it; I'm making a quick example to test that. I'll keep you posted, and thanks for your valuable time. It seems like bulk inserts are the thing that I was missing – Hector Sanchez Oct 25 '13 at 18:19
1

Excuse me if I'm stating the obvious, but if your main performance concern is the speed of the FTP/SFTP transfer of the Access db, then it might make sense to have the Access db periodically export the subset of the tables/records that you actually need in order to make the required updates to Oracle.

A few years back I found that adding records to Oracle using "Array DML" was about 10x as fast as adding a single record at a time. (Not sure whether that is supported in ODP.NET or not...)

Wonderbird