3

I'm quite new to Python, so any help will be appreciated. I am trying to extract and sort data from 2000 .mdb files using mdbtools on Linux. So far I have only been able to take an .mdb file and dump all of its tables to .csv. That creates a huge mess, since there are lots of files to process.

What I need is to extract particular sorted data from a particular table, for example the table called "Voltage". The table consists of numerous cycles, and each cycle has several rows. The cycles usually go in chronological order, but in some cases the timestamps get recorded with a delay, so the latest reading in a cycle is not necessarily its last row. I need to extract the row with the latest time in each cycle, for the first or last five cycles. For example, in the table below I would need the second row (time 101.34).

Cycle#    Time        Data
  1      100.59        34
  1      101.34        54
  1       98.78        45
  2        ...
  2        ...
  2        ...

Here is the script I use. I currently run it as python extract.py table_files.mdb, but I would like to invoke it simply as ./extract.py, with the path to the files hard-coded in the script itself.

import sys, subprocess, os

DATABASE = sys.argv[1]

subprocess.call(["mdb-schema", DATABASE, "mysql"])

# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", DATABASE],
                               stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()

print "BEGIN;" # start a transaction, speeds things up when importing
sys.stdout.flush()

# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
    if table != '':
        filename = table.replace(" ", "_") + ".csv"
        print("Dumping " + table)
        contents = subprocess.Popen(["mdb-export", DATABASE, table],
                                    stdout=subprocess.PIPE).communicate()[0]
        with open(filename, 'w') as f:  # avoid shadowing the builtin "file"
            f.write(contents)
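A minimal sketch of what I'm after would be something like the following, with MDB_DIR as a placeholder for the hard-coded path:

```python
#!/usr/bin/env python
import glob
import os
import subprocess

# Sketch: the same dump loop, but driven by a hard-coded glob instead of
# sys.argv, so the script can be run as ./extract.py.
# MDB_DIR is a placeholder path, not a real location.
MDB_DIR = "/path/to/mdb/files"

def csv_name(table):
    # Convert " " in table names to "_" for the CSV filename.
    return table.replace(" ", "_") + ".csv"

def dump_tables(database):
    tables = subprocess.check_output(["mdb-tables", "-1", database]).decode().splitlines()
    for table in tables:
        if table:
            print("Dumping " + table)
            contents = subprocess.check_output(["mdb-export", database, table]).decode()
            with open(csv_name(table), "w") as f:
                f.write(contents)

if __name__ == "__main__":
    for database in glob.glob(os.path.join(MDB_DIR, "*.mdb")):
        dump_tables(database)
```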
Gord Thompson
Pasha
  • I suggest importing the result of `mdb-export` into a SQLite database which you can then query easily. – Pedro Romano Jun 15 '13 at 17:22
  • OP's code above came from http://okfnlabs.org/handbook/data/patterns/liberating-access-databases/ and it works just fine except in Python 3, because Python 3 does not use bytes for default strings as Python 2 does. For Python 3, you'll have to add something like `.decode("utf-8")` to the line with `communicate()[0]` e.g. `communicate()[0].decode("utf-8")` – jspinella Aug 24 '21 at 18:43
  • To add to some of the answers already here, working with accdb/mdb files is kind of a PITA. It seems like nothing (e.g. Pandas) supports it, at least without Microsoft's proprietary driver on Windows or a third-party library like `mdb-tools` on linux/unix... so best approach with Access files is to convert them to something more standard like CSV or SQL to then do your filtering/sorting/transformations/etc. – jspinella Aug 24 '21 at 18:47

3 Answers

3

Personally, I wouldn't spend a whole lot of time fussing around trying to get mdbtools, unixODBC and pyodbc to work together. As Pedro suggested in his comment, if you can get mdb-export to dump the tables to CSV files then you'll probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e., something that will be more robust than using mdbtools on the Linux platform.

A few suggestions:

  1. Given the sheer number of .mdb files (and hence .csv files) involved, you'll probably want to import the CSV data into one big table with an additional column to indicate the source filename. That will be much easier to manage than ~2000 separate tables.

  2. When creating your target table in the new database you'll probably want to use a decimal (as opposed to float) data type for the [Time] column.

  3. At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.
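Suggestion 1 can be sketched with the standard-library csv and sqlite3 modules. Note that SQLite has no true DECIMAL type, so [Time] ends up as REAL here; the target column names, the mdb-export header "Cycle#", and the db/glob paths are all assumptions based on the sample table:

```python
import csv
import glob
import sqlite3

# Sketch: load every exported CSV into one SQLite table, tagging each
# row with its source filename (suggestion 1), and renaming the
# "Cycle#" header to plain "Cycle" on the way in (suggestion 3).
def load_csvs(db_path, csv_glob):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Voltage "
        "(SourceFile TEXT, Cycle INTEGER, Time REAL, Data REAL)"
    )
    for path in glob.glob(csv_glob):
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            rows = [(path, r["Cycle#"], r["Time"], r["Data"]) for r in reader]
        conn.executemany("INSERT INTO Voltage VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn
```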

Finally, to select the "last" reading (largest [Time] value) for a given [SourceFile] and [Cycle] you can use a query something like this:

SELECT
    v1.SourceFile, 
    v1.Cycle,
    v1.Time, 
    v1.Data 
FROM 
    Voltage v1 
    INNER JOIN 
    (
        SELECT
            SourceFile, 
            Cycle, 
            MAX([Time]) AS MaxTime 
        FROM Voltage 
        GROUP BY SourceFile, Cycle
    ) v2 
        ON v1.SourceFile=v2.SourceFile 
           AND v1.Cycle=v2.Cycle 
           AND v1.Time=v2.MaxTime
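If the data ends up in SQLite, the query above runs unchanged from Python. A self-contained demonstration, with a few hypothetical sample rows standing in for the imported data:

```python
import sqlite3

# Hypothetical setup mirroring the question's sample data, so the
# MAX-time query can be demonstrated end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Voltage (SourceFile TEXT, Cycle INTEGER, Time REAL, Data REAL)")
conn.executemany(
    "INSERT INTO Voltage VALUES (?, ?, ?, ?)",
    [("a.mdb", 1, 100.59, 34), ("a.mdb", 1, 101.34, 54), ("a.mdb", 1, 98.78, 45)],
)

query = """
SELECT v1.SourceFile, v1.Cycle, v1.Time, v1.Data
FROM Voltage v1
INNER JOIN (
    SELECT SourceFile, Cycle, MAX([Time]) AS MaxTime
    FROM Voltage
    GROUP BY SourceFile, Cycle
) v2
  ON v1.SourceFile = v2.SourceFile
 AND v1.Cycle = v2.Cycle
 AND v1.Time = v2.MaxTime
"""
# Returns the row with the largest Time per (SourceFile, Cycle).
rows = list(conn.execute(query))
print(rows)
```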
Gord Thompson
  • OMG, I am so lost. Being new to Python and programming in general makes things look too complicated. Basically I do not need to export final results into new database. All I need is a sorted .CSV file that I can easily access to produce plots, perform calculations, etc. – Pasha Jun 17 '13 at 16:02
  • I understand, but my experience with `mdbtools` is that it *sort of* works, *sometimes*, if you're lucky. I even tried doing a quick test in Linux with `mdbtools` and `pyodbc` using your sample data and the results were not encouraging at all. What you propose would be fairly easy to accomplish on a Windows machine (hence avoiding `mdbtools`), but it is not so straightforward on Linux. So, if you *must* do the work in Linux you could either **(a)** use Python to scan the CSV files themselves, or **(b)** dump the CSV files into another database and then sort/select from there. – Gord Thompson Jun 17 '13 at 17:04
  • For now I just need to save one or two tables from .mdb file into .csv. Then I can sort those files one by one using glob function or similar. The code I used above dumps the tables fairly well. BTW how can I do the same thing on Windows machine? – Pasha Jun 17 '13 at 21:59
2

To bring the data directly into Pandas in Python 3, I wrote this little snippet:

import sys, subprocess, os
from io import StringIO
import pandas as pd
VERBOSE = True
def mdb_to_pandas(database_path):
    subprocess.call(["mdb-schema", database_path, "mysql"])
    # Get the list of table names with "mdb-tables"
    table_names = subprocess.Popen(["mdb-tables", "-1", database_path],
                                   stdout=subprocess.PIPE).communicate()[0]
    tables = table_names.splitlines()
    sys.stdout.flush()
    # Dump each table as a stringio using "mdb-export",
    out_tables = {}
    for rtable in tables:
        table = rtable.decode()
        if VERBOSE: print('running table:',table)
        if table != '':
            if VERBOSE: print("Dumping " + table)
            contents = subprocess.Popen(["mdb-export", database_path, table],
                                        stdout=subprocess.PIPE).communicate()[0]
            temp_io = StringIO(contents.decode())
            out_tables[table] = pd.read_csv(temp_io)
    return out_tables
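Once a table is in a DataFrame, the "latest row per cycle" selection from the question becomes a one-liner; the frame below is a hypothetical stand-in shaped like the sample table:

```python
import pandas as pd

# Hypothetical frame shaped like the question's Voltage table.
df = pd.DataFrame({
    "Cycle": [1, 1, 1, 2, 2],
    "Time": [100.59, 101.34, 98.78, 103.2, 104.1],
    "Data": [34, 54, 45, 20, 21],
})

# For each cycle, keep the row with the largest Time value.
latest = df.loc[df.groupby("Cycle")["Time"].idxmax()]
print(latest)
```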
kmader
0

There's an alternative to mdbtools for Python: JayDeBeApi with the UcanAccess driver. It uses a Python-to-Java bridge, which slows things down, but I've been using it with considerable success and it comes with decent error handling.

It takes some practice setting it up, but if you have a lot of databases to wrangle, it's well worth it.

Rein