Topological sort for random data generator for relational database

Question

I am developing a random data generator for relational database and for tables there are no relationships the generator behaves correctly, however where there are tables that have keys, the generator breaks down.

I researched a bit and found that a topological sorting algorithm could solve the problems, but I'm not able to associate that idea with my algorithm.

Topology ordering algorithm (http://blog.jupo.org/2012/04/06/topological-sorting-acyclic-directed-graphs/)

def topolgical_sort(graph_unsorted):
    """
    Repeatedly go through all of the nodes in the graph, moving each of
    the nodes that has all its edges resolved, onto a sequence that
    forms our sorted graph. A node has all of its edges resolved and
    can be moved once all the nodes its edges point to, have been moved
    from the unsorted graph onto the sorted one.
    """

    # This is the list we'll return, that stores each node/edges pair
    # in topological order.
    graph_sorted = []

    # Convert the unsorted graph into a hash table. This gives us
    # constant-time lookup for checking if edges are unresolved, and
    # for removing nodes from the unsorted graph.
    graph_unsorted = dict(graph_unsorted)

    # Run until the unsorted graph is empty.
    while graph_unsorted:

        # Go through each of the node/edges pairs in the unsorted
        # graph. If a set of edges doesn't contain any nodes that
        # haven't been resolved, that is, that are still in the
        # unsorted graph, remove the pair from the unsorted graph,
        # and append it to the sorted graph. Note here that by using
        # using the items() method for iterating, a copy of the
        # unsorted graph is used, allowing us to modify the unsorted
        # graph as we move through it. We also keep a flag for
        # checking that that graph is acyclic, which is true if any
        # nodes are resolved during each pass through the graph. If
        # not, we need to bail out as the graph therefore can't be
        # sorted.
        acyclic = False
        for node, edges in list(graph_unsorted.items()):
            for edge in edges:
                if edge in graph_unsorted:
                    break
            else:
                acyclic = True
                del graph_unsorted[node]
                graph_sorted.append((node, edges))

        if not acyclic:
            # Uh oh, we've passed through all the unsorted nodes and
            # weren't able to resolve any of them, which means there
            # are nodes with cyclic edges that will never be resolved,
            # so we bail out with an error.
            raise RuntimeError("A cyclic dependency occurred")

    return graph_sorted

Algorithm developed

def main():
try:
    config = json.loads(open('config.json').read())

    sb = StringBuilder()
    sb.append("host=%s " % config['host'])
    sb.append("dbname=%s " % config['dbname'])
    sb.append("user=%s " % config['user'])
    sb.append("password=%s " % config['password'])
    sb.append("port=%s " % config['port'])

    conn = psycopg2.connect(sb.to_string())
    cur = conn.cursor()

    # Getting all tables from database
    sb.clear()
    sb.append("SELECT table_name ")
    sb.append("FROM information_schema.tables ")
    sb.append("WHERE ")
    sb.append("table_type = 'BASE TABLE' ")
    sb.append("AND ")
    sb.append("table_schema NOT IN ('pg_catalog', 'information_schema')")

    cur.execute(sb.to_string())

    for table_name in cur.fetchall():
        sb.clear()
        sb.append("SELECT ")
        sb.append("COLUMN_NAME, DATA_TYPE, ")
        sb.append("IS_NULLABLE, CHARACTER_MAXIMUM_LENGTH ")
        sb.append("FROM INFORMATION_SCHEMA.COLUMNS ")
        sb.append("WHERE TABLE_NAME = '%s'" % table_name[0])

        cur.execute(sb.to_string())

        row = cur.fetchall()

        for _ in xrange(1, config['number_inserts'] + 1):
            sb.clear()
            sb.append("INSERT INTO %s(" % table_name[0])

            for columns in row:
                rdgrd = Rdgrd(Attr(columns[0], columns[1], columns[
                    2], columns[3]))

                if rdgrd.datatype_is_supported() == 'jump':
                    continue
                else:
                    sb.append("%s," % columns[0])

            _sb = sb.to_string()[:-1]
            sb.clear()
            sb.append(_sb)
            sb.append(") VALUES(")

            for columns in row:
                rdgrd = Rdgrd(Attr(columns[0], columns[1], columns[
                    2], columns[3]))

                if rdgrd.datatype_is_supported() == 'jump':
                    continue
                else:
                    sb.append("'%s'," % rdgrd.generate_data())

            _sb = sb.to_string()[:-1]
            sb.clear()
            sb.append(_sb)
            sb.append(")")

            print sb

            cur.execute(sb.to_string())

    conn.commit()
    conn.close()

except psycopg2.OperationalError as e:
    print e
    logging.error(e)
    sys.exit(1)
except Exception as e:
    print e
    logging.error(e)
    sys.exit(1)

Any suggestions are welcome.

There is a lot missing. 1) the ANSI information_schema.tables also contains table_schema 2) you can obtain the FK-constraints (the graph) from the meta-schema, too (constraint_column_usage,constraint_table_usage, maybe a few others) 3) you should set the constraints to DEFERRED before any data 4) you should (probably) put allyour additions to the tables between BEGIN and END to make them occur in a single transaction. 5) inserting with `VALUES()` is cumbersome for larger amounts of data. 6) ifyou only target postgres DBMS, you could use pg_catalog instead of ANSI meta-schema.(is more complete) — wildplasser, Jun 19 '17 at 20:21
The problem is: this is not really aquestion for SO, but your attempt looks good. Youchould *really* make a choicebetween imposing the structurefrom the client side(which you do) or fetching it from the actual catalogs (which is *hard*). Your implementation is somewhat hybrid, IMHO. — wildplasser, Jun 19 '17 at 20:42
Analyzing the problem, I'd better go to the client side (which you do). Maybe this will make it easier to work out the algorithm. Thank you. — Ronald Araújo, Jun 19 '17 at 20:45
Your title describes your project; it should describe your question. Nb: Tables/relations represent relationhips/associations; PKs & FKs are different things; "key" alone means either relational PK, candidate key or superkey or SQL PK/UNIQUE (which can mean relational PK or superkey); it never means FK; FKs get called "relationships" but are constraints. PS You want an answer related to a transitive closures. Read about [SQL hierarchies & CTEs](https://stackoverflow.com/questions/4048151/what-are-the-options-for-storing-hierarchical-data-in-a-relational-database). — philipxy, Jun 20 '17 at 04:02
PS Don't forget to research [sql & database topological sort & other graph algorithms](https://www.fusionbox.com/blog/detail/graph-algorithms-in-a-database-recursive-ctes-and-topological-sort-with-postgres/620/) including [stackoverflow](https://stackoverflow.com/q/44076722/3404097). Remember that algorithmic costs depend on the execution model--and given your data is in a relational database/DBMS (and possibly given a certain implementation's representations/optimizations) an appropriate algorithm can be very different than for a typical 3gl virtual machine. — philipxy, Jun 20 '17 at 22:39

Topological sort for random data generator for relational database

0 Answers0