For this project, I intend to make a web version, and I am currently working on a PostgreSQL (9.x) backend that the webapp will query.

Right now, what happens is that the tracer generates a zip file with two CSVs in it, which is then loaded into an H2 database at runtime whose schema is this (and yes, I'm aware that the SQL could be written a little better):

create table matchers (
    id integer not null,
    class_name varchar(255) not null,
    matcher_type varchar(30) not null,
    name varchar(1024) not null
);

alter table matchers add primary key(id);

create table nodes (
    id integer not null,
    parent_id integer not null,
    level integer not null,
    success integer not null,
    matcher_id integer not null,
    start_index integer not null,
    end_index integer not null,
    time bigint not null
);

alter table nodes add primary key(id);
alter table nodes add foreign key (matcher_id) references matchers(id);
create index nodes_parent_id on nodes(parent_id);
create index nodes_indices on nodes(start_index, end_index);

Now, since the PostgreSQL database will be able to handle more than one trace, I had to add a further table; the schema on the PostgreSQL backend looks like this (less-than-average SQL alert as well; also, in the parse_info table, the content column contains the full text of the parsed file, which in the zip file is stored separately):

create table parse_info (
    id uuid primary key,
    date timestamp not null,
    content text not null
);

create table matchers (
    parse_info_id uuid references parse_info(id),
    id integer not null,
    class_name varchar(255) not null,
    matcher_type varchar(30) not null,
    name varchar(1024) not null,
    unique (parse_info_id, id)
);

create table nodes (
    parse_info_id uuid references parse_info(id),
    id integer not null,
    parent_id integer not null,
    level integer not null,
    success integer not null,
    matcher_id integer not null,
    start_index integer not null,
    end_index integer not null,
    time bigint not null,
    unique (parse_info_id, id)
);

alter table nodes add foreign key (parse_info_id, matcher_id)
    references matchers(parse_info_id, id);
create index nodes_parent_id on nodes(parent_id);
create index nodes_indices on nodes(start_index, end_index);

Now, what I am currently doing is taking existing zip files and inserting them into a PostgreSQL database; I'm using jOOQ and its CSV loading API.

The process is a little complicated... Here are the current steps:

  • a UUID is generated;
  • I read the necessary info from the zip (parse date, input text) and write the record in the parse_info table;
  • I create temporary copies of the CSVs so that the jOOQ loading API can use them (see after the code extract as to why);
  • I insert all matchers, then all nodes.

Here is the code:

public final class Zip2Db2
{
    private static final Pattern SEMICOLON = Pattern.compile(";");
    private static final Function<String, String> CSV_ESCAPE
        = TraceCsvEscaper.ESCAPER::apply;

    // Paths in the zip to the different components
    private static final String INFO_PATH = "/info.csv";
    private static final String INPUT_PATH = "/input.txt";
    private static final String MATCHERS_PATH = "/matchers.csv";
    private static final String NODES_PATH = "/nodes.csv";

    // Fields to use for matchers zip insertion
    private static final List<Field<?>> MATCHERS_FIELDS = Arrays.asList(
        MATCHERS.PARSE_INFO_ID, MATCHERS.ID, MATCHERS.CLASS_NAME,
        MATCHERS.MATCHER_TYPE, MATCHERS.NAME
    );

    // Fields to use for nodes zip insertion
    private static final List<Field<?>> NODES_FIELDS = Arrays.asList(
        NODES.PARSE_INFO_ID, NODES.PARENT_ID, NODES.ID, NODES.LEVEL,
        NODES.SUCCESS, NODES.MATCHER_ID, NODES.START_INDEX, NODES.END_INDEX,
        NODES.TIME
    );

    private final FileSystem fs;
    private final DSLContext jooq;
    private final UUID uuid;

    private final Path tmpdir;

    public Zip2Db2(final FileSystem fs, final DSLContext jooq, final UUID uuid)
        throws IOException
    {
        this.fs = fs;
        this.jooq = jooq;
        this.uuid = uuid;

        tmpdir = Files.createTempDirectory("zip2db");
    }

    public void removeTmpdir()
        throws IOException
    {
        // From java7-fs-more (https://github.com/fge/java7-fs-more)
        MoreFiles.deleteRecursive(tmpdir, RecursionMode.KEEP_GOING);
    }

    public void run()
    {
        time(this::generateMatchersCsv, "Generate matchers CSV");
        time(this::generateNodesCsv, "Generate nodes CSV");
        time(this::writeInfo, "Write info record");
        time(this::writeMatchers, "Write matchers");
        time(this::writeNodes, "Write nodes");
    }

    private void generateMatchersCsv()
        throws IOException
    {
        final Path src = fs.getPath(MATCHERS_PATH);
        final Path dst = tmpdir.resolve("matchers.csv");

        try (
            final Stream<String> lines = Files.lines(src);
            final BufferedWriter writer = Files.newBufferedWriter(dst,
                StandardOpenOption.CREATE_NEW);
        ) {
            // Throwing below is from throwing-lambdas
            // (https://github.com/fge/throwing-lambdas)
            lines.map(this::toMatchersLine)
                .forEach(Throwing.consumer(writer::write));
        }
    }

    private String toMatchersLine(final String input)
    {
        final List<String> parts = new ArrayList<>();
        parts.add('"' + uuid.toString() + '"');
        Arrays.stream(SEMICOLON.split(input, 4))
            .map(s -> '"' + CSV_ESCAPE.apply(s) + '"')
            .forEach(parts::add);
        return String.join(";", parts) + '\n';
    }

    private void generateNodesCsv()
        throws IOException
    {
        final Path src = fs.getPath(NODES_PATH);
        final Path dst = tmpdir.resolve("nodes.csv");

        try (
            final Stream<String> lines = Files.lines(src);
            final BufferedWriter writer = Files.newBufferedWriter(dst,
                StandardOpenOption.CREATE_NEW);
        ) {
            lines.map(this::toNodesLine)
                .forEach(Throwing.consumer(writer::write));
        }
    }

    private String toNodesLine(final String input)
    {
        final List<String> parts = new ArrayList<>();
        parts.add('"' + uuid.toString() + '"');
        SEMICOLON.splitAsStream(input)
            .map(s -> '"' + CSV_ESCAPE.apply(s) + '"')
            .forEach(parts::add);
        return String.join(";", parts) + '\n';
    }

    private void writeInfo()
        throws IOException
    {
        final Path path = fs.getPath(INFO_PATH);

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            final String[] elements = SEMICOLON.split(reader.readLine());

            final long epoch = Long.parseLong(elements[0]);
            final Instant instant = Instant.ofEpochMilli(epoch);
            final ZoneId zone = ZoneId.systemDefault();
            final LocalDateTime time = LocalDateTime.ofInstant(instant, zone);

            final ParseInfoRecord record = jooq.newRecord(PARSE_INFO);

            record.setId(uuid);
            record.setContent(loadText());
            record.setDate(Timestamp.valueOf(time));

            record.insert();
        }
    }

    private String loadText()
        throws IOException
    {
        final Path path = fs.getPath(INPUT_PATH);

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            return CharStreams.toString(reader);
        }
    }

    private void writeMatchers()
        throws IOException
    {
        final Path path = tmpdir.resolve("matchers.csv");

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            jooq.loadInto(MATCHERS)
                .onErrorAbort()
                .loadCSV(reader)
                .fields(MATCHERS_FIELDS)
                .separator(';')
                .execute();
        }
    }

    private void writeNodes()
        throws IOException
    {
        final Path path = tmpdir.resolve("nodes.csv");

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            jooq.loadInto(NODES)
                .onErrorAbort()
                .loadCSV(reader)
                .fields(NODES_FIELDS)
                .separator(';')
                .execute();
        }
    }

    private void time(final ThrowingRunnable runnable, final String description)
    {
        System.out.println(description + ": start");
        final Stopwatch stopwatch = Stopwatch.createStarted();
        runnable.run();
        System.out.println(description + ": done (" + stopwatch.stop() + ')');
    }

    public static void main(final String... args)
        throws IOException
    {
        if (args.length != 1) {
            System.err.println("missing zip argument");
            System.exit(2);
        }

        final Path zip = Paths.get(args[0]).toRealPath();

        final UUID uuid = UUID.randomUUID();
        final DSLContext jooq = PostgresqlTraceDbFactory.defaultFactory()
            .getJooq();

        try (
            final FileSystem fs = MoreFileSystems.openZip(zip, true);
        ) {
            final Zip2Db2 zip2Db = new Zip2Db2(fs, jooq, uuid);
            try {
                zip2Db.run();
            } finally {
                zip2Db.removeTmpdir();
            }
        }
    }
}

Now, here is my first problem... It is much slower than loading into H2. Here is a timing for a zip containing 620 matchers and 45746 nodes:

Generate matchers CSV: start
Generate matchers CSV: done (45.26 ms)
Generate nodes CSV: start
Generate nodes CSV: done (573.2 ms)
Write info record: start
Write info record: done (311.1 ms)
Write matchers: start
Write matchers: done (4.192 s)
Write nodes: start
Write nodes: done (22.64 s)

Give or take, and forgetting the part about writing specialized CSVs (see below), that is 25 seconds. Loading this into an on-the-fly, disk-based H2 database takes less than 5 seconds!

The other problem I have is that I have to write dedicated CSVs; it appears that the CSV loading API is not really flexible in what it accepts, and, for instance, I have to turn this line:

328;SequenceMatcher;COMPOSITE;token

into this:

"some-randome-uuid-here";"328";"SequenceMatcher";"COMPOSITE";"token"

But my biggest problem is in fact that this zip is pretty small. For instance, I have a zip with not 620 but 1532 matchers, and not 45746 but more than 34 million nodes. Even if we dismiss the CSV generation time (the original nodes CSV is 1.2 GiB), H2 injection of that zip takes 20 minutes; multiplying this by 5 gives around 1 hour and 40 minutes, which is a lot!

All in all, the process is quite inefficient at the moment...


Now, in the defence of PostgreSQL:

  • constraints on the PostgreSQL schema are much stricter than those on the H2 schema: the generated zip files don't need a UUID, so the H2 schema has neither the extra column nor the composite keys;
  • H2 is tuned "insecurely" for writes: jdbc:h2:/path/to/db;LOG=0;LOCK_MODE=0;UNDO_LOG=0;CACHE_SIZE=131072.

Still, this difference in insertion times seems a little excessive, and I am quite sure that it can be better. But I don't know where to start.

Also, I am aware that PostgreSQL has a dedicated mechanism to load from CSVs, but here the CSVs are in a zip file to start with, and I'd really like to avoid having to create a dedicated CSV as I am currently doing... Ideally I'd like to read line by line from the zip directly (which is what I do for H2 injection), transform the line and write into the PostgreSQL schema.

Finally, I am also aware that I currently do not disable constraints on the PostgreSQL schema before insertion; I have yet to try this (will it make a difference?).

So, what do you suggest I do to improve the performance?


2 Answers


The fastest way to do a bulk insert from a CSV file into PostgreSQL is with COPY. The COPY command is optimized for inserting large numbers of rows.

With Java, you can use the COPY implementation of the PostgreSQL JDBC driver.

There is a nice small example of how to use it here: how to copy a data from file to PostgreSQL using JDBC?

If you have a CSV with headers, you would want to run a command similar to this:

\COPY mytable FROM '/tmp/mydata.csv' DELIMITER ';' CSV HEADER
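
In this question's case, where the CSV lines sit inside a zip and need the UUID prepended, the same COPY FROM STDIN idea can also be driven from Java through the JDBC driver's CopyManager, without writing temporary files. Something along these lines (untested sketch: the class name is made up, and the line transformation is deliberately naive; it would still need the escaping done by the question's TraceCsvEscaper):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.Files;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.UUID;
import java.util.stream.Stream;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyIn;
import org.postgresql.copy.CopyManager;

public final class CopyMatchersFromZip
{
    // Column list matches the matchers table from the question's schema
    private static final String COPY_SQL
        = "COPY matchers (parse_info_id, id, class_name, matcher_type, name)"
        + " FROM STDIN WITH (FORMAT csv, DELIMITER ';')";

    public static void copyMatchers(final Connection connection,
        final FileSystem zipFs, final UUID uuid)
        throws IOException, SQLException
    {
        final CopyManager manager
            = connection.unwrap(PGConnection.class).getCopyAPI();
        final CopyIn copyIn = manager.copyIn(COPY_SQL);

        try (
            final Stream<String> lines
                = Files.lines(zipFs.getPath("/matchers.csv"));
        ) {
            // Prepend the UUID to each line and push it straight to COPY;
            // real code would also escape quotes/delimiters in the fields.
            lines.map(line -> uuid + ";" + line + "\n")
                .map(line -> line.getBytes(StandardCharsets.UTF_8))
                .forEach(bytes -> write(copyIn, bytes));
            copyIn.endCopy();
        } catch (final RuntimeException e) {
            copyIn.cancelCopy();
            throw e;
        }
    }

    private static void write(final CopyIn copyIn, final byte[] bytes)
    {
        try {
            copyIn.writeToCopy(bytes, 0, bytes.length);
        } catch (final SQLException e) {
            // wrap the checked exception so it can be thrown from the lambda
            throw new RuntimeException(e);
        }
    }
}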

Another performance boost when you are adding large amounts of data to an existing table is to drop the indexes, insert the data, and then recreate the indexes.
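
For the nodes table from the question, that could be as simple as the following sketch (plain JDBC, where connection is an open java.sql.Connection; the index definitions are copied from the question's schema):

try (final Statement statement = connection.createStatement()) {
    statement.execute("drop index if exists nodes_parent_id");
    statement.execute("drop index if exists nodes_indices");
}

// ... bulk load the nodes here, e.g. with a COPY like the sketch above ...

try (final Statement statement = connection.createStatement()) {
    statement.execute("create index nodes_parent_id on nodes(parent_id)");
    statement.execute("create index nodes_indices on nodes(start_index, end_index)");
}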

  • Hmmm, I wasn't aware of that, but then again my CSVs are in a zip to start with and not of the exact form expected by PostgreSQL. I see there is an option to copy from stdin though... – fge Apr 20 '15 at 14:06
  • Also, isn't there a way to _disable_ constraints instead of dropping them entirely on such bulk insertions? – fge Apr 20 '15 at 14:14

Here are a couple of measures you can take:

Upgrade to jOOQ 3.6

In jOOQ 3.6, there are two new modes in the Loader API: bulk loading and batch loading.

Using these techniques has been observed to speed up loading significantly, by orders of magnitude. See also this article about JDBC batch loading performance.

Keep UNDO / REDO logs small

You currently load everything in one huge transaction (or you use auto-commit, but that's not good, either). This is bad for large loads, because the database needs to keep track of all the insertions in your insert session to be able to roll them back if needed.

This gets even worse when you're doing that on a live system, where such large loads generate lots of contention.

jOOQ's Loader API allows you to specify the "commit" size via LoaderOptionsStep.commitAfter(int).
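
Applied to the question's writeNodes(), that could look roughly like the sketch below; commitAfter() already exists in the version the question uses, bulkAfter() and batchAfter() assume jOOQ 3.6+, and the chunk sizes are arbitrary placeholders:

jooq.loadInto(NODES)
    .onErrorAbort()
    .bulkAfter(500)      // group rows into multi-row inserts (jOOQ 3.6+)
    .batchAfter(10)      // group those inserts into JDBC batches (jOOQ 3.6+)
    .commitAfter(100)    // commit regularly instead of in one huge transaction
    .loadCSV(reader)
    .fields(NODES_FIELDS)
    .separator(';')
    .execute();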

Turn off logging and constraints entirely

This is only possible if you're loading stuff offline, but it can drastically speed up loading if you turn off logging entirely in your database (for that table), and if you turn off constraints while loading, turning them on again after the load.

Finally, I am also aware that I currently do not disable constraints on the PostgreSQL schema before insertion; I have yet to try this (will it make a difference?).

Oh yes it will. Specifically, the unique constraint costs a lot on each single insertion, as it has to be maintained all the time.
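
For the question's nodes table, a rough sketch of dropping that constraint around the load could look like the following; the constraint name is only a guess at what PostgreSQL auto-generates for the anonymous unique constraint, so check the actual name (e.g. with \d nodes) first. The foreign key to matchers can be handled the same way.

// assumed auto-generated constraint name; verify it before running this
jooq.execute("alter table nodes drop constraint nodes_parse_info_id_id_key");

// ... bulk load the nodes here ...

jooq.execute("alter table nodes add unique (parse_info_id, id)");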

Operate on more basic char[] manipulation API

This code here:

final List<String> parts = new ArrayList<>();
parts.add('"' + uuid.toString() + '"');
Arrays.stream(SEMICOLON.split(input, 4))
      .map(s -> '"' + CSV_ESCAPE.apply(s) + '"')
      .forEach(parts::add);
return String.join(";", parts) + '\n';

This generates a lot of pressure on your garbage collector, as you're implicitly creating, and throwing away, a lot of StringBuilder objects (some background on this can be found in this blog post). Normally, that's fine and shouldn't be optimised prematurely, but in a large batch process, you can certainly gain a couple of percent in speed if you transform the above into something more low-level:

StringBuilder result = new StringBuilder();
result.append('"').append(uuid.toString()).append('"');

// note the ';' separators, to keep the output identical to the original
for (String s : SEMICOLON.split(input, 4))
    result.append(';').append('"').append(CSV_ESCAPE.apply(s)).append('"');

...

Of course, you can still write the same thing in a functional style, but I've found it way easier to optimise these low-level String operations using classic pre-Java 8 idioms.

  • @fge: No, but it will be in a couple of days. I didn't want to create an answer that is outdated in 2-3 days :-) You could already build it from GitHub, just to try bulk and batch loading – Lukas Eder Apr 20 '15 at 14:33
  • 1
    OK, I'll wait until then :p Also, as to more basic APIs, I am aware of the `Stringbuilder` vs concatenation problem, but I'm looking into improving the SQL performance first... And I read my CSVs from zips in the first place, so ideally I'd like to "slurp" from an `Iterator` but then I already mentioned that in the github issue – fge Apr 20 '15 at 14:36