
I am working on a project that requires me to take a live Twitter feed and store records from it in a PostgreSQL database. The project requires that the tweets' location data be stored for searching under PostGIS. I'm using a perl script to get the Twitter feed (using AnyEvent::Twitter::Stream and the Twitter API). Every 5000 tweets, the script fork()s and the child process issues the SQL to insert the rows. I'm using AutoCommit => 0 to speed up the inserts.
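The overall shape of the script is roughly this (simplified; the credentials handling and the insert_batch() routine that does the actual DBI work are placeholders, not the real code):

use strict;
use warnings;
use AnyEvent;
use AnyEvent::Twitter::Stream;

my ($consumer_key, $consumer_secret, $token, $token_secret) = @ARGV;  # placeholder credentials
my @batch;                       # tweets buffered in the parent
my $done = AnyEvent->condvar;

my $stream = AnyEvent::Twitter::Stream->new(
    consumer_key    => $consumer_key,
    consumer_secret => $consumer_secret,
    token           => $token,
    token_secret    => $token_secret,
    method          => 'sample',
    on_tweet        => sub {
        push @batch, shift;
        if (@batch >= 5000) {
            my @to_insert = splice @batch;   # hand the full batch to a child
            my $pid = fork();
            if (defined $pid && $pid == 0) {
                insert_batch(\@to_insert);   # child does the SQL, AutoCommit => 0
                exit 0;
            }
        }
    },
    on_error => sub { warn $_[0]; $done->send },
);

$done->recv;

sub insert_batch { }   # stand-in for the real insert code (see the task list below)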

The problem is that the child process isn't done storing the 5000 tweets before the next 5000 come in, so I get numerous postgres processes. I need to figure out how to speed up the database inserts enough to allow the child process to exit before the next one is started.

The tasks the child process does now (for each tweet) are listed below, with a rough sketch of the corresponding statements after the list:

  • Insert a record in the tweets table, using ST_GeomFromEWKT to convert the latitude/longitude data to GIS coordinates
  • Insure that the author of the tweet and any users mentioned in the tweet are in the users table
  • Insert mentions of users and hashtags in the relevant tables
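Sketched out (the table, column, and statement-handle names here are illustrative rather than the real schema), the per-tweet statements look something like this:

# Prepared once per child, then executed for every tweet:
my $ins_tweet = $dbh->prepare(q{
    INSERT INTO tweets (tweet_id, author_id, body, created_at, location)
    VALUES (?, ?, ?, ?, ST_GeomFromEWKT(?))   -- e.g. 'SRID=4326;POINT(lon lat)'
});
my $sel_user    = $dbh->prepare(q{SELECT 1 FROM users WHERE user_id = ?});
my $ins_user    = $dbh->prepare(q{INSERT INTO users (user_id, screen_name) VALUES (?, ?)});
my $ins_mention = $dbh->prepare(q{INSERT INTO mentions (tweet_id, user_id) VALUES (?, ?)});
my $ins_hashtag = $dbh->prepare(q{INSERT INTO hashtags (tweet_id, tag) VALUES (?, ?)});
# For each tweet: $ins_tweet, then $sel_user/$ins_user for the author and each
# mentioned user, then $ins_mention and $ins_hashtag as needed.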

Any advice on diagnosing the speed or speeding up the process would be most helpful. This eventually has to work in real time, so temporary tables and text files are not good options. The server is a dual-Xeon HP server running Debian with 8G of RAM.

  • [Bulk load](http://stackoverflow.com/questions/758945/whats-the-fastest-way-to-do-a-bulk-insert-into-postgres), and consider increasing the batch size to more than 5000? – runrig Nov 08 '13 at 23:41
  • How can you ask about performance and then fail to give any figures? Are you fetching 5000 tweets per hour, per minute, or per second? What disks are you using: 16 x SSD in a RAID 10, or a USB stick? – Richard Huxton Nov 09 '13 at 07:44

2 Answers


The Postgres docs have a comment on speeding up inserts by (mis)using the INSERT ... SELECT syntax. It seems to make a significant difference; have you tried it?

Useful tip for faster INSERTs: You can use the INSERT INTO tbl <query> syntax to accelerate the speed of inserts by batching them together. For example...

INSERT INTO my_table SELECT 1, 'a' UNION SELECT 2, 'b' UNION SELECT 3, 'c' UNION ...

If you batch up many sets of values per INSERT statement and batch up multiple INSERT statements per transaction, you can achieve significantly faster insertion performance. I managed to achieve almost 8x faster inserts on a PostgreSQL 8.1 / Win2K installation by batching up 100 (small) inserts using this technique.
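For reference, here is a sketch of how that batching could look from DBI. PostgreSQL 8.2 and later also accept the shorter multi-row VALUES form used here, which serves the same purpose as the UNION trick above; the table and hash-key names are only illustrative:

use DBI;

# Insert a batch of tweets (arrayref of { id => ..., text => ... } hashrefs)
# as multi-row INSERTs, 100 rows per statement and one commit per batch.
sub insert_batch {
    my ($dbh, $tweets) = @_;             # $dbh opened with AutoCommit => 0
    my @pending = @$tweets;
    while (my @chunk = splice(@pending, 0, 100)) {
        my $sql = "INSERT INTO tweets (tweet_id, body) VALUES "
                . join(", ", ("(?, ?)") x scalar @chunk);
        my @binds = map { ($_->{id}, $_->{text}) } @chunk;
        $dbh->do($sql, undef, @binds);   # one round trip for up to 100 rows
    }
    $dbh->commit;                        # one transaction for the whole batch
}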

Otherwise, if you cannot get Postgres up to the required speed, you should check the I/O performance on the HP box.

Also, check whether there are many indexes that have to be updated after each insert. You may even need to say goodbye to some of your constraints (FK constraints). That would allow you to insert the records in any order, with no need to wait for a user to be created before inserting the tweet.
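For example (the constraint and index names here are invented; yours will differ):

$dbh->do("ALTER TABLE tweets DROP CONSTRAINT tweets_author_fkey");  # no FK check per insert
$dbh->do("DROP INDEX IF EXISTS tweets_created_at_idx");             # one less index to maintain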

I would also check whether you can look up the users in the database while you are still collecting the tweets. Last but not least, you should implement a queue for the batches of 5000 tweets rather than simply firing them off at the db.
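One way to realize the queue idea (a sketch only; insert_batch() stands for whatever routine does the actual inserts): keep finished batches in the parent and let at most one writer child drain them, so the database never runs more than one bulk insert at a time.

use POSIX ':sys_wait_h';

my (@queue, $writer_pid);

sub maybe_dispatch {
    # Reap the previous writer if it has finished.
    undef $writer_pid if $writer_pid && waitpid($writer_pid, WNOHANG) > 0;
    return if $writer_pid || !@queue;    # writer still busy, or nothing queued
    my $batch = shift @queue;
    $writer_pid = fork();
    if (defined $writer_pid && $writer_pid == 0) {
        insert_batch($batch);            # hypothetical helper doing the inserts
        exit 0;
    }
}

# In the on_tweet callback, instead of forking a new child per 5000 tweets:
#   push @queue, [ splice @batch ] if @batch >= 5000;
#   maybe_dispatch();

If tweets arrive faster than the writer drains the queue, @queue simply grows, which also gives you something easy to monitor.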

thst
  • Great response, and bonus for suggesting collecting the tweets into a queue. – ChuckCottrill Nov 08 '13 at 23:12
  • Dropping the separate check for username has fixed the problem, so it was the lookup that was too slow. How is a queue procedure better than just using a single transaction? – Andrew Perrin Nov 10 '13 at 17:02
  • If your database ever becomes too slow to process your requests, your application will overload the database. With the queue approach, you could even stop the database during high-volume processing, start it again, and eventually it will catch up to being "realtime". The queue does not replace the transaction; it queues up your data before the transaction starts, and entries are removed from the queue immediately when the db is idle. – thst Nov 10 '13 at 22:19

I've benchmarked performance of creating points, and ST_GeomFromEWKT is the slowest method. Try using ST_MakePoint in a prepared statement to minimize any overhead:

use DBI;

# $dbh is an existing DBD::Pg connection, opened with AutoCommit => 0

# Prepare the insert once, outside the loop
my $sth = $dbh->prepare("INSERT INTO mytable (geom) ".
                        "SELECT ST_SetSRID(ST_MakePoint(?, ?), 4326) AS geom");

# In a for-loop of 5000 points, do the insert
$sth->execute($longitude, $latitude);

# Commit once after the whole batch
$dbh->commit;
Mike T