I am working on a project that requires me to take a live Twitter feed and store records from it in a PostgreSQL database. The project requires that the tweets' location data be stored for searching under PostGIS. I'm using a Perl script to read the Twitter feed (via AnyEvent::Twitter::Stream and the Twitter API). Every 5000 tweets, the script fork()s and the child process issues the SQL to insert the rows. I'm using AutoCommit => 0 to speed up the inserts.
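Simplified, the batching and fork structure looks roughly like this (credentials and the track filter are placeholders, insert_tweet stands in for the real insert code, and the actual script has more error handling):

    use strict;
    use warnings;
    use AnyEvent;
    use AnyEvent::Twitter::Stream;
    use DBI;

    $SIG{CHLD} = 'IGNORE';      # don't leave zombie children behind
    my @buffer;
    my $done = AE::cv;

    my $stream = AnyEvent::Twitter::Stream->new(
        consumer_key    => $ENV{TW_CONSUMER_KEY},
        consumer_secret => $ENV{TW_CONSUMER_SECRET},
        token           => $ENV{TW_TOKEN},
        token_secret    => $ENV{TW_TOKEN_SECRET},
        method          => 'filter',
        track           => 'some,keywords',
        on_tweet        => sub {
            my $tweet = shift;
            push @buffer, $tweet;
            if (@buffer >= 5000) {
                flush_batch([@buffer]);   # hand the batch to a child
                @buffer = ();
            }
        },
        on_error => sub { warn "stream error: $_[0]\n"; $done->send },
    );

    $done->recv;

    sub flush_batch {
        my ($tweets) = @_;
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        return if $pid;                   # parent keeps streaming

        # Child: its own connection, AutoCommit off, one commit per batch.
        my $dbh = DBI->connect('dbi:Pg:dbname=tweets', 'user', 'pass',
                               { AutoCommit => 0, RaiseError => 1 });
        insert_tweet($dbh, $_) for @$tweets;
        $dbh->commit;
        $dbh->disconnect;
        exit 0;
    }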
The problem is that the child process isn't done storing its 5000 tweets before the next 5000 arrive, so the children pile up and I end up with numerous postgres processes running at once. I need to figure out how to speed up the database inserts enough for each child process to exit before the next one is started.
The tasks the child process does now, for each tweet, are (sketched in more detail below):
- Insert a record into the tweets table, using ST_GeomFromEWKT to convert the latitude/longitude data into a PostGIS geometry
- Ensure that the author of the tweet and any users mentioned in the tweet are in the users table
- Insert the tweet's user mentions and hashtags into the relevant tables
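Per tweet, the child does roughly the following (table and column names here are simplified stand-ins for the real schema, and this assumes the tweet actually carries coordinates):

    sub insert_tweet {
        my ($dbh, $t) = @_;

        # 1. Insert the tweet, converting lon/lat into a PostGIS geometry
        $dbh->do(q{
            INSERT INTO tweets (tweet_id, user_id, body, geom)
            VALUES (?, ?, ?, ST_GeomFromEWKT(?))
        }, undef,
            $t->{id}, $t->{user}{id}, $t->{text},
            sprintf('SRID=4326;POINT(%s %s)',
                    @{ $t->{coordinates}{coordinates} }));   # [lon, lat]

        # 2. Ensure the author and every mentioned user exist in users
        my $user_sth = $dbh->prepare_cached(q{
            INSERT INTO users (user_id, screen_name)
            SELECT ?, ?
            WHERE NOT EXISTS (SELECT 1 FROM users WHERE user_id = ?)
        });
        $user_sth->execute($t->{user}{id}, $t->{user}{screen_name},
                           $t->{user}{id});

        # 3. Record user mentions and hashtags
        for my $m (@{ $t->{entities}{user_mentions} || [] }) {
            $user_sth->execute($m->{id}, $m->{screen_name}, $m->{id});
            $dbh->do(q{INSERT INTO mentions (tweet_id, user_id) VALUES (?, ?)},
                     undef, $t->{id}, $m->{id});
        }
        for my $h (@{ $t->{entities}{hashtags} || [] }) {
            $dbh->do(q{INSERT INTO hashtags (tweet_id, tag) VALUES (?, ?)},
                     undef, $t->{id}, $h->{text});
        }
    }

So each tweet currently costs several round trips: one insert for the tweet itself, a conditional insert per user, and one insert per mention and per hashtag, all inside the batch's single transaction.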
Any advice on diagnosing the bottleneck or speeding up the inserts would be most helpful. This eventually has to work in real time, so temporary tables and intermediate text files are not good options. The server is a dual-Xeon HP server running Debian with 8 GB of RAM.