For my day job, I have been tasked with setting up a computer system to run calculations on a large database of strings. I have established a proof of concept, but I don't have the low-level knowledge to optimize the hardware and software environment, so I'm hoping for some guidance on that front.
Setup:
- 100,000 records in a database containing strings
- I will be performing string similarity calculations to look for approximate duplicates
- i.e. comparing each string against every other string, which is 100,000 × 99,999 / 2 ≈ 5 billion comparisons
- I wrote the proof of concept in Ruby, with SQLite3 as the database, against a 1,000-row sample (a simplified sketch of the kind of loop I mean is below this list)
- The full job should finish within a few days - the faster the better, but with diminishing returns. This is a one-time pass, so I don't need a supercomputer if a desktop setup can do it within a few days
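
For context, here is a simplified sketch of the kind of brute-force all-pairs loop I mean. It is not my actual PoC code; the `text` gem for Levenshtein distance, the `records(id, value)` table name, and the distance threshold are just placeholders to illustrate the shape of the work:

```ruby
# Sketch only: brute-force all-pairs comparison over strings stored in SQLite.
# Assumes the 'sqlite3' and 'text' gems, and a hypothetical table records(id, value).
require 'sqlite3'
require 'text'

db   = SQLite3::Database.new('strings.db')
rows = db.execute('SELECT id, value FROM records') # load all strings into memory up front

threshold = 3 # arbitrary max edit distance to count as an approximate duplicate

# Array#combination(2) yields each unordered pair exactly once (~n^2/2 pairs).
rows.combination(2) do |(id_a, str_a), (id_b, str_b)|
  distance = Text::Levenshtein.distance(str_a, str_b)
  puts "#{id_a} ~ #{id_b} (distance #{distance})" if distance <= threshold
end
```

The sketch keeps everything in memory, since 100,000 strings fit comfortably in RAM; the database is only touched once to load the rows, so essentially all of the time goes into the pairwise distance calls.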
What I'm Looking For:
- If I'm building a custom box to run this job (and potentially future jobs of a similar nature), what hardware should I focus on? Should I spend my limited budget on a very fast GPU, a fast CPU, or a large amount of RAM? I don't know Ruby at a low enough level to tell where the bottlenecks are for this type of operation
- Am I missing a better approach? I won't get approval for any major software purchases or expensive hardware, at least until I can prove the method works with this first run. But can anyone suggest a more efficient way of detecting inexact duplicates?