0

My use case is following:

I have a file which contains columns placeName and country. Now given an input place name, I want to find the country. Since place name that might be passed won't be exact match, I wanted to use %LIKE% kind of matching.

Sample data in file:

PlaceName,Country 
Los Angeles,US
Las Vegas,US
Portland,US

Input file:

Place1
place2

Input file which has around 50 million place names. Iterating one by one on that input and then another loop on above mapping in file 1 and doing string matching will be very inefficient. It is O(n*m) where n is number of input palces and m is places in mapping file.

One obvious way is to load this data into database and use it from there. But list of input places is very big (~50 million) so querying database again and again would be very inefficient.

Second approach that I was thinking of - Implement some data structure which will load all data from file and allow 'like' operations in memory. But I don't seem to be able to find it.

Last option I can think of is to - use database only as in approach 1 but use caching. Is there a way to tell framework like Hibernate to load all data from table in cache and do all 'like' operation there?

Any suggestions on possible solutions?

RandomQuestion
  • 6,778
  • 17
  • 61
  • 97
  • 3
    No need to load you data into a database to just use %like%. Java has you covered. There are many ways to do it. We would need to know more about your data and how you've tried to do it to make recommendations. – Damien Black Feb 10 '14 at 23:28
  • 4
    Why do you think that doing this with a database would be inefficient? This is _exactly_ the sort of thing which databases are quite good at. – Matt Ball Feb 10 '14 at 23:29
  • Pre-existing code has been worked on to improve it over time. Trying to naively load 50 million data points and search on it would probably be less efficient than just using a database. – hsun324 Feb 10 '14 at 23:31
  • 1
    @BoristheSpider sorting the file won't help since you can search `"%foo%"`... – Luiggi Mendoza Feb 10 '14 at 23:34
  • @PM77-1, that way I'll have to loop through all my input values and with second loop on mapping that I already have. O(n^2). I was trying to avoid the second loop. – RandomQuestion Feb 10 '14 at 23:34
  • @MattBall, If I am right, in memory matching would be much faster than talking to database, no matter how good databases are, Igven I am able to implement that matching. ;) – RandomQuestion Feb 10 '14 at 23:40
  • 1
    @Jitendra both doing it *in memory* and retrieving it from database would take similar time since after all you need to load the data from disk (high cost) before doing the comparison (in memory). So, I would say to do a proof of concept for your specific environment and configuration and choose the best solution for your case. – Luiggi Mendoza Feb 10 '14 at 23:45
  • @LuiggiMendoza, I meant I would have loaded data only once in memory in starting and then do rest of processing later. This was a one time project. – RandomQuestion Feb 10 '14 at 23:47
  • I am not sure if I understand how this is duplicate of other question. – RandomQuestion Feb 10 '14 at 23:47
  • First you have to measure if your system can support loading the 50 million `String`s in memory. If you're capable of that (without OOM errors), then you could go with your *in-memory* solution. – Luiggi Mendoza Feb 10 '14 at 23:50
  • Using hibernate here is not a good solution since it will just cache the results of the query (as key) with the returned value (as, obviously, the value). – Luiggi Mendoza Feb 10 '14 at 23:51
  • I would be loading only mapping file into memory (which has 8M mapping) and read the places from input file one by one from disk. – RandomQuestion Feb 10 '14 at 23:52
  • 1
    Again: make a proof of concept with both solutions for **your specific environment and configuration**, get the results in numbers to ease the measure of performance e.g. time, memory used, etc., and define your own solution. – Luiggi Mendoza Feb 10 '14 at 23:53

1 Answers1

0

Just use a regular expression, using .* where you would use % in a LIKE clause.

user207421
  • 305,947
  • 44
  • 307
  • 483
  • plz see my updated question. I don't want to put two loops on list of values in input file and values in mapping file. – RandomQuestion Feb 10 '14 at 23:37
  • 1
    So use a database like everybody said. NB if you only have trailing '%' you can use a sorted collection like `TreeMap.` If you have leading '%' you must search the entire collection, or database. If it's possible to disallow leading '%' you should do so. – user207421 Feb 11 '14 at 00:05