
I have an existing index of Lucene index files and the Java code to perform search functions on it.

What I would like to do is perform the same thing on a server, so users of an app could simply pass a query that is taken as an input parameter by the Java program and run against the existing index to return the documents in which it occurs.
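
For reference, the local setup being described presumably boils down to something like the following minimal Lucene sketch; the index path, field names, and analyzer here are assumptions for illustration, not taken from the actual code:

// Minimal sketch of a local Lucene search; adjust the field name and analyzer
// to match how the real index was built (both are assumptions here).
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class LocalSearch {
    public static void main(String[] args) throws Exception {
        String queryText = args[0];  // the user's query, passed in as a parameter
        DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")));  // assumed index location
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser("contents", new StandardAnalyzer()).parse(queryText);
        TopDocs hits = searcher.search(query, 10);  // top 10 matches
        for (ScoreDoc sd : hits.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            System.out.println(doc.get("path"));  // assumed stored field identifying the document
        }
        reader.close();
    }
}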

All of the implementation has been tested on my local PC, but what I need to do is implement it for an Android app.

So far I have read around and concluded that I should port the code to AWS Lambda, store the index files in S3, and access the S3 objects from Lambda.

Is this the right approach? Any resources that point to this approach, or alternative suggestions, are also appreciated.

Thanks in advance.

pah
  • Have you considered using Elasticsearch (https://github.com/elastic/elasticsearch)? It's a search engine built on top of Lucene that provides many great features, one of which is a REST API that you could use to query your index. – jzonthemtn Aug 07 '16 at 18:56

3 Answers


Every time your Android app sends a request to AWS Lambda (via AWS API Gateway I assume) the Lambda function will have to download the entire index file from S3 to the Lambda /tmp directory (where Lambda has a 512MB limit) and then perform a search against that index file. This seems extremely inefficient, and depending on how large your index file is, it might perform terribly or it might not even fit into the space you have available on Lambda.

I would suggest looking into the AWS Elasticsearch Service. This is a fully managed search engine service, based on Lucene, that you should be able to query directly from your Android application.
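
For example, a query from the app would then just be an HTTP call to the Elasticsearch _search endpoint. A rough sketch, where the domain endpoint, index name ("docs"), and field ("contents") are placeholders rather than anything from your setup:

// Rough sketch: POST a match query to an Elasticsearch _search endpoint.
// The endpoint URL, index name, and field name are assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EsSearch {
    public static String search(String queryText) throws Exception {
        URL url = new URL("https://YOUR-ES-DOMAIN.us-east-1.es.amazonaws.com/docs/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        String body = "{\"query\":{\"match\":{\"contents\":\"" + queryText + "\"}}}";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line);
            }
        }
        return response.toString();  // JSON response containing the search hits
    }
}

On Android this would need to run off the main thread (AsyncTask, Retrofit, etc.), and it assumes the domain's access policy allows the request; otherwise requests have to be signed or routed through a proxy.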

Mark B
  • Is there a way to download the index into the /tmp directory once during initialization and then perform all operations from the index present in /tmp? My index files are smaller than 512 MB, so the limit should not present any problems. The index files won't be updated, so the size will always remain within the limits (see the sketch after these comments). – Aug 08 '16 at 07:22
  • @rivendellking that would depend on your Lambda containers being reused. How much search traffic do you get? At least 2 or 3 searches every minute? – Mark B Aug 08 '16 at 11:58
  • The app hasn't yet been released, but it is a small-scale project, so I wouldn't expect much traffic. The problem I am facing with Elasticsearch is that the code I wrote uses Executors to pass the same query in a number of different combinations. With Elasticsearch I would have to send 3-4 requests for each query and aggregate the responses on the client side. This, even though possible, would prove quite cumbersome, as it would involve discarding the Java code and adapting the logic to the Elasticsearch system. Ideally I'd like to transfer the PC workflow (Java on the existing index) to the server. – Aug 08 '16 at 12:13
  • I doubt you will see any container reuse given your amount of traffic. If you need to stick with the code you have already written, why not just run it on an EC2 instance? – Mark B Aug 08 '16 at 12:17
  • Sorry if this question seems stupid, but I have absolutely zero knowledge about cloud services. How and where can I save the index while running an EC2 instance? I will probably be running on the free tier, so will the EC2 instance be enough? Also, could you point me to online resources where one can learn about AWS services? – Aug 08 '16 at 12:21
  • On the EC2 instance you would have an EBS volume mounted to the instance where your filesystem resides. You would save the index to that, and take EBS snapshots for backups. You could also do periodic copies to S3 for backups. Given the limited amount of traffic you say you will get, I think a t2.micro instance will probably be fine. I learned about AWS by reading the official documentation and experimenting with the different services, and filled in the gaps in my knowledge by taking the acloud.guru courses. – Mark B Aug 08 '16 at 14:07
  • Thanks for all the help! One last question: since I am planning to run this setup from Android, are there any Android-based restrictions in achieving this, or will a standard call via AsyncTask/Retrofit be enough to handle the request side of things? – Aug 08 '16 at 15:25
  • There shouldn't be any Android specific restrictions you would need to worry about. – Mark B Aug 08 '16 at 15:30
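
To illustrate the approach discussed in the comments above, here is a rough sketch of a Lambda handler that copies the index from S3 to /tmp only on a cold start and reuses it while the container stays warm. The bucket name, key prefix, field names, and request/response types are assumptions, not the poster's actual code:

// Rough sketch: cache the Lucene index in /tmp across warm Lambda invocations.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import java.io.File;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SearchHandler implements RequestHandler<String, List<String>> {
    private static IndexSearcher searcher;  // survives across warm invocations of the same container

    private static synchronized IndexSearcher getSearcher() throws Exception {
        if (searcher == null) {  // only runs on a cold start
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            File dir = new File("/tmp/index");
            dir.mkdirs();
            // "my-index-bucket" and the "index/" prefix are assumed names
            for (S3ObjectSummary o : s3.listObjects("my-index-bucket", "index/").getObjectSummaries()) {
                String name = o.getKey().substring(o.getKey().lastIndexOf('/') + 1);
                if (!name.isEmpty()) {
                    s3.getObject(new GetObjectRequest(o.getBucketName(), o.getKey()), new File(dir, name));
                }
            }
            searcher = new IndexSearcher(
                    DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index"))));
        }
        return searcher;
    }

    @Override
    public List<String> handleRequest(String queryText, Context context) {
        List<String> results = new ArrayList<>();
        try {
            IndexSearcher s = getSearcher();
            Query q = new QueryParser("contents", new StandardAnalyzer()).parse(queryText);
            for (ScoreDoc sd : s.search(q, 10).scoreDocs) {
                results.add(s.doc(sd.doc).get("path"));  // assumed stored field identifying the document
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return results;
    }
}

The first request after a cold start pays the full download cost; warm invocations reuse the static searcher, which only helps if containers are actually reused, as Mark B notes above.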

As you already have your index files in S3, you can point your Lucene IndexReader at a location on S3.

// env is a Map of S3 filesystem settings (e.g. credentials); leave it empty to rely on the Lambda role
Map<String, Object> env = new HashMap<>();
String index = "/<BUCKET_NAME>/<INDEX_LOCATION>/";
String endpoint = "s3://s3.amazonaws.com/";
Path path = new com.upplication.s3fs.S3FileSystemProvider().newFileSystem(URI.create(endpoint), env).getPath(index);
IndexReader reader = DirectoryReader.open(FSDirectory.open(path));

You can either pass client credentials in the env map or assign a role to your Lambda function.

Ref: https://github.com/prathameshjagtap/aws-lambda-s3-index-search/blob/master/lucene-s3-searcher/src/com/printlele/SearchFiles.java


For Lucene indices smaller than 512MB you can experiment with lucene-s3directory.

As Mark said, on AWS Lambda you are limited to 512MB on /tmp. I think having a completely serverless search service is very desirable but until that limit is gone, we're stuck with EC2 for production deployments. Once you go with running Lucene on EC2, storing the index on S3 becomes pointless as you have access to EBS or ephemeral storage.

In case you want to try out S3Directory, here's how to get started:

S3Directory dir = new S3Directory("my-lucene-index");
dir.create();
// use it in your code in place of FSDirectory, for example
dir.close();
dir.delete();
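
If that fits your index, the drop-in usage in place of FSDirectory might look roughly like this (a hedged sketch; the field name and query are assumptions, and it presumes the index has already been written to the bucket):

// Rough sketch: searching an index stored via S3Directory ("contents" is an assumed field name).
S3Directory dir = new S3Directory("my-lucene-index");
DirectoryReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs hits = searcher.search(
        new QueryParser("contents", new StandardAnalyzer()).parse("user query"), 10);
reader.close();
dir.close();
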
albogdano