I want to synchronize MongoDB with Hadoop, but when I delete a document from MongoDB, that document must not be deleted in Hadoop.

I tried using mongo-hadoop and Hive. This is the Hive query:

CREATE EXTERNAL TABLE SubComponentSubmission
(
  id STRING,
  status INT,
  providerId STRING,
  dateCreated TIMESTAMP,
  subComponentId STRING,
  packageName STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'=
                    '{"id":"_id", "status":"Status", 
                      "providerId":"ProviderId", 
                      "dateCreated":"DateCreated", 
                      "subComponentId":"SubComponentPackage.SubComponentId", 
                      "packageName":"SubComponentPackage.PackageName"}'
                    )
TBLPROPERTIES('mongo.uri'='mongodb://<host>:27017/<db name>.<collection name>');

This query creates a table that is synchronized with the corresponding MongoDB collection. With this setup, mongo-hadoop reflects document deletions too.

Does mongo-hadoop have an option to ignore document deletions? Or is there another tool that solves this problem?

Thanks in advance.

irakli2692

1 Answer

If you query MongoDB directly, as you're doing here, then yes, you'll see every document mutation that happens in MongoDB, deletions included. That's the whole point of a MongoDB-backed table. If you want snapshotted views of your MongoDB data, you'll need to do something like run mongodump and put the resulting BSON files on disk somewhere (like HDFS). Otherwise you'll always be querying the live, mutating data.
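As a rough sketch of the snapshot approach, assuming the collection has already been dumped with mongodump and the resulting .bson file copied into HDFS, a Hive table can be defined over those files instead of over the live collection. The HDFS path below is hypothetical, and the SerDe and input/output format class names are the ones I believe mongo-hadoop ships for BSON files, so check them against the version you have installed:

```sql
-- Sketch only: reads static BSON dump files from HDFS rather than live MongoDB.
-- Assumption: the LOCATION directory (a hypothetical path) contains only the
-- .bson file(s) produced by mongodump, not the accompanying metadata.json.
CREATE EXTERNAL TABLE SubComponentSubmissionSnapshot
(
  id STRING,
  status INT,
  providerId STRING,
  dateCreated TIMESTAMP,
  subComponentId STRING,
  packageName STRING
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
-- Same column mapping as the MongoDB-backed table in the question.
WITH SERDEPROPERTIES('mongo.columns.mapping'=
                    '{"id":"_id", "status":"Status",
                      "providerId":"ProviderId",
                      "dateCreated":"DateCreated",
                      "subComponentId":"SubComponentPackage.SubComponentId",
                      "packageName":"SubComponentPackage.PackageName"}'
                    )
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/data/mongo/snapshots/SubComponentSubmission/';
```

Because the table is backed by files in HDFS, deleting a document in MongoDB afterwards doesn't touch what was already dumped. Note that each new dump only contains documents that still exist at dump time, so keeping deleted documents means keeping the older snapshot files as well.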

evanchooly