I am new to MongoDB so bear with me please. I have a CSV file example.csv
that looks like this:
Sample,Chromosome,Position,Reference,Mutation,ReadDepth
testfile_snp,chr1,69511,A,G,10
testfile_snp,chr1,924024,C,G,12
testfile_snp,chr1,924533,A,G,13
testfile_snp,chr1,942451,T,C,22
testfile_snp,chr1,946247,G,A,44
testfile_snp,chr1,952421,A,G,32
testfile_snp,chr1,953259,T,C,37
testfile_snp,chr1,953279,T,C,23
testfile_snp,chr1,961945,G,C,40
testfile_snp,chr1,966227,C,G,35
and I have many files with around 25k lines in each. I would like to query each line in MongoDB. In my database, Sample,Chromosome,Position,Reference,Mutation
are indexed as compound indexes
. I tried to look around for a solution and the only relevant thing I found is the following thread. I can change the format of the CSV to queries with the commands below:
gawk -i inplace -F',' '{print "db.TestCollection.find({\"Sample\": \"" $1 "\", \"Chromosome\": \"" $2 "\", \"Position\": " $3 ", \"Reference\": \"" $4 "\", \"Mutation\": \"" $5 "\"})"}' example.csv
sed -i "1s/.*/use TestDatabase/" example.csv
mv example.csv example.js
which will output:
use TestDatabase
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 69511, "Reference": "A", "Mutation": "G"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 924024, "Reference": "C", "Mutation": "G"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 924533, "Reference": "A", "Mutation": "G"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 942451, "Reference": "T", "Mutation": "C"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 946247, "Reference": "G", "Mutation": "A"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 952421, "Reference": "A", "Mutation": "G"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 953259, "Reference": "T", "Mutation": "C"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 953279, "Reference": "T", "Mutation": "C"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 961945, "Reference": "G", "Mutation": "C"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 966227, "Reference": "C", "Mutation": "G"})
Then I can use this file to feed it to MongoDB:
mongo < example.js
Currently, that's the way I have so far to query each of the line. However, I found another thread in which I can do a bulk query using the IN
operator. The issue is that it behaves as OR
in all the fields given:
use TestDatabase
db.TestCollection({"Sample": { $in : ["testfile_snp", "sv37213_hg38"] }, "Chromosome": "chr1", "Position": { $in : [69270,182585422]}, "Reference" : {$in : ["A", "C"]}, "Mutation" : {$in : ["G", "T"]} } )
would give:
MongoDB shell version v4.0.8
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("fb07f25a-3a4f-4c32-bd4e-70f3c3129435") }
MongoDB server version: 4.0.8
switched to db TestDatabase
{ "_id" : ObjectId("5ca47c1e0953f323b3b9cac5"), "Sample" : "sv37213_hg38", "Chromosome" : "chr1", "Position" : 69270, "Reference" : "A", "Mutation" : "G", "ReadDepth" : 19 }
{ "_id" : ObjectId("5ca47c1e0953f323b3b9e10f"), "Sample" : "sv37213_hg38", "Chromosome" : "chr1", "Position" : 182585422, "Reference" : "C", "Mutation" : "T", "ReadDepth" : 66 }
{ "_id" : ObjectId("5ca47bca0953f323b39019b1"), "Sample" : "test-exome-1_hg38", "Chromosome" : "chr1", "Position" : 69270, "Reference" : "A", "Mutation" : "G", "ReadDepth" : 17 }
bye
As you see, this query returns 2 documents for sv37213_hg38
which is not my wish. I only wanted the position 182585422
to be printed.
Is there any function in mongo that allows to query in bulk the entire content of my file or do I have to do it for each line ?