0

I am new to MongoDB so bear with me please. I have a CSV file example.csv that looks like this:

Sample,Chromosome,Position,Reference,Mutation,ReadDepth
testfile_snp,chr1,69511,A,G,10
testfile_snp,chr1,924024,C,G,12
testfile_snp,chr1,924533,A,G,13
testfile_snp,chr1,942451,T,C,22
testfile_snp,chr1,946247,G,A,44
testfile_snp,chr1,952421,A,G,32
testfile_snp,chr1,953259,T,C,37
testfile_snp,chr1,953279,T,C,23
testfile_snp,chr1,961945,G,C,40
testfile_snp,chr1,966227,C,G,35

and I have many files with around 25k lines in each. I would like to query each line in MongoDB. In my database, Sample,Chromosome,Position,Reference,Mutation are indexed as compound indexes. I tried to look around for a solution and the only relevant thing I found is the following thread. I can change the format of the CSV to queries with the commands below:

gawk -i inplace -F',' '{print "db.TestCollection.find({\"Sample\": \"" $1 "\", \"Chromosome\": \"" $2 "\", \"Position\": " $3 ", \"Reference\": \"" $4 "\", \"Mutation\": \"" $5 "\"})"}' example.csv
sed -i "1s/.*/use TestDatabase/" example.csv
mv example.csv example.js

which will output:

use TestDatabase
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 69511, "Reference": "A", "Mutation": "G"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 924024, "Reference": "C", "Mutation": "G"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 924533, "Reference": "A", "Mutation": "G"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 942451, "Reference": "T", "Mutation": "C"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 946247, "Reference": "G", "Mutation": "A"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 952421, "Reference": "A", "Mutation": "G"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 953259, "Reference": "T", "Mutation": "C"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 953279, "Reference": "T", "Mutation": "C"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 961945, "Reference": "G", "Mutation": "C"})
db.TestCollection.find({"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 966227, "Reference": "C", "Mutation": "G"})

Then I can use this file to feed it to MongoDB:

mongo < example.js

Currently, that's the way I have so far to query each of the line. However, I found another thread in which I can do a bulk query using the IN operator. The issue is that it behaves as OR in all the fields given:

use TestDatabase
db.TestCollection({"Sample": { $in : ["testfile_snp", "sv37213_hg38"] }, "Chromosome": "chr1", "Position": { $in : [69270,182585422]}, "Reference" : {$in : ["A", "C"]}, "Mutation" : {$in : ["G", "T"]} } )

would give:

MongoDB shell version v4.0.8
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("fb07f25a-3a4f-4c32-bd4e-70f3c3129435") }
MongoDB server version: 4.0.8
switched to db TestDatabase
{ "_id" : ObjectId("5ca47c1e0953f323b3b9cac5"), "Sample" : "sv37213_hg38", "Chromosome" : "chr1", "Position" : 69270, "Reference" : "A", "Mutation" : "G", "ReadDepth" : 19 }
{ "_id" : ObjectId("5ca47c1e0953f323b3b9e10f"), "Sample" : "sv37213_hg38", "Chromosome" : "chr1", "Position" : 182585422, "Reference" : "C", "Mutation" : "T", "ReadDepth" : 66 }
{ "_id" : ObjectId("5ca47bca0953f323b39019b1"), "Sample" : "test-exome-1_hg38", "Chromosome" : "chr1", "Position" : 69270, "Reference" : "A", "Mutation" : "G", "ReadDepth" : 17 }
bye

As you see, this query returns 2 documents for sv37213_hg38 which is not my wish. I only wanted the position 182585422 to be printed.

Is there any function in mongo that allows to query in bulk the entire content of my file or do I have to do it for each line ?

user324810
  • 597
  • 8
  • 20
  • I'm not understanding your reasoning here. You only wanted the `182585422` document, so why use `$in` to query for two `Position`s? Aside from this, have you looked into [`mongoimport`](https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv) for bulk CSV importing? – Adam Apr 03 '19 at 14:13
  • @Adam as I said, I want to query all the rows in my CSV file, I was thinking of passing all the rows in a single query instead of doing a query for each file – user324810 Apr 03 '19 at 14:22

1 Answers1

0

Rather than using $in, you can use $or and simply put the original queries that you were originally doing one-by-one.

$or: [
  {"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 924024, "Reference": "C", "Mutation": "G"}
  {"Sample": "testfile_snp", "Chromosome": "chr1", "Position": 924533, "Reference": "A", "Mutation": "G"}
]
klhr
  • 3,300
  • 1
  • 11
  • 25
  • Would he perform faster queries using this method ? – user324810 Apr 03 '19 at 16:13
  • Nope! If you take a look at the `explain` for that query, you'll see that it decomposes that query into the queries that you had originally. I believe that each connection gets a single thread within mongo, so one way of speeding this up might be to try opening up multiple connections and querying subsets of that document – klhr Apr 03 '19 at 16:16
  • If you're trying to look up 25k documents and aren't able to restrict the search in any way, you're going to need to do 25k searches. I'm sure there's a better way of querying & searching, but I think it will require understanding the layout of your data – klhr Apr 03 '19 at 16:20
  • Thanks for the clarification. Can you please explain how to open up multiple connections and perform separate queries ? This might be useful. Actually, the layout of my data is exactly similar to my CSV file. There are 6 fields and more than 100m documents. Each file contains 25k rows. So, your idea would be to split my file into multiple subset files and perform queries on them ? – user324810 Apr 03 '19 at 16:25