0

I am dealing with some very large third party CSV files which I import in Mongo. The issue is that the headers on these files are very inconsistent. Some files contain all lowercase, some mixed case and some all uppercase headers. After the file is imported, is there a way to query a collection case insensitively?

In short can I do:

 @collection.find('AB' => value) so that it's true for fields 'ab' or aB' as well?

Is there a way to normalize these headers at the time of import? or may be there's another way?

The question suggested as a duplicate is for regex searches on values and not the field itself

Rahul
  • 321
  • 2
  • 14
  • @mudasobwa the question you've linked is for searches on values not fields – Rahul Jul 06 '17 at 16:47
  • 1
    What do you mean “values not fields”? While schemaless, `mongo` has case sensitive keys and `{"a" => 42, "A" => 3.14}` is a perfectly valid record having two different fields. I will unmark the question as duplicate, but you should consider downcasing your keys on the insertion stage. – Aleksei Matiushkin Jul 06 '17 at 16:55
  • Normalize on insertion as suggested by @mudasobwa. Your current question is equivalent to "I would like to find a value by its column but i don't know the name of the column or the value of the value" – engineersmnky Jul 06 '17 at 17:10
  • @mudasobwa is there a way to downcase keys on insertion during import? – Rahul Jul 07 '17 at 08:53
  • @engineersmnky I understand what you are saying but I do know the keys, its just that I don't know what case they are in due to the poor quality external data I am dealing with. I am looking for a way to either upcase or down case these on import – Rahul Jul 07 '17 at 08:57
  • Show how you do import the data. I am not a fortuneteller. – Aleksei Matiushkin Jul 07 '17 at 09:24
  • @mudasobwa I use mongoimport task. `mongoimport -d db -c collection --type csv --headerline --file file.csv` The csv file is what has headers that are mixed case – Rahul Jul 07 '17 at 11:50
  • 1
    Do `ruby -e "File.write('/path/to/file', File.readlines('/path/to/file').tap { |l| l[0].downcase! })"` upfront to fix the titles to be downcased. – Aleksei Matiushkin Jul 07 '17 at 12:24
  • @mudasobwa beat me to it. OP as stated above just preprocess the file to ensure the headers are downcased first – engineersmnky Jul 07 '17 at 12:33
  • @mudasobwa thanks for this. Looks like something that could work for me. I will give it a shot. The only thing is the files I have are rather large. Will this not read the entire file just to down case the first line? – Rahul Jul 07 '17 at 22:16
  • Yes it will. There are many possibilities to not load the entire file into memory, which this margin is too small to contain. You might ask the different question, but I am positive you’ll discover the solution yourself. – Aleksei Matiushkin Jul 08 '17 at 03:58
  • Thanks @mudasobwa. I have seen that there is a readline method that only reads the first line of the file. If you post your solution as an answer I will be happy to accept – Rahul Jul 08 '17 at 09:48

1 Answers1

-2

You should use the built-in Ruby methods for the String class documentation. For example, .strip!, would be a good one to use to remove excess whitespace. .downcase! could be used to normalize the case. There are others you may find useful to use in the documentation. This is one of the best resources for beginner to intermediate level Ruby programming. Once you master the methods of each class, you can leverage the full power of Ruby.