
I have a very large data set that needs a database, and I am using Windows 10 PowerShell as part of my development environment for the CLI and shell interface with both MongoDB and Python.

I have chosen MongoDB as the best solution for the needs of this data and future analysis with Python.

Within the data, I have Python tuples containing units of 3 numbers, each separated by a comma - (x, y, z) - and therefore it was necessary to choose semicolons (;) as the delimiter for the CSV file. Furthermore, MongoDB also has issues with parentheses, so I refactored the tuples into Python lists/arrays [x, y, z]. However, MongoDB apparently has an issue with this as well, so I simplified the data to a plain text string representation "[x, y, z]" in order to first isolate the issues described below:
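For illustration, a minimal sketch of the simplification I mean (the variable names here are hypothetical, not the actual field names in my data):

point = (0, 1, 2)          ## original Python tuple (x, y, z)
as_list = list(point)      ## refactored to a Python list: [0, 1, 2]
as_string = str(as_list)   ## simplified to a text string: "[0, 1, 2]"
print(as_string)           ## prints: [0, 1, 2]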

Evidently, MongoDB has an issue with importing CSV files that use semicolons as delimiters, so I had to re-export the data to a new CSV file using tabs "\t" as the delimiter.
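A minimal sketch of how the tab-delimited file is written (the field names and values here are hypothetical placeholders, matching the simplified example further below):

import csv

rows = [{"KeyField1": 0, "KeyField2": "[0, 1, 2]"}]

## WRITE DATA AS TAB-DELIMITED FILE so the commas inside "[x, y, z]" do not clash with the delimiter
with open("dataCSV.csv", "w", newline="") as f_csv:
    writer = csv.DictWriter(f_csv, fieldnames=["KeyField1", "KeyField2"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)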

Using the mongoimport commands below in PowerShell (together with the status messages), I am able to successfully import the CSV file and create the database with the collection:

./mongoimport --type tsv -d testdbcsv -c testcollection --headerline --drop dataCSV.csv
PS C:\mongodbdbtools\bin> ./mongoimport --type tsv -d testdbcsv -c testcollection --headerline --drop dataCSV.csv
2022-01-10T11:24:01.487+0200    connected to: mongodb://localhost/
2022-01-10T11:24:01.535+0200    dropping: testdbcsv.testcollection
2022-01-10T11:24:01.611+0200    30 document(s) imported successfully. 0 document(s) failed to import.
PS C:\mongodbdbtools\bin>

For the challenge - and to test and learn new aspects of MongoDB - I also created a JSON file of the exact same data as in the CSV, in order to test whether the two separate files (with the same data) would produce identical MongoDB databases.

I have already tested and validated JSONObjectForExport as valid JSON; it is a JSON text string containing a list/array of dictionaries/objects.
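For reference, a simplified stand-in for the structure being serialized (the real data has many more rows; the field names are hypothetical):

ListOfDictionaries = [
    {"KeyField1": 0, "KeyField2": "[0, 1, 2]"},
    {"KeyField1": 1, "KeyField2": "[3, 4, 5]"},
]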

However, using the two Windows PowerShell commands below to mongoimport the JSON file, I get the following messages, depending on whether I add the --jsonArray argument:

./mongoimport -d testdbjson -c testcollection --drop dataJSON.json --jsonArray
PS C:\mongodbdbtools\bin> ./mongoimport -d testdbjson -c testcollection --drop dataJSON.json --jsonArray
2022-01-10T11:25:05.513+0200    connected to: mongodb://localhost/
2022-01-10T11:25:05.562+0200    dropping: testdbjson.testcollection
2022-01-10T11:25:05.563+0200    Failed: error reading separator after document #1: bad JSON array format - found no opening bracket '[' in input source
2022-01-10T11:25:05.563+0200    0 document(s) imported successfully. 0 document(s) failed to import.
PS C:\mongodbdbtools\bin> 
./mongoimport -d testdbjson -c testcollection --drop dataJSON.json
PS C:\mongodbdbtools\bin> ./mongoimport -d testdbjson -c testcollection --drop dataJSON.json
2022-01-10T11:25:24.816+0200    connected to: mongodb://localhost/
2022-01-10T11:25:24.864+0200    dropping: testdbjson.testcollection
2022-01-10T11:25:24.867+0200    Failed: cannot decode string into a primitive.D
2022-01-10T11:25:24.867+0200    0 document(s) imported successfully. 0 document(s) failed to import.
PS C:\mongodbdbtools\bin>

Here is an example of a simplified JSON file with the JSON text string format:

"[{\"KeyField1\": 0, \"KeyField2\": \"[0, 1, 2]\"}]"

So the CSV file imports fine, but the JSON file does not import, even though it is valid JSON of the exact same data that is written to both the CSV and JSON files.

What could be the issue(s) here?

Thanks for the help.

EDIT: More details of the Python code that produces the JSON file, per request, to clarify:

import json

## CONVERT/SERIALIZE DICTIONARY TO JSON:
JSONObjectForExport = json.dumps(ListOfDictionaries)

## BEGIN VALIDATE JSON
## BEGIN DEFINE FUNCTION
def validateJSON(jsonData):
    try:
        json.loads(jsonData)
    except ValueError:
        return False
    return True

## END DEFINE FUNCTION

## CALL FUNCTION
isValid = validateJSON(JSONObjectForExport)

## TEST PRINT OUTPUT
print("Given JSON string is Valid", isValid)

## END VALIDATE JSON

## WRITE DATA AS JSON FILE - 
with open("dataJSON.json", "w") as f_json:
    json.dump(JSONObjectForExport, f_json) 

EDIT: Summary of facts and issues:

1.) Python calculates data and writes the same data to valid CSV and valid JSON format files.

2.) MongoDB successfully imports data from CSV to create db.

3.) MongoDB fails in two different ways, with two different error messages; neither mongoimport command works to import the same data in JSON format.

  • Make an export with `mongoexport` and compare the exported file with your JSON file, then you should see the difference/problem. – Wernfried Domscheit Jan 10 '22 at 09:48
  • Wait, so your JSON file contains a single **string** with a JSON array inside? – Sergio Tulentsev Jan 10 '22 at 09:57
  • More details in the edit per your request @Sergio Tulentsev – Jerusalem Programmer Jan 10 '22 at 10:14
  • Well yes, write an array, not a string. That's what mongoimport is telling you. – Sergio Tulentsev Jan 10 '22 at 10:22
  • I am writing a JSON file according to the Python library for JSON. As you can see, I have tried both options of `mongoimport`, both with `--jsonArray` and without: I exported the data as both CSV and JSON from the same program (with the code above you can see for the JSON), yet neither import command works; I searched for the error messages, but did not find an exact case like this where the same data imports successfully with CSV, but not JSON. – Jerusalem Programmer Jan 10 '22 at 10:27
  • What is the proper format to output and write to file from the Python so that it successfully imports in MongoDB? – Jerusalem Programmer Jan 10 '22 at 10:29
  • @JerusalemProgrammer I don't know how to make it any clearer, but there's a difference between `"[1, 2, 3]"` and `[1, 2, 3]`. You're producing the former, but mongoimport expects the latter. – Sergio Tulentsev Jan 10 '22 at 10:41
  • So then I need to clarify: I put quotes to show you that it was a text string representation of the same characters as the array; it does not matter what is between the quotes; it is a simple string as data and not relevant; the same data of text strings and numbers only does not import from JSON format, but the exact same data that you see imports fine from CSV. – Jerusalem Programmer Jan 10 '22 at 10:52
  • @JerusalemProgrammer: Now I see how one could be confused by my last comment. Ok, last time: there's a difference between `"[{...}, {...}, ...]"` and `[{...}, {...}, ...]`. You're producing the former, but mongoimport expects the latter. – Sergio Tulentsev Jan 10 '22 at 11:00
  • Hi Sergio, you are not understanding: MongoDB does not expect anything. I am giving text strings as values in quotes. Again: the same data works fine in CSV. – Jerusalem Programmer Jan 10 '22 at 14:04
  • @JerusalemProgrammer I'm not talking about field values, though. And mongodb does expect _something_, evidently. As seen in the error message, "bad JSON array format - found no opening bracket '[' in input source" – Sergio Tulentsev Jan 10 '22 at 16:59
  • That is one error message of the two detailed above; yes, that is one error message, and you can see the sample data has the square brackets, so I do not understand either of the TWO error messages produced by the two options tried. – Jerusalem Programmer Jan 10 '22 at 17:27
  • @Wernfried Domscheit I tried your suggestion, but was only able to export MongoDB format with _id, and it is not the same data since it is not pure JSON format of the pure CSV data that I imported; this creates another issue that I can investigate later, after first solving this issue of how to import JSON into MongoDB. – Jerusalem Programmer Jan 10 '22 at 17:42

2 Answers


So I tested a few things, and even though the above JSONObjectForExport is valid JSON produced by Python, for whatever reason MongoDB does not like the backslash-escaped quotes, so I will investigate that next and update here ASAP.

The following work:

FOR SINGLE JSON OBJECTS:

./mongoimport -d testdbjson -c testcollection --drop dataJSON.json

The command above works for the JSON object below:

{"KeyField1": 0, "KeyField2": [0, 1, 2]}

OR

{"KeyField1": 0, "KeyField2": "[0, 1, 2]"}

FOR ARRAYS OF JSON OBJECTS:

./mongoimport -d testdbjson -c testcollection --drop dataJSON.json --jsonArray

The command above works for the array of JSON objects below:

[{"KeyField1": 0, "KeyField2": [0, 1, 2]}]

OR

[{"KeyField1": 0, "KeyField2": "[0, 1, 2]"}]
  • Ah, good to see you finally managed to get the correct content. Only your conclusion is all wrong. "backslash escaped quotes" is not what was giving you problems. It was you having a string instead of an array. – Sergio Tulentsev Jan 11 '22 at 15:04
  • @Sergio Tulentsev as you can see: values with both string or array work; the issue remains semi-unsolved because what is not working is the JSON file produced by Python when I write the data to JSON file as per code in the question; For debugging I like to simplify things and wrote my own JSON to test; so now I need to examine the Python output and figure out why it tests as valid JSON yet is rejected by MongoDB. – Jerusalem Programmer Jan 11 '22 at 16:27
  • Your code outputs a JSON string. Whereas mongodb expects a JSON array. Both are "valid JSON". – Sergio Tulentsev Jan 11 '22 at 16:29
  • If you still can't see it, put your generated file right next to your handcrafted file and compare them character by character. – Sergio Tulentsev Jan 11 '22 at 16:31

Okay, so after testing a few things this morning, the issue was double encoding of the data to JSON, as per the issue here, which has an alternate solution:

Dump to JSON adds additional double quotes and escaping of quotes

Simply put: it is unnecessary to use both json.dumps() and json.dump().
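A tiny demonstration of the difference (the file names here are just for the demonstration; field names are the simplified ones from the question):

import json

data = [{"KeyField1": 0, "KeyField2": "[0, 1, 2]"}]

## Double encoding: json.dumps() already returns a JSON text string, and
## json.dump() then serializes that string again, wrapping the whole array
## in quotes and backslash-escaping the inner quotes.
with open("double.json", "w") as f:
    json.dump(json.dumps(data), f)
## double.json contains: "[{\"KeyField1\": 0, \"KeyField2\": \"[0, 1, 2]\"}]"

## Single encoding: json.dump() serializes the list of dictionaries directly,
## producing the JSON array that mongoimport --jsonArray expects.
with open("single.json", "w") as f:
    json.dump(data, f)
## single.json contains: [{"KeyField1": 0, "KeyField2": "[0, 1, 2]"}]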

The following code, in which I commented out the unnecessary json.dumps() step (and instead dump the ListOfDictionaries directly), solves the problem, and now MongoDB imports the JSON fine.

CONCLUSION / THEORY to be tested and confirmed: the double encoding wraps the whole array in quotes (with backslash-escaped inner quotes), so mongoimport sees a JSON string rather than the JSON array it expects:

import json

## CONVERT/SERIALIZE DICTIONARY TO JSON:
## JSONObjectForExport = json.dumps(ListOfDictionaries) ## UNNECESSARY TO ENCODE JSON HERE
JSONObjectForExport = ListOfDictionaries  ## the list of dictionaries computed earlier

## WRITE DATA AS JSON FILE -
with open("dataJSON.json", "w") as f_json:
    json.dump(JSONObjectForExport, f_json)
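With that change, a quick sanity check (a sketch) is to read back the start of the file; it should now begin with [ rather than a double quote:

with open("dataJSON.json") as f_check:
    print(f_check.read(80))   ## should start with [{"KeyField1": ... rather than "[{\"KeyField1\": ...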