
I have a CSV file with an array in string format, as below:

date,name,criteria
2018-05-16,John,"[{'age':35},{'birthyear':1983}]"
2018-05-16,Jane,"[{'age':36},{'birthyear':1982}]"

I am using Python with pandas and numpy to process this.

I need to import this file into a MongoDB collection in the following format:

{
   "date": "2018-05-16",
   "name": "John",
   "criteria": [
      { "age": 35 },
      { "birthyear": 1983 }
   ]
},
{
   "date": "2018-05-16",
   "name": "Jane",
   "criteria": [
      { "age": 36 },
      { "birthyear": 1982 }
   ]
}


I tried using a JSON formatter, but after insertion into MongoDB the array stays the same string as in the CSV file.

I have tried the following approaches:

#Approach 1
import pymongo
from pymongo import MongoClient
import pandas as pd
import numpy as np
import json
from datetime import datetime

df = pd.read_csv("file.csv")
records = json.loads(df.T.to_json()).values()
db.tmp_collection.insert_many(list(records))



#Approach 2
import pymongo
from pymongo import MongoClient
import pandas as pd
import numpy as np
import json
from datetime import datetime

df = pd.read_csv("file.csv")
data_json = json.loads(df.to_json(orient='records'))
db.tmp_collection.insert_many(data_json)

Both give the following output in the MongoDB collection:

{
   "date" : "2018-05-16",
   "name" : "John",
   "criteria" : "[{age:35},{birthyear:1983}]"
}

Can you suggest a better way? P.S. I am new to Python.

Thanks in advance.

cartman619
  • In the csv, the "criteria" values are not valid json because the strings (age, birthyear) aren't quoted. Is it possible for you to fix that? – Jacques Kvam May 16 '18 at 06:11

1 Answer


As noted, the main issue is that the JSON "like" data within the "string" for criteria lacks quotes around the keys. With quotes properly in place you can parse the string into a list, and the data is structured how you want.
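For illustration, json.loads() rejects the unquoted form but accepts it once the keys are double-quoted:

import json

json.loads('[{age:35},{birthyear:1983}]')        # raises json.JSONDecodeError
json.loads('[{"age":35},{"birthyear":1983}]')    # [{'age': 35}, {'birthyear': 1983}]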

You could actually run map() with re.sub() over the existing list and replace the criteria with the parsed version.

Given the source data in the form you stated:

date,name,criteria
2018-05-16,John,"[{age:35},{birthyear:1983}]"
2018-05-16,Jane,"[{age:36},{birthyear:1982}]"

Then the important parts are:

import re

df = pd.read_csv("file.csv")
records = json.loads(df.to_json(orient='records'))

pattern = r"({|,)(?:\s*)(?:')?([A-Za-z_$\.][A-Za-z0-9_ \-\.$]*)(?:')?(?:\s*):"

records = map(lambda x:
  dict(x.items() +
    {
      'criteria': json.loads(
        re.sub(pattern, "\\1\"\\2\":", x['criteria'])
      )
    }.items()
  ),
  records
)

This basically goes through each item in the list produced earlier and runs a substitution on the "string" to quote the keys in the objects, then of course parses the now-valid JSON string into a list of dictionary objects.
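Note that the snippet above is Python 2 style; dict(x.items() + {...}.items()) fails on Python 3 because dict views cannot be concatenated with +, and map() returns a lazy iterator there. A roughly equivalent Python 3 sketch, assuming the same records list and pattern as above, would be:

records = [
    # merge the original fields with the re-parsed criteria
    { **x, 'criteria': json.loads(re.sub(pattern, "\\1\"\\2\":", x['criteria'])) }
    for x in records
]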

That would make data like:

[{u'criteria': [{u'age': 35}, {u'birthyear': 1983}],
  u'date': u'2018-05-16',
  u'name': u'John'},
 {u'criteria': [{u'age': 36}, {u'birthyear': 1982}],
  u'date': u'2018-05-16',
  u'name': u'Jane'}]

You can then pass this to insert_many() to create the documents in the collection, keeping the format you want.

db.tmp_collection.insert_many(records)
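These snippets assume a db handle is already set up; a minimal connection sketch (the connection string and database name here are only examples) would be:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')  # adjust to your server
db = client['test']                                # example database name

After that, the insert_many() call above works as written.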

Attribution for the regular expression pattern used here goes to regular expression to add double quotes around keys in javascript.
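To see what the substitution does to a single criteria value, a quick check (using the sample string from the question's original CSV) looks like this:

import re

pattern = r"({|,)(?:\s*)(?:')?([A-Za-z_$\.][A-Za-z0-9_ \-\.$]*)(?:')?(?:\s*):"

re.sub(pattern, "\\1\"\\2\":", "[{age:35},{birthyear:1983}]")
# -> '[{"age":35},{"birthyear":1983}]'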

Personally I would take that a little further and at least parse the date to a datetime:

records = map(lambda x:
  dict(x.items() +
    {
      'date': datetime.strptime(x['date'], '%Y-%m-%d'),
      'criteria': json.loads(
        re.sub(pattern, "\\1\"\\2\":", x['criteria'])
      )
    }.items()
  ),
  records
)

MongoDB will use a BSON Date when inserted into the collection, and that's a lot more useful than a string.
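For example, with real BSON dates a range query works directly (the date range here is only an illustration):

from datetime import datetime

db.tmp_collection.find({
    'date': { '$gte': datetime(2018, 5, 1), '$lt': datetime(2018, 6, 1) }
})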

And again, "personally" I would not be using "named keys" inside a list for MongoDB. Instead I would rather "remap" to something more standard like "k" and "v" as in:

records = map(lambda x:
  dict(x.items() +
    {
      'date': datetime.strptime(x['date'], '%Y-%m-%d'),
      'criteria':
      [i for s in map(lambda y: [{ 'k': k, 'v': v } for k, v in y.iteritems()], json.loads(
        re.sub(pattern, "\\1\"\\2\":", x['criteria'])
      )) for i in s]
    }.items()
  ),
  records
)

Which gives a structure like:

[{u'criteria': [{'k': u'age', 'v': 35}, {'k': u'birthyear', 'v': 1983}],
  u'date': datetime.datetime(2018, 5, 16, 0, 0),
  u'name': u'John'},
 {u'criteria': [{'k': u'age', 'v': 36}, {'k': u'birthyear', 'v': 1982}],
  u'date': datetime.datetime(2018, 5, 16, 0, 0),
  u'name': u'Jane'}]

The main reason is that querying with MongoDB is going to be a lot more useful if the path is consistent like that.
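As an illustration, with the "k"/"v" shape a single $elemMatch query covers any criterion without needing a different key path per field (the values here are just examples):

db.tmp_collection.find({
    'criteria': { '$elemMatch': { 'k': 'age', 'v': { '$gte': 35 } } }
})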

Neil Lunn
  • Getting error `JSONDecodeError: Expecting value: line 1 column 24 (char 23)` when inserting into the collection. Tried using json.dumps in place of json.loads, but that gives a `TypeError: unsupported operand type(s) for +: 'dict_items' and 'dict_items'` error – cartman619 May 16 '18 at 09:40
  • @cartman619 Not from the items you included in the question. Works fine for me, otherwise I would not have posted it here. I don't have any code that says `dict_items`. Run on something different to what you actually asked at your own peril. – Neil Lunn May 16 '18 at 09:43
  • @cartman619 Did you change your CSV data source? You changed it in the question. If you did actually add the quotes within the CSV itself then you don't need the `re.sub` line in here, and can merely pass the `x['criteria']` to the `json.loads()`. The whole point of that substitution was that your original posted data does not have those quotes on the keys to make it valid JSON for parsing. – Neil Lunn May 17 '18 at 01:49
  • Yes, I have changed the CSV to reflect the correctness, sorry for that. I commented out the `re.sub` part and still get the `JSONDecodeError: Expecting value: line 1 column 24 (char 23)` error when inserting into the collection. Also, is there any way I can print `records` after the `json.loads()`, so that I can check why this error occurs? – cartman619 May 17 '18 at 09:05
  • @cartman619 What do you mean by "when inserting the collection"? `records` is simply a list of dicts and that is all you need to pass in to `.insert_many()` without doing anything else. That's exactly what I am showing you but you seem to be implying you are trying to run `loads()` on that content. You don't do that as the pymongo method expects "python data" and not JSON. And of course you can print it. I'm showing the `pprint` output within the answer, so that's how I'm doing it. – Neil Lunn May 17 '18 at 09:08