
I have a problem importing a very big XML file with 36,196,662 lines. I am trying to build a Neo4j graph database from this XML file with py2neo. My XML file looks like this:

https://i.stack.imgur.com/mhZ1z.jpg

and my Python code to import the XML data into Neo4j is as follows:

from xml.dom import minidom
from py2neo import Graph, Node, Relationship, authenticate
from py2neo.packages.httpstream import http
http.socket_timeout = 9999
import codecs

authenticate("localhost:7474", "neo4j", "******")

graph = Graph("http://localhost:7474/db/data/")

xml_file = codecs.open("User_profilesL2T1.xml","r", encoding="latin-1")

xml_doc = minidom.parseString (codecs.encode (xml_file.read(), "utf-8"))

#xml_doc = minidom.parse(xml_file)
persons = xml_doc.getElementsByTagName('user')
label1 = "USER"

# Adding Nodes
for person in persons:
    if person.getElementsByTagName("id")[0].firstChild:
       Id_User=person.getElementsByTagName("id")[0].firstChild.data
    else: 
       Name="NO ID"
    print ("******************************USER***************************************")
    print(Id_User)

    print ("*************************")
    if person.getElementsByTagName("name")[0].firstChild:
       Name=person.getElementsByTagName("name")[0].firstChild.data
    else: 
       Name="NO NAME"   
   # print("Name :",Name)


    print ("*************************")
    if person.getElementsByTagName("screen_name")[0].firstChild:
       Screen_name=person.getElementsByTagName("screen_name")[0].firstChild.data
    else: 
       Screen_name="NO SCREEN_NAME" 
  #   print("Screen Name :",Screen_name)

    print ("*************************") 
    if person.getElementsByTagName("location")[0].firstChild:
       Location=person.getElementsByTagName("location")[0].firstChild.data
    else: 
       Location="NO Location"   
 #    print("Location :",Location)


    print ("*************************")
    if person.getElementsByTagName("description")[0].firstChild:
       Description=person.getElementsByTagName("description")[0].firstChild.data
    else: 
       Description="NO description" 
  #   print("Description :",Description)


    print ("*************************") 
    if person.getElementsByTagName("profile_image_url")[0].firstChild:
       Profile_image_url=person.getElementsByTagName("profile_image_url")[0].firstChild.data
    else: 
       Profile_image_url="NO profile_image_url" 
   # print("Profile_image_url :",Profile_image_url)

    print ("*************************")
    if person.getElementsByTagName("friends_count")[0].firstChild:
       Friends_count=person.getElementsByTagName("friends_count")[0].firstChild.data
    else: 
       Friends_count="NO friends_count" 
 #    print("Friends_count :",Friends_count)


    print ("*************************")
    if person.getElementsByTagName("url")[0].firstChild:
       URL=person.getElementsByTagName("url")[0].firstChild.data
    else: 
       URL="NO URL" 
  #   print("URL :",URL)






    node1 = Node(label1,ID_USER=Id_User,NAME=Name,SCREEN_NAME=Screen_name,LOCATION=Location,DESCRIPTION=Description,Profile_Image_Url=Profile_image_url,Friends_Count=Friends_count,URL=URL)
    graph.merge(node1)  

My problem is that when I run the code, it takes a very long time to import this file, almost a week. If anyone can help me import the data faster than that, I would be very grateful.

NB: My laptop configuration is: 4 GB RAM, 500 GB hard disk, Core i5.

ucmou

2 Answers


If you are importing data into a new database you may want to try the import-tool: https://neo4j.com/docs/operations-manual/current/#import-tool

In that case you should parse your XML file as you already do, but instead of using py2neo to insert the data into Neo4j, just write a CSV file and then call the import-tool afterwards.

See below a possible way to do it:

import csv
from xml.dom import minidom

def getAttribute(node, attribute, default=None):
    # return the text of the first matching child element, or default when it is empty
    attr = node.getElementsByTagName(attribute)[0]
    return attr.firstChild.data if attr.firstChild else default

# NOTE: minidom loads the whole document into memory; for a 36M-line file
# consider a streaming parser instead (see the comments below)
xml_doc = minidom.parse(open("users.xml"))
persons = xml_doc.getElementsByTagName('user')

attrs = ['name','screen_name','location','description','profile_image_url','friends_count','url']

mapping = {'user_id': 'user_id:ID(User)',
           'name': 'name:string',
           'screen_name': 'screen_name:string',
           'location': 'location:string',
           'description': 'description:string',
           'profile_image_url': 'profile_image_url:string',
           'friends_count': 'friends_count:int',
           'url': 'url:string'}

with open('users.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=mapping.values())
    writer.writeheader()
    for person in persons:
        # one CSV row per <user>, keyed by the import-tool header names
        user = {mapping[attr]: getAttribute(person, attr) for attr in attrs}
        user[mapping['user_id']] = getAttribute(person, 'id')
        writer.writerow(user)

Once you have converted the xml to a csv file, run the import-tool:

$ neo4j-import --into neo4j-community-3.0.3/data/databases/users.db --nodes:User users.csv

I guess you will also want to create relationships between nodes(?). You should read the import-tool docs and call the import-tool with CSV files for both nodes and relationships, as sketched below.
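For example, assuming you also extract follower pairs from the XML (the FOLLOWS type, the follows.csv file name and the IDs below are illustrative, not from the question), a relationships file would have a header referencing the same (User) ID space used in users.csv:

:START_ID(User),:END_ID(User)
12345,67890
67890,13579

and you would pass both files to the import-tool in a single run:

$ neo4j-import --into neo4j-community-3.0.3/data/databases/users.db --nodes:User users.csv --relationships:FOLLOWS follows.csv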

svidela
  • Thank you Sancho for your response, but I don't know how to do that. If you can suggest code to do it, that would be very nice. – ucmou Jun 30 '16 at 23:13
  • I just edited my previous response with an example. I hope it helps. – svidela Jul 01 '16 at 15:45
  • Thanks a lot Sancho for your help – ucmou Jul 02 '16 at 20:52
  • Hello Sancho, when I tested your solution it takes a long time to convert from XML to CSV and my laptop bugs every time! – ucmou Jul 03 '16 at 09:07
  • I don't know what you mean by "my laptop bugs every time"... Does it run out of memory? I don't have that much experience parsing big XML files, but it seems that you should try some alternatives to minidom which avoid loading the whole XML file into memory: http://stackoverflow.com/questions/9856163/using-lxml-and-iterparse-to-parse-a-big-1gb-xml-file, http://boscoh.com/programming/reading-xml-serially.html, https://github.com/martinblech/xmltodict – svidela Jul 04 '16 at 14:24
  • "My laptop bugs every time" means that my computer freezes every time I run the code. – ucmou Jul 04 '16 at 15:14
  • Right, so I guess you're running out of memory when minidom loads the XML file. You may want to take a look at the links I posted above – svidela Jul 04 '16 at 19:30

I think you should use a streaming parser (e.g. along the lines of the sketch below); otherwise you may run out of memory on the Python side as well.
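
Here is a minimal sketch with xml.etree.ElementTree.iterparse from the standard library; the file name and tag names are taken from the question, everything else is an assumption and not tested against your data:

import xml.etree.ElementTree as etree

def iter_users(path):
    # iterparse yields each element as its closing tag is read,
    # so the whole 36M-line document is never held in memory at once
    for event, elem in etree.iterparse(path, events=("end",)):
        if elem.tag == "user":
            yield {child.tag: child.text for child in elem}
            elem.clear()  # release this <user>'s children immediately

for user in iter_users("User_profilesL2T1.xml"):
    print(user.get("id"), user.get("screen_name"))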

Also, I recommend doing transactions in Neo4j with batches of 10k to 100k updates per transaction (see the sketch at the end of this answer).

Don't store "NO xxxx" fields; just leave them off. They are a waste of space and effort.

I don't know how merge(node) works. I recommend creating a unique constraint on :User(userId) and using a Cypher query like this:

UNWIND {data} AS row
MERGE (u:User {userId: row.userId}) ON CREATE SET u += row

where the {data} parameter is a list (e.g. 10k entries) of dictionaries with the properties.
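
A sketch of how that could look with py2neo, assuming py2neo 3 (which the question's Graph.merge call suggests) and reusing the hypothetical iter_users generator sketched above; the 10k batch size follows the recommendation above and is illustrative:

from py2neo import Graph, authenticate

authenticate("localhost:7474", "neo4j", "******")
graph = Graph("http://localhost:7474/db/data/")

# one-time setup: the unique constraint also creates an index on userId
graph.run("CREATE CONSTRAINT ON (u:User) ASSERT u.userId IS UNIQUE")

query = """
UNWIND {data} AS row
MERGE (u:User {userId: row.userId}) ON CREATE SET u += row
"""

batch = []
for user in iter_users("User_profilesL2T1.xml"):
    # drop empty fields instead of storing "NO xxxx" placeholders
    row = {k: v for k, v in user.items() if v}
    if "id" not in row:
        continue  # a user without an id cannot be merged
    row["userId"] = row.pop("id")
    batch.append(row)
    if len(batch) >= 10000:
        graph.run(query, data=batch)  # one transaction per 10k users
        batch = []
if batch:
    graph.run(query, data=batch)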

Michael Hunger
  • Thank you Michael for your response. About your recommendation, how can I do that with py2neo? If you can help me with the code I will be very grateful. – ucmou Jul 06 '16 at 01:53