I have the following 10 entries in an .xml file:

<EmpInfo Location="Pune" Name="John">
<EmpInfo>
<EmpInfo Location="Pune" Name="Sam">
<EmpInfo>
<EmpInfo Location="Pune" Name="George">
<EmpInfo>
<EmpInfo Location="Mumbai" Name="Sera">
<EmpInfo>
<EmpInfo Location="Delhi" Name="Jon">
<EmpInfo>
<EmpInfo Location="Mumbai" Name="Josh">
<EmpInfo>
<EmpInfo Location="Pune" Name="Alex">
<EmpInfo>
<EmpInfo Location="Mumbai" Name="Lee">
<EmpInfo>
<EmpInfo Location="Delhi" Name="Ron">
<EmpInfo>
<EmpInfo Location="Mumbai" Name="Sara">
<EmpInfo>

I tried it this way, but it is not working:

counter=0
infoDict={}
pointers = header.getElementsByTagName('EmpInfo')
for pointer in pointers:
    namelist=[]
    pointerobj={}
    if counter==0:
        name=pointer.getAttribute("Location")        
        basename=pointer.getAttribute("Name")
        namelist.append(name)
        basenamelist.append(basename)            
    else:
        basename=pointer.getAttribute("Location")
        if pointer.getAttribute("Location") in basenamelist:
            name=pointer.getAttribute("Name")

            namelist.append(name)
        else:
            name=pointer.getAttribute("Name")
        namelist.append(name)
    #basenamelist.append(basename)
    print("Location:: ",basename)
    print("Name:: ",namelist)
    counter=counter+1
infoDict.update({basename:namelist})

I want the result as a dictionary like this:

infoDict = {
    Pune : [John,Sam,George,Alex],
    Mumbai : [Sera,Josh,Lee,Sara],
    Delhi : [Jon,Ron]
}

I'm trying to insert this result into MongoDB. In the dictionary, the key must be the location and the value should be an array of names. My actual application is very long, but I want to complete this small module first.

Kittu

1 Answer

Here is a solution using re for regular expressions and pandas for grouping (it reads a file named my_file.txt; replace that with your file name):

import re
import pandas as pd

with open("my_file.txt", 'r') as f:
    file_str = f.read()

# Capture (Location, Name) pairs from each opening tag
tuples = re.findall(r'<EmpInfo Location="([A-Za-z]+)" Name="([A-Za-z]+)">', file_str)
df = pd.DataFrame(tuples)
# Group names (column 1) by location (column 0), keeping first-seen order
df_grouped = df.groupby(0, sort=False)[1].apply(list)
df_grouped
#0
#Pune      [John, Sam, George, Alex]
#Mumbai      [Sera, Josh, Lee, Sara]
#Delhi                    [Jon, Ron]
#Name: 1, dtype: object

Or, if you prefer, a two-liner (after the imports):

import re
import pandas as pd
with open("my_file.txt", 'r') as f:
    df_grouped = pd.DataFrame(re.findall(r'<EmpInfo Location="([A-Za-z]+)" Name="([A-Za-z]+)">', f.read())).groupby(0, sort=False)[1].apply(list)

For some fancy printing (instead of printing you can write it into a new file):

# Series.iteritems() was removed in pandas 2.0; items() works on old and new versions
for idx, row in df_grouped.items():
    print(f"{idx} : [{','.join(row)}]")
#Pune : [John,Sam,George,Alex]
#Mumbai : [Sera,Josh,Lee,Sara]
#Delhi : [Jon,Ron]
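
If pandas is not available, the same grouping can probably be done with the standard library alone. The following is only a minimal sketch, not a tested drop-in for your application: it reuses the regular expression from above, and the names sample, PATTERN and group_names_by_location are made up for illustration. The sample text is a shortened copy of the data in the question; in practice you would read the text from your file instead. It builds the plain dict of lists that the question asks for.

```python
import re
from collections import defaultdict

# Shortened copy of the question's data (assumption: your real file has the same tag shape).
sample = '''<EmpInfo Location="Pune" Name="John">
<EmpInfo Location="Pune" Name="Sam">
<EmpInfo Location="Mumbai" Name="Sera">
<EmpInfo Location="Delhi" Name="Jon">
<EmpInfo Location="Mumbai" Name="Josh">
<EmpInfo Location="Delhi" Name="Ron">
'''

# Same pattern as in the pandas version: capture the Location and Name values.
PATTERN = re.compile(r'<EmpInfo Location="([A-Za-z]+)" Name="([A-Za-z]+)">')


def group_names_by_location(text):
    """Return {location: [name, ...]}, with locations in order of first appearance."""
    grouped = defaultdict(list)
    for location, name in PATTERN.findall(text):
        grouped[location].append(name)
    # Convert to a plain dict so it prints and serializes like an ordinary mapping.
    return dict(grouped)


info_dict = group_names_by_location(sample)
print(info_dict)
```

Because each location key is created the first time it is seen and names are only ever appended, there is no need for the counter or the separate "first element" branch from the question's loop.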
ibarrond
  • I'm using Python 3.7.5, and import pandas as pd raises an error. How do I install this module? – Kittu Feb 06 '20 at 10:21