I have parsed an XML file containing some part-of-speech tagged text and, since the file is not perfect, I am adding the data to a pandas dataframe in order to clean it later.

At this point I will need to duplicate some rows based on certain values and modify only one or two values in the duplicated row and in the original one.

This is what the actual dataframe looks like:

In [8]: df.head()
Out[8]: 
      text     lemma       pos markintext  doublemma  multiwordexpr nodetail
0      Per       per      epsf          0          0              0        0
1   correr   correre    vta2fp          0          0              0        0
2  miglior  migliore      a2fp          0          0              0        0
3    acque     acqua     sf1fp          0          0              0        0
4     alza    alzare  vta1ips3          0          0              0        0

Now if, for example, multiwordexpr is equal to 1, I want to duplicate the row and insert it in the dataframe. So, I would like to go from this:

In [10]: df[df['multiwordexpr'] == 1]
Out[10]: 
          text     lemma      pos markintext  doublemma  multiwordexpr
16    dietro a  dietro a   eilksl          0          0              1  

to this:

          text     lemma      pos markintext  doublemma  multiwordexpr
16    dietro    dietro a   eilksl          0          0              1  
17    a         dietro a   eilksl          0          0              1  

This is my code

#!/usr/bin/python
# -*- coding: latin-1 -*-

from lxml import etree
import locale
import sys
import os
import glob
import pandas as pd
import numpy as np
import re
from string import punctuation
import random
import unicodedata

def manage_tail(taillist):
    z = []
    for line in taillist:
        y = list(line.strip())
        for punkt in y:
            z.append(punkt)
    return z if len(z) > 0 else 0

def checkmark(text):
    pattern = re.compile(r"\w|'", re.UNICODE)
    if re.match(pattern,text[-1]):
        return 0
    else:
        return text[-1]

# glob does not expand "~", so expand it explicitly
path = os.path.expanduser("~/working_corpus/")
output_path = os.path.expanduser("~/devel_output/")
pattern = "*.xml"

docs = glob.glob(os.path.join(path, pattern))
parser = etree.XMLParser(load_dtd= True,resolve_entities=True)

x = []
for d in docs:

    tree = etree.parse(d,parser)

    for node in [z for z in  tree.iterfind(".//LM")]:
        text = node.text.strip()
        multiwordexpr = 1 if (' ' in text.replace('  ', ' ')) else 0
        lemma = node.get('lemma')
        markintext = checkmark(text)
        pos = node.get('catg')
        doublemma = 1 if (node.getparent() is not None and node.getparent().tag == 'LM1') else 0
        # use 0 (not None) when there is no tail, so the later "!= 0" check works
        nodetail = manage_tail(node.tail.splitlines()) if node.tail else 0
        row = [text,lemma,pos,markintext,doublemma,multiwordexpr,nodetail]
        x.append(row)


df = pd.DataFrame(x,columns=('text','lemma','pos','markintext','doublemma','multiwordexpr','nodetail'))

I've thought about something like the code below for handling the case in which nodetail is set (so not exactly the multiwordexpr problem, but the point is the same: how to efficiently add a row at an arbitrary position rather than at the end), but I don't know how to do it efficiently. I am looking for a function that, given one or more conditions, inserts a certain number of duplicated rows under the selected row and modifies one or two values in the other columns (in this case, it splits the text and duplicates the row).

l = []
i = 0
while i < len(df):
    if (df.iloc[i,6] != 0):
        ntail = df.iloc[i,6]
        df.iloc[i,6] = 0
        i += 1
        for w in range(len(ntail)):
            line = pd.DataFrame({'text': ntail[w],
            'lemma': ntail[w],
            'pos':'NaN',
            'markintext':0,
            'doublemma':0,
            'multiwordexpr':0,
            'nodetail':0},index=[i+w], columns=('text','lemma','pos','markintext','doublemma','multiwordexpr','nodetail'))
            l.append(line)
    else:
        pass
    i += 1
    sys.stdout.write("\r%d/%d" % (i,len(df)))
    sys.stdout.flush()
print("...done extracting.")

for i in range(len(l)):    
    start = int((l[i].index[0])-1)
    end = int(l[i].index[0])
    df = pd.concat([df.loc[:start], l[i], df.loc[end:]]).reset_index(drop=True)
    sys.stdout.write("\r%d/%d" % (i,len(l)))
    sys.stdout.flush()
Angelo

1 Answer

EDIT: You can preallocate your df. The required length is len(df) + df.multiwordexpr.sum() (assuming every multiword expression splits into exactly two words); you can then use .loc[] to write each row at its final position. You still have to iterate over your original df and split the rows, but that might be faster.

row = ['','','',0,0,0,0]
# calculate the final length from your original df
df_len = len(orig_df) + orig_df.multiwordexpr.sum()

# allocate a new df filled with placeholder rows
result_df = pd.DataFrame([row for x in range(df_len)],
                      columns=columns)
# write to it instead of appending; index is the target row label
result_df.loc[index] = ['Per','per','epsf',0,0,0,0]

EDIT END
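
The preallocation idea from the edit above could be sketched end to end roughly as follows. This is only an illustration under assumptions: the sample rows are made up from the question's data, the output buffer is a NumPy object array rather than a preallocated DataFrame (a swapped-in choice, so that each write is a plain array assignment), and the output size is computed by counting words instead of using multiwordexpr.sum(). The running counter `pos` plays the role of the shifted index.

```python
import numpy as np
import pandas as pd

columns = ['text', 'lemma', 'pos', 'markintext', 'doublemma',
           'multiwordexpr', 'nodetail']
# Hypothetical sample data modelled on the question's dataframe
orig_df = pd.DataFrame(
    [['Per', 'per', 'epsf', 0, 0, 0, 0],
     ['dietro a', 'dietro a', 'eilksl', 0, 0, 1, 0],
     ['correr', 'correre', 'vta2fp', 0, 0, 0, 0]],
    columns=columns)

# One output row per whitespace-separated word in 'text'
n_rows = int(orig_df['text'].str.split().str.len().sum())
out = np.empty((n_rows, len(columns)), dtype=object)

pos = 0  # next free row in the preallocated buffer
for row in orig_df.itertuples(index=False, name=None):
    for word in row[0].split():
        out[pos, :] = list(row)  # copy the whole original row
        out[pos, 0] = word       # then overwrite 'text' with a single word
        pos += 1

# infer_objects() turns the object columns back into numeric dtypes where possible
result_df = pd.DataFrame(out, columns=columns).infer_objects()
```

Every write lands on an already allocated slot, so no rows are shifted or copied during the loop; the cost is one pass to count words and one pass to fill the buffer.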

Maybe creating a new dataframe and only appending to it will be faster than modifying the original?

You could iterate your original df and append to a new one while splitting the multiwordexpr rows. No idea if that will perform better though.

import pandas as pd
columns = ['text','lemma','pos','markintext','doublemma','multiwordexpr','nodetail']

rows = [['Per','per','epsf',0,0,0,0],
    ['dietro a','dietro a','eilksl',0,0,1,0],
    ['Per','per','epsf',0,0,0,0]]

orig_f = pd.DataFrame(rows,columns=columns)
df = pd.DataFrame(columns=columns)


for index, row in orig_f.iterrows():
    # check for multiwordexpr
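    if row['multiwordexpr'] == 1:
        s = row.copy()
        # first word goes to the copy, second word stays in the original row
        s['text']   = row['text'].split(' ')[0]
        row['text'] = row['text'].split(' ')[1]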
        df = df.append(s)
        df = df.append(row)

    else:
        df = df.append(row)

df = df.reset_index(drop=True)
#there are no more multi words
df.loc[df['multiwordexpr']==1, 'multiwordexpr'] = 0
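
On recent pandas versions, where `DataFrame.append` and `.ix` are no longer available, the same split could be sketched without an explicit loop. This is a different technique from the row-by-row append above: it uses `Series.str.split` together with `DataFrame.explode` (pandas 0.25 or later). The sample rows are the same illustrative ones, and `orig_df` / `split_df` are names chosen for this sketch.

```python
import pandas as pd

columns = ['text', 'lemma', 'pos', 'markintext', 'doublemma',
           'multiwordexpr', 'nodetail']
rows = [['Per', 'per', 'epsf', 0, 0, 0, 0],
        ['dietro a', 'dietro a', 'eilksl', 0, 0, 1, 0],
        ['Per', 'per', 'epsf', 0, 0, 0, 0]]
orig_df = pd.DataFrame(rows, columns=columns)

split_df = (
    orig_df
    # turn each text into a list of words ...
    .assign(text=orig_df['text'].str.split())
    # ... then emit one row per word, duplicating the other columns
    .explode('text')
    .reset_index(drop=True)
)
# after splitting there are no more multiword rows
split_df['multiwordexpr'] = 0
```

Dropping the last line keeps the multiwordexpr flag at 1 on the split rows, which matches the output shown in the question.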
pho
  • It's faster than my solution but still takes quite a while. I am starting to think that maybe the whole way I framed the solution (i.e. modifying a big dataframe) is wrong. On the other hand, I don't want to make life hard while parsing the xml file, since it's not clean at all. – Angelo Aug 28 '15 at 14:07
  • I am still experimenting, but it seems that the best way is to append to a vector and then create a new df from a list of lists (built by appending df.col.values). – Angelo Aug 28 '15 at 15:48
  • Why not preallocate your df (as you can calculate the needed size) and write to it directly? That should give you O(1) on the insert. – pho Aug 29 '15 at 15:21
  • But anyway I would need to shift the index, right? So, I have a df of len x and I know that I need, let's say, x+10. If I need to insert a line at position 2, I need to shift the index from 2 on. Am I right? – Angelo Aug 31 '15 at 06:52
  • Yes, you just have to increment your index for each insert. – pho Sep 04 '15 at 09:37
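
The list-of-lists approach mentioned in the comments could look roughly like the sketch below. It is an illustration, not the code actually used in the thread: the sample rows are invented, nodetail is assumed to hold either 0 or a list of punctuation characters (as produced by manage_tail in the question), and the punctuation rows follow the layout of the question's own attempt.

```python
import pandas as pd

columns = ['text', 'lemma', 'pos', 'markintext', 'doublemma',
           'multiwordexpr', 'nodetail']
# Hypothetical sample: the second row is a multiword expression followed by a comma
orig_df = pd.DataFrame(
    [['Per', 'per', 'epsf', 0, 0, 0, 0],
     ['dietro a', 'dietro a', 'eilksl', 0, 0, 1, [',']],
     ['correr', 'correre', 'vta2fp', 0, 0, 0, 0]],
    columns=columns)

out_rows = []
for text, lemma, pos, mark, dbl, mwe, tail in orig_df.itertuples(index=False, name=None):
    # one row per word; nodetail is reset because the tail gets its own rows
    for word in text.split():
        out_rows.append([word, lemma, pos, mark, dbl, mwe, 0])
    # tail is 0 (falsy) or a list of punctuation characters
    if tail:
        for punkt in tail:
            out_rows.append([punkt, punkt, 'NaN', 0, 0, 0, 0])

# build the dataframe once, at the end
result_df = pd.DataFrame(out_rows, columns=columns)
```

Appending to a Python list is cheap and the DataFrame is constructed a single time, so no index shifting is needed: rows simply end up in the order they were appended.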