I have parsed an XML file containing some part-of-speech tagged text and, since the file is not perfect, I am adding the data to a pandas dataframe in order to clean it later.
At this point I will need to duplicate some rows based on certain values and modify only one or two values in the duplicated row and in the original one.
This is what the actual dataframe looks like:
In [8]: df.head()
Out[8]:
      text     lemma       pos  markintext  doublemma  multiwordexpr  nodetail
0      Per       per      epsf           0          0              0         0
1   correr   correre    vta2fp           0          0              0         0
2  miglior  migliore      a2fp           0          0              0         0
3    acque     acqua     sf1fp           0          0              0         0
4     alza    alzare  vta1ips3           0          0              0         0
Now if, for example, multiwordexpr is equal to 1, I want to duplicate the row and insert the copy into the dataframe. So, I would like to go from this:
In [10]: df[df['multiwordexpr'] == 1]
Out[10]:
        text     lemma     pos  markintext  doublemma  multiwordexpr
16  dietro a  dietro a  eilksl           0          0              1
to this:
        text     lemma     pos  markintext  doublemma  multiwordexpr
16    dietro  dietro a  eilksl           0          0              1
17         a  dietro a  eilksl           0          0              1
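To make the target concrete, here is a minimal sketch of that transformation. It assumes pandas 0.25 or later (for `DataFrame.explode`) and that splitting on whitespace is the right way to break up a multiword expression; the function name `split_multiword_rows` and the small demo frame are hypothetical, not part of my real code.

```python
import pandas as pd


def split_multiword_rows(df, text_col="text", flag_col="multiwordexpr"):
    """Duplicate rows flagged as multiword, one output row per token (sketch)."""
    out = df.copy()
    # Wrap every text in a list; flagged rows get one list element per token.
    tokens = [
        text.split() if flag == 1 else [text]
        for text, flag in zip(out[text_col], out[flag_col])
    ]
    out[text_col] = pd.Series(tokens, index=out.index)
    # explode() repeats the other columns for each list element.
    return out.explode(text_col).reset_index(drop=True)


# Hypothetical demo data modelled on the frame shown above.
demo = pd.DataFrame({
    "text": ["Per", "dietro a", "acque"],
    "lemma": ["per", "dietro a", "acqua"],
    "pos": ["epsf", "eilksl", "sf1fp"],
    "multiwordexpr": [0, 1, 0],
})
result = split_multiword_rows(demo)
print(result)
```

Every column other than `text` is copied unchanged into the duplicated rows, which matches the output I want above.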
This is my code:
#!/usr/bin/python
# -*- coding: latin-1 -*-
from lxml import etree
import locale
import sys
import os
import glob
import pandas as pd
import numpy as np
import re
from string import punctuation
import random
import unicodedata
def manage_tail(taillist):
    # Collect every character found in the tail lines; return 0 if there are none.
    z = []
    for line in taillist:
        y = list(line.strip())
        for punkt in y:
            z.append(punkt)
    return z if len(z) > 0 else 0
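def checkmark(text):
    # Return 0 if the last character is a word character or an apostrophe,
    # otherwise return that last character.
    pattern = re.compile(r"\w|'", re.UNICODE)
    if re.match(pattern, text[-1]):
        return 0
    else:
        return text[-1]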
path = "~/working_corpus/"
output_path = "~/devel_output/"
f = "*.xml"
docs = [f for f in glob.glob(os.path.join(path,f))]
parser = etree.XMLParser(load_dtd= True,resolve_entities=True)
x = []
for d in docs:
    tree = etree.parse(d,parser)
    for node in tree.iterfind(".//LM"):
        text = node.text.strip()
        multiwordexpr = 1 if (' ' in text.replace(' ', ' ')) else 0
        lemma = node.get('lemma')
        markintext = checkmark(text)
        pos = node.get('catg')
        doublemma = 1 if (node.getparent() is not None and node.getparent().tag == 'LM1') else 0
        # Use 0 (not None) when there is no tail, so the later "!= 0" check skips the row.
        nodetail = manage_tail(node.tail.splitlines()) if node.tail else 0
        row = [text,lemma,pos,markintext,doublemma,multiwordexpr,nodetail]
        x.append(row)
df = pd.DataFrame(x,columns=('text','lemma','pos','markintext','doublemma','multiwordexpr','nodetail'))
I've thought about something like this for handling the case in which nodetail is non-zero. It is not exactly the multiwordexpr problem, but the point is the same: how to add a row at an arbitrary position rather than at the end. I don't know how to do this efficiently. I am looking for a function that, given one or more conditions, inserts a certain number of duplicated rows under the selected row and modifies one or two values in the other columns (in this case, it splits the text and duplicates the row).
l = []
i = 0
while i < len(df):
    if (df.iloc[i,6] != 0):
        ntail = df.iloc[i,6]
        df.iloc[i,6] = 0
        i += 1
        # Build one new single-row frame per character in the tail.
        for w in range(len(ntail)):
            line = pd.DataFrame({'text': ntail[w],
                                 'lemma': ntail[w],
                                 'pos':'NaN',
                                 'markintext':0,
                                 'doublemma':0,
                                 'multiwordexpr':0,
                                 'nodetail':0},
                                index=[i+w],
                                columns=('text','lemma','pos','markintext','doublemma','multiwordexpr','nodetail'))
            l.append(line)
    else:
        i += 1
    sys.stdout.write("\r%d/%d" % (i,len(df)))
    sys.stdout.flush()
print("...done extracting.")
for i in range(len(l)):
    start = int((l[i].index[0])-1)
    end = int(l[i].index[0])
    # .loc slices by label (inclusive), as .ix did on this integer index.
    df = pd.concat([df.loc[:start], l[i], df.loc[end:]]).reset_index(drop=True)
    sys.stdout.write("\r%d/%d" % (i,len(l)))
    sys.stdout.flush()
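For comparison, the kind of function I have in mind for the nodetail case might look like the sketch below. It is only an assumption about a workable approach, not something I have benchmarked: instead of concatenating into the dataframe once per inserted row, it walks the rows once, collects the original and the new rows in a plain list, and builds a new dataframe at the end. The name `expand_tail_rows` and the demo data are hypothetical; the values for the inserted rows are the same ones my loop above uses.

```python
import pandas as pd


def expand_tail_rows(df, tail_col="nodetail"):
    """Insert one row per tail character directly under its source row (sketch)."""
    columns = list(df.columns)
    records = []
    for row in df.to_dict("records"):
        tail = row[tail_col]
        has_tail = isinstance(tail, (list, tuple)) and len(tail) > 0
        # The original row is kept, with its tail marker reset to 0.
        row[tail_col] = 0
        records.append(row)
        if has_tail:
            for mark in tail:
                # Same filler values as in the loop above.
                records.append({
                    "text": mark,
                    "lemma": mark,
                    "pos": "NaN",
                    "markintext": 0,
                    "doublemma": 0,
                    "multiwordexpr": 0,
                    tail_col: 0,
                })
    # Build the frame once, so no repeated pd.concat calls are needed.
    return pd.DataFrame(records, columns=columns)


# Hypothetical demo data with one row carrying two tail characters.
demo_tail = pd.DataFrame({
    "text": ["Per", "acque", "alza"],
    "lemma": ["per", "acqua", "alzare"],
    "pos": ["epsf", "sf1fp", "vta1ips3"],
    "markintext": [0, 0, 0],
    "doublemma": [0, 0, 0],
    "multiwordexpr": [0, 0, 0],
    "nodetail": pd.Series([0, [",", "!"], 0]),
})
expanded = expand_tail_rows(demo_tail)
print(expanded)
```

Because the new rows are appended in order while iterating, their positions never need to be computed from index arithmetic, which is the part of my own attempt I am least sure about.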