I want to extract the contents of an XML file based on the value of a type field. The data was originally a JSON file, which I converted to XML. The file has the fields 'body', 'id', 'type' and 'snippets', and I want to extract the contents of all of these fields whenever type='summary'. Here is the code I have so far:
from bs4 import BeautifulSoup

def load_extract(data):
    path = ""  # path to the converted XML file
    soup = BeautifulSoup(open(path), "html.parser")
    q1 = []       # body texts
    qtype = []    # type texts
    for q in soup.find_all('body'):
        q1.append(q.text)
    for types in soup.find_all('type'):
        qtype.append(types.text)
    snippets = soup.find_all('snippets')
    print("extracting the summary type questions")
    # collect the indices of the summary-type questions
    summary_ids = []
    summary_dict = []
    for i in range(len(qtype)):
        if qtype[i] == 'summary':
            summary_ids.append(i)
    # map each summary body to its snippets by index
    for j in summary_ids:
        summary_dict.append({q1[j]: snippets[j]})
    return summary_dict
The code works fine on a small set, but on a larger set len(q1) is not equal to len(snippets), which breaks the index-based mapping. I don't know whether the training data actually lacks snippets for some of the bodies, but either way the mismatch ruins the mapping and hence the extraction. I was thinking I could instead extract the body, id and snippets only from records whose type='summary'. I would appreciate any help!
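To illustrate the per-record idea I have in mind: instead of building three parallel lists over the whole document, walk each record element and read body, id and snippets together, so a record with no snippets just gets None rather than shifting all later indices. This is only a sketch: I am assuming here that the JSON-to-XML conversion wrapped each record in a parent element (hypothetically named `<question>` below), which may not match the real file.

```python
from bs4 import BeautifulSoup

# Inline sample mimicking the converted XML; the real file may name the
# record wrapper differently ('question' is an assumption).
sample = """
<questions>
  <question>
    <id>q1</id>
    <body>What is X?</body>
    <type>summary</type>
    <snippets>snippet text 1</snippets>
  </question>
  <question>
    <id>q2</id>
    <body>What is Y?</body>
    <type>factoid</type>
    <snippets>snippet text 2</snippets>
  </question>
  <question>
    <id>q3</id>
    <body>What is Z?</body>
    <type>summary</type>
    <!-- this record has no snippets -->
  </question>
</questions>
"""

def extract_summaries(xml_text):
    soup = BeautifulSoup(xml_text, "html.parser")
    results = []
    # iterate record by record, so fields stay paired even when one is missing
    for q in soup.find_all("question"):
        type_tag = q.find("type")
        if type_tag is None or type_tag.text.strip() != "summary":
            continue
        snip = q.find("snippets")
        results.append({
            "id": q.find("id").text.strip(),
            "body": q.find("body").text.strip(),
            # None when the record has no snippets, instead of shifting indices
            "snippets": snip.text.strip() if snip is not None else None,
        })
    return results

summaries = extract_summaries(sample)
```

With this shape the q1/snippets length mismatch can't occur, because missing snippets are handled inside the record that lacks them.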