
I want to extract the contents of an XML file based on the value of the 'type' field. The file was originally JSON, which I converted to XML. It has the fields 'body', 'id', 'type' and 'snippets'. I want to extract the contents of all these fields when type='summary'. My code so far:

from bs4 import BeautifulSoup

def load_extract(data):
    path=""
    soup = BeautifulSoup(open(path),"html.parser")
    q1=[]
    qtype=[]
    snippets=[]
    # Collect every body, type and snippets tag in document order
    for q in soup.findAll('body'):
        q1.append(q.text)
    for types in soup.findAll('type'):
        qtype.append(types.text)
    snippets=soup.findAll('snippets')
    summary_ids=[]
    summary_dict=[]
    for i in range(0, len(qtype)):
        print("extracting the summary type question")
        if qtype[i]=='summary':
            summary_ids.append(i)
    # Pairs body and snippets by position; assumes the lists line up one-to-one
    for j in summary_ids:
        summary_dict.append({q1[j]:snippets[j]})
    return summary_dict

The code works fine on a small set, but with a large set len(q1) is not equal to len(snippets), so the index-based mapping between bodies and snippets breaks and the extraction goes wrong. I don't know whether the training data simply has no snippets for some of the bodies. Is there a way to extract just the body, id and snippets of the records with type='summary'? Any help is appreciated!

Keyur Potdar
user3568044
  • Why are you using `html.parser` to parse XML? – abarnert May 01 '18 at 04:55
  • I used Beautiful Soup and got a warning. I found the fix at https://stackoverflow.com/questions/33511544/how-to-get-rid-of-beautifulsoup-user-warning. – user3568044 May 01 '18 at 04:59
  • The answer there says to specify the parser you'd like to use. `html.parser` is just one example. And it's obviously not the best example if you want to parse XML rather than HTML. Plenty of XML is not valid as HTML, because they're different languages. See [the documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) for the options. – abarnert May 01 '18 at 05:02
  • Thanks. I will check it. – user3568044 May 01 '18 at 05:11
  • Could this be the reason not all snippets are extracted? Alternatively, can we extract the information based on type='summary', as I pointed out in the question? – user3568044 May 01 '18 at 05:19
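
One way to sidestep the length mismatch is to read each record as a unit instead of building separate lists and pairing them by index. The sketch below swaps in the standard library's `xml.etree.ElementTree` (an actual XML parser) in place of BeautifulSoup. It assumes that 'id', 'type', 'body' and 'snippets' are direct children of one enclosing element per record; the `question` wrapper, the sample data and the `extract_by_type` helper are all hypothetical, since the real file layout is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file's structure may differ.
SAMPLE_XML = """
<root>
  <question>
    <id>q1</id>
    <type>summary</type>
    <body>What is X?</body>
    <snippets>X is a thing.</snippets>
    <snippets>X was found in 1990.</snippets>
  </question>
  <question>
    <id>q2</id>
    <type>factoid</type>
    <body>Where is Y?</body>
    <snippets>Y is in Z.</snippets>
  </question>
  <question>
    <id>q3</id>
    <type>summary</type>
    <body>Why does W happen?</body>
  </question>
</root>
""".strip()


def text_of(element):
    """Return all text inside an element (including nested tags), stripped."""
    return "".join(element.itertext()).strip() if element is not None else ""


def extract_by_type(root, wanted="summary"):
    """Collect id, body and snippets from every element whose direct
    'type' child equals `wanted` (assumes the fields are siblings)."""
    records = []
    for record in root.iter():
        type_el = record.find("type")  # direct children only
        if type_el is None or text_of(type_el) != wanted:
            continue
        records.append({
            "id": text_of(record.find("id")),
            "body": text_of(record.find("body")),
            # A record with no snippets simply yields an empty list.
            "snippets": [text_of(s) for s in record.findall("snippets")],
        })
    return records


root = ET.fromstring(SAMPLE_XML)
summaries = extract_by_type(root)
for rec in summaries:
    print(rec["id"], rec["body"], rec["snippets"])
```

Because body and snippets are looked up inside the same record, a record that has no snippets cannot shift the pairing for the records that follow it. For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(...)`.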
