I wrote a Python script that indexes data into Elasticsearch. The data is stored in several folders of XML files. When I run the script, only one folder gets indexed, and then I get this error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 5933, column 23

My Python code:

import os
from xml.etree import ElementTree
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
def start(path):
    tree = ElementTree.parse(path)
    root = tree.getroot()
    print(root)
    docs = tree.findall('.//DOC')
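    for doc in docs:
        title = doc.find('TITLE').text
        text = doc.find('TEXT').text
        all_dates = doc.findall('DATE')
        date = ''
        for d in all_dates:
            # .get() avoids a KeyError when a DATE element has no "calender" attribute
            if d.attrib.get("calender") == "Western":
                date = d.text
                break

        # Build and index one record per DOC element, inside the loop;
        # outside the loop only the last DOC of each file would be indexed.
        record = {
            'title': title,
            'text': text,
            'date': date
        }
        insert_to_elastic(record)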

def insert_to_elastic(doc):
    es.index(index='hamshahri', doc_type='document', body=doc)

def xml_files():
    folders = ['2002','2003','2004','2005']
    xml_list = []
    for item in folders:
        path = '/home/course/web/Hamshahri/' + item
        for filename in os.listdir(path):
            fullname = os.path.join(path, filename)
            if fullname.endswith('.xml'):
                xml_list.append(fullname)
    return xml_list

xmls = xml_files()
for path in xmls:
    print(path)
    start(path)
  • 4
    What is on line 5933 in the XML file it is reading when you get the error? – Tyler V Jul 13 '18 at 16:43
  • 1
    The error is telling you that your XML is invalid. We can’t debug this without seeing the XML. Ideally, reduce the file to the smallest piece of it that reproduces that error, and reduce your code to just the code that tries to parse that XML rather than your whole program, and post that as a [mcve]. At an absolute minimum, you have to give us enough of the file to diagnose the problem. – abarnert Jul 13 '18 at 17:01
  • Similar https://stackoverflow.com/q/51049975/5320906 – snakecharmerb Jul 13 '18 at 17:09
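
To answer the question in the comments about what is on line 5933, something like the following might help. It is only a minimal sketch: `describe_parse_error` is a hypothetical helper name, and the sample input with a bare `&` is an assumed example of malformed XML, not necessarily what is wrong in the Hamshahri files. It relies on `ParseError.position`, which holds the (line, column) pair reported in the error message.

```python
from xml.etree import ElementTree


def describe_parse_error(xml_bytes):
    """Return None if xml_bytes parses, else details about the failure.

    Hypothetical helper: takes the raw bytes of one XML file.
    """
    try:
        ElementTree.fromstring(xml_bytes)
    except ElementTree.ParseError as err:
        # position is the (line, column) pair shown in the error message
        line_no, column = err.position
        lines = xml_bytes.splitlines()
        # Line numbers are 1-based; guard against an out-of-range value
        if 0 < line_no <= len(lines):
            bad_line = lines[line_no - 1].decode('utf-8', errors='replace')
        else:
            bad_line = ''
        return {
            'line': line_no,
            'column': column,
            'text': bad_line,
            'message': str(err),
        }
    return None


# Assumed sample: an unescaped '&' is one common way XML ends up malformed
sample = b"<ROOT>\n<DOC>\n<TITLE>A & B</TITLE>\n</DOC>\n</ROOT>"
info = describe_parse_error(sample)
if info is not None:
    print(info['message'])
    print(info['text'])
```

In the script above, this could be called on the contents of each path in `xmls` (read with `open(path, 'rb')`) before `start(path)`, so that a malformed file is reported and skipped instead of stopping the whole run.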

0 Answers