I want to use Python to extract the text from an XML file that contains tags, including tags nested within other tags.
This is what my file looks like:
<p>blablabla</p>
<p>blablabla / blablabla,</p>
<p>blablabla</p>
<p>blablabla / blablabla / blablabla</p>
<p>blablabla.</p>
First I want to find whole entries (one whole entry in the file looks like the one above), then split each entry into parts at every "/", and finally remove all remaining "<p>" and "</p>" tags.
Here is how I think this could be done (Python 2.7):
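import re
import sys

first_results = []
lines = open(sys.argv[1])
for l in lines:
    # try to match a whole entry
    re.match(r'<p>[\s\S]*?\.<\/p>', l)
    # split the entry at each "/"
    l = l.split("/")
    first_results.append(l)
for b in first_results:
    # strip the remaining <p> and </p> tags
    b = re.sub(r'(<p>)|(</p>)', r'', b)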
My question: this is somehow not working properly. I can match my entries with the regex, but I am not sure how to do the rest. Is there a better way to do this? In the end I want the text split at every "/" and separated by tabs, something like this:
blablabla	blablabla	blablabla	blablabla	blablabla	etc...
What would be the best method to do this? I should add that I am new to Python, but already a big fan :)
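For reference, the steps described above could be sketched roughly like this. It is not a definitive solution, and it rests on a few assumptions that are not stated in the question: an entry runs from a "<p>" to the first paragraph that ends with ".</p>", every "<p>" paragraph is its own field even when it contains no "/", and nested tags are inline tags rather than further "<p>" tags. The names `entry_to_fields` and `extract_rows` are made up for illustration.

```python
import re

ENTRY_RE = re.compile(r'<p>[\s\S]*?\.</p>')   # one entry: up to the first ".</p>"
PARA_RE = re.compile(r'<p>([\s\S]*?)</p>')    # text inside one <p>...</p>
TAG_RE = re.compile(r'<[^>]+>')               # any remaining (nested) tag


def entry_to_fields(entry):
    """Return the text parts of one entry, split at every "/"."""
    fields = []
    for para in PARA_RE.findall(entry):
        para = TAG_RE.sub('', para)           # drop tags nested inside the paragraph
        for part in para.split('/'):
            part = part.strip()
            if part:                          # skip empty pieces
                fields.append(part)
    return fields


def extract_rows(text):
    """Return one tab-separated line per entry found in text."""
    return ['\t'.join(entry_to_fields(e)) for e in ENTRY_RE.findall(text)]


sample = """<p>blablabla</p>
<p>blablabla / blablabla,</p>
<p>blablabla</p>
<p>blablabla / blablabla / blablabla</p>
<p>blablabla.</p>"""

for row in extract_rows(sample):
    print(row)
```

To run it on a file, the whole file would be read at once (for example `open(sys.argv[1]).read()`) and passed to `extract_rows`, so that entries spanning several lines are seen in one piece. If the file is well-formed XML with a single root element, a real XML parser would probably be more robust than regular expressions.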