Python extract text from xml

Question

I want to use Python to extract the text from an XML file that contains tags, including tags nested within other tags.

This is what my file looks like:

<p>blablabla</p>
<p>blablabla / blablabla,</p>
<p>blablabla</p>
<p>blablabla / blablabla / blablabla</p>
<p>blablabla.</p>

First I want to find whole entries (one whole entry in the file looks like the one above), then split each entry into parts at every "/", and finally remove all remaining "<p>" and "</p>" tags.

Here is how I think this could be done (Python 2.7):

first_results = []

lines = open(sys.argv[1])

for l in lines:
    re.match(r'<p>[\s\S]*?\.<\/p>', l)
    l = l.split("/")
    first_results.append(l)

for b in first_results:
    b = re.sub(r'(<p>)|(</p>)', r'', b)

My question is: this is somehow not working properly. I can match my entries with the regex, but I am not sure how to do the rest. Is there a better way to do this? In the end I want the text split at each "/" and separated by tabs, something like this:

blablabla   blablabla   blablabla   blablabla   blablabla etc.

What would be the best method to do this? I should say that I am new to Python, but already a big fan :)

Please post a real example of HTML – Andrés Pérez-Albela H. Nov 30 '15 at 17:43

score 0 · Answer 1 · edited May 23 '17 at 12:15

First off, see this post: RegEx match open tags except XHTML self-contained tags. It is highly relevant to your situation.

Secondly, Python has a very nice XML parser in the xml package that ships with the language.
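
As a rough illustration of the parser-based approach, here is a minimal sketch using xml.etree.ElementTree from the standard library. It assumes that one entry is a run of sibling <p> elements like the sample in the question; because that sample has no single root element, the sketch wraps it in a made-up <root> tag before parsing. The function name entry_to_tsv and the wrapper tag are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET


def entry_to_tsv(fragment):
    """Return the text of all top-level <p> elements, split on "/" and tab-joined."""
    # Assumption: the fragment has no single root element, so wrap it in a
    # hypothetical <root> tag to make it well-formed XML.
    root = ET.fromstring("<root>" + fragment + "</root>")
    parts = []
    for p in root.findall("p"):
        # itertext() also collects text from any tags nested inside this <p>.
        text = "".join(p.itertext())
        for piece in text.split("/"):
            piece = piece.strip()
            if piece:  # skip empty pieces, e.g. from a trailing "/"
                parts.append(piece)
    return "\t".join(parts)


sample = """<p>blablabla</p>
<p>blablabla / blablabla,</p>
<p>blablabla</p>
<p>blablabla / blablabla / blablabla</p>
<p>blablabla.</p>"""

print(entry_to_tsv(sample))
```

The parser handles the tags, so no regex is needed to strip "<p>" and "</p>"; the only string work left is the split on "/". If the real file is already well-formed with its own root element, ET.parse(filename) could be used instead of the wrapper.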

edited May 23 '17 at 12:15 by Community

answered Nov 30 '15 at 17:47 by Mad Physicist
