
I want to use Python to extract the text from an XML file that contains tags, including tags nested within other tags.

This is what my file looks like:

<p>blablabla</p>
<p>blablabla / blablabla,</p>
<p>blablabla</p>
<p>blablabla / blablabla / blablabla</p>
<p>blablabla.</p>

First I want to find whole entries (one whole entry in the file looks like the one above), then split each entry into parts at every "/", and finally remove all remaining "<p>" and "</p>" tags.

Here is how I think this could be done (Python 2.7):

import re
import sys

first_results = []

lines = open(sys.argv[1])

for l in lines:
    re.match(r'<p>[\s\S]*?\.<\/p>', l)
    l = l.split("/")
    first_results.append(l)

for b in first_results:
    b = re.sub(r'(<p>)|(</p>)', r'', b)

My question is: this is somehow not working properly. I can match my entries with the regex, but I am not sure how to do the rest. Is there a better way to do this? In the end I want the text split at "/" and separated by tabs, something like this:

blablabla   blablabla   blablabla   blablabla   blablabla etc...

What would be the best method to do this? I should mention that I am new to Python, but already a big fan :)

El_Patrón

1 Answer

First off, see this post: RegEx match open tags except XHTML self-contained tags. It is highly relevant to your situation.

Secondly, Python ships with a very nice XML parser in the standard library's xml package (for example, xml.etree.ElementTree), which handles nested tags for you.
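As a rough sketch of what that could look like with xml.etree.ElementTree: the function name entry_to_tsv is made up for illustration, and the code assumes the <p> elements are not nested inside each other. Because the sample in the question has several top-level <p> elements, the sketch wraps the text in a synthetic <root> element so that it parses as well-formed XML; if your real file already has a single root element, that wrapper is unnecessary.

```python
import xml.etree.ElementTree as ET


def entry_to_tsv(fragment):
    """Return the text of every <p> in `fragment`, split on "/" and tab-joined.

    `entry_to_tsv` is a hypothetical helper, not part of any library.
    """
    # Assumption: the fragment has no single root element, so add one.
    root = ET.fromstring("<root>" + fragment + "</root>")
    parts = []
    for p in root.iter("p"):
        # itertext() also picks up text from tags nested inside <p>.
        text = "".join(p.itertext())
        # Split at every "/" and trim the surrounding whitespace.
        parts.extend(piece.strip() for piece in text.split("/"))
    # Drop empty pieces and separate the rest with tabs.
    return "\t".join(part for part in parts if part)


sample = """<p>blablabla</p>
<p>blablabla / blablabla,</p>
<p>blablabla</p>"""
print(entry_to_tsv(sample))
```

To run it on a file as in the question, you could read the file contents with open(sys.argv[1]).read() and pass the resulting string to the function. No regex is needed, since the parser removes the tags itself.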

Mad Physicist