How to use the xpath to parse the director part from the html with python 3

Question

I want to extract the director's name (such as tom) from the following HTML using XPath in Python 3. This is only an excerpt; the full page is at http://movie.walkerplus.com/list/2015/12/. Please help me solve this issue. Thanks in advance!

  <title> ufffff</title>
  <div class="hiragana">2015<br>Dec 1st</br></div>
  <div class="movies">
  <div class="movie">
  <h3><a href="/mv57512/">007</a></h3>
  <dl class="directorList">
  <dt>director</dt>
  <dd>
  <a href="/person/152394/" title="">bruce</a>
  </dd>
  </dl>
  </div>
  </div>
  <div class="movies">
  <div class="movie">
  <h3><a href="/mv57512/">wind love</a></h3>
  <dl class="directorList">
  <dt>director</dt>
   <dd>
   <a href="/person/152394/" title="">tom</a>
   </dd>
   </dl>
   <div class="movies">
   <div class="movie">
   <h3><a href="/mv57512/">river war</a></h3>
   <dl class="directorList">
   <dt>director</dt>
   <dd>
   <a href="/person/152394/" title="">July</a>
   </dd>
   </dl>
   </div>
   </div>
   <div class="mwb">
   <div class="hiraganaLocalNavi">
   <ul class="page_12">
   <li class="text">o</li>
   <li><a class="m01" href="/list/2015/01/">1月</a></li>
   <li><a class="m02" href="/list/2015/02/">2月</a></li>
   <li><a class="m03" href="/list/2015/03/">3月</a></li>
   <li><a class="m04" href="/list/2015/04/">4月</a></li>
   <li><a class="m05" href="/list/2015/05/">5月</a></li>
   <li><a class="m06" href="/list/2015/06/">6月</a></li>
   <li><a class="m07" href="/list/2015/07/">7月</a></li>
   <li><a class="m08" href="/list/2015/08/">8月</a></li>
   <li><a class="m09" href="/list/2015/09/">9月</a></li>
   <li><a class="m10" href="/list/2015/10/">10月</a></li>
   <li><a class="m11" href="/list/2015/11/">11月</a></li>
   <li><a class="m12" href="/list/2015/12/">12月</a></li>
   </ul>
    </div>
    </div>
..................

Why do you want to approach the problem with regexes and not `BeautifulSoup`? Thanks. — alecxe, Jun 03 '16 at 04:18
Thanks for your comments! I want to learn to parse HTML with Python 3 regex, since I will need regex skills in my upcoming work. — Ke Tian, Jun 03 '16 at 04:22
Okay, sure, just keep in mind that using regular expressions to parse HTML is generally a very bad idea. Please see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. — alecxe, Jun 03 '16 at 04:23
For now, I just want to use regex to parse the HTML. If you can do it, please give me advice about my code. — Ke Tian, Jun 03 '16 at 05:42

score 2 · Accepted Answer · answered Jun 03 '16 at 09:25

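Definitely use lxml for this instead. Like this:

from io import StringIO
from lxml import etree

# etree.parse defaults to a strict XML parser, so pass an HTML parser explicitly
tree = etree.parse(StringIO(your_html_text), etree.HTMLParser())
# Match every element whose class attribute contains the token "movies"
what_you_are_looking_for = tree.xpath("//*[contains(concat(' ', @class, ' '), ' movies ')]")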

This is a very robust way of getting the data you want, and it tolerates messy real-world input (missing tags in the HTML, data moving around, etc.) much better than a regular expression.

You can read more about it here. Cheers!
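To go from the movie blocks to the director names themselves, something like the following sketch may help. It assumes the real page follows the same structure as the excerpt in the question; `SAMPLE` is a trimmed copy of that excerpt and `extract_directors` is a hypothetical helper name, not part of lxml.

```python
from lxml import html

# Trimmed copy of the sample markup from the question (assumption: the real
# page uses the same structure).
SAMPLE = """
<div class="movies">
  <div class="movie">
    <h3><a href="/mv57512/">007</a></h3>
    <dl class="directorList">
      <dt>director</dt>
      <dd><a href="/person/152394/" title="">bruce</a></dd>
    </dl>
  </div>
</div>
<div class="movies">
  <div class="movie">
    <h3><a href="/mv57512/">wind love</a></h3>
    <dl class="directorList">
      <dt>director</dt>
      <dd><a href="/person/152394/" title="">tom</a></dd>
    </dl>
  </div>
</div>
"""


def extract_directors(page_source):
    """Return a list of (movie title, [director names]) tuples."""
    root = html.fromstring(page_source)
    results = []
    for movie in root.xpath('//div[@class="movie"]'):
        # string(...) collapses the <a> element to its text content.
        title = movie.xpath('string(./h3/a)').strip()
        # Relative path: only the directorList that is a direct child
        # of this movie block.
        names = [
            name.strip()
            for name in movie.xpath('./dl[@class="directorList"]/dd/a/text()')
        ]
        results.append((title, names))
    return results


movies = extract_directors(SAMPLE)
for title, names in movies:
    print(title, names)
```

Querying relative to each `div.movie` keeps every director paired with the right title, which a single page-wide `//dd/a/text()` query would not guarantee.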

score 1 · Answer 2 · answered Jun 03 '16 at 05:16

Read the link provided by alecxe. You are running into exactly that problem.

Your raw string contains spaces that do not occur in the sample HTML.
Quotes that match the string's own delimiter need to be escaped, or matched with '.'.
You need to set the re.S (re.DOTALL) flag for multi-line input, because '.' does not match newlines by default; re.M only changes how ^ and $ behave.

Regex and HTML are a match destined for madness.
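If you still want to try it with a regular expression, here is a minimal sketch of the flag issue described above. The pattern is a hypothetical one written for this excerpt only (it is not the asker's original code), and it will break as soon as the markup changes shape.

```python
import re

# One directorList block copied from the question's sample HTML.
SAMPLE = '''<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">tom</a>
</dd>
</dl>'''

# Hypothetical pattern: skip lazily from the <dl> to the first <a ...> and
# capture its text. Single-quoted raw string, so the double quotes inside
# need no escaping.
PATTERN = r'<dl class="directorList">.*?<a [^>]*>([^<]+)</a>'

# Without flags, '.' stops at the first newline, so nothing matches.
without_flag = re.findall(PATTERN, SAMPLE)

# re.M only changes ^ and $, so it does not help here either.
with_multiline = re.findall(PATTERN, SAMPLE, re.M)

# re.S (DOTALL) lets '.' cross line breaks.
with_dotall = re.findall(PATTERN, SAMPLE, re.S)

print(without_flag, with_multiline, with_dotall)
```

Only the `re.S` call finds the name, which is why the flag choice matters for HTML that spans several lines.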

WombatPM


Thanks, I just want to use regex to extract the div part mentioned in the question above. Could you give your regex code to solve this issue, please? – Ke Tian Jun 03 '16 at 06:07
