
I'm trying to extract some URLs from an HTML file using Python. Here is what the text looks like:

preabc!precde<preefg<

I want to extract "cde" and "efg". The patterns I've tried:

pre(.*?)<
pre(.(?!^pre)).*?<

However, neither of them works :(. Note that the real lengths of "cde" and "efg" are unknown. I'm not familiar with regular expressions, so please explain your answers. Many thanks.

EDIT:

Sorry for my bad explanation and ambiguous example. I want to extract titles like "GIRL FRIENDS" with a certain date (2014-7-31 in this case):

<a href="http://rs.xidian.edu.cn/forum.php?mod=viewthread&amp;tid=662128&amp;extra=page%3D1" onclick="atarget(this)" class="s xst">GIRL FRIENDS</a> <span class="tps">&nbsp;...<a href="http://rs.xidian.edu.cn/forum.php?mod=viewthread&tid=662128&amp;extra=page%3D1&amp;page=2">2</a></span> <a href="http://rs.xidian.edu.cn/forum.php?mod=redirect&amp;tid=662128&amp;goto=lastpost#lastpost" class="xi1">New</a> </th> <td class="by"> <cite> <a href="http://rs.xidian.edu.cn/home.php?mod=space&amp;uid=265770" c="1">机器人</a></cite> <em><span><span title="2014-7-31">昨天&nbsp;23:55</span></span></em> </td>

Adam Smith
  • Why the downvote? Could you explain it rather than just downvote? –  Jul 31 '14 at 18:34
  • Is the length of 'abc' known? Are the '<' and '!' present? – f.rodrigues Jul 31 '14 at 18:36
  • Yea, we're going to need to work off of – skamazin Jul 31 '14 at 18:38
  • @hjpotter92 But it outputs "abc!precde" instead of "cde". –  Jul 31 '14 at 18:40
  • What makes "cde" and "efg" different from "pre" and "abc"? Can you provide more examples of input + desired output? – redShadow Jul 31 '14 at 18:45
  • btw, I hope you're not trying to parse HTML using regular expressions.. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – redShadow Jul 31 '14 at 18:47
  • Do you want to extract "New" also or just "GIRL FRIENDS" – skamazin Jul 31 '14 at 18:56
  • @skamazin Just "GIRL FRIENDS". There're many titles in the same "pattern" in the html file. –  Jul 31 '14 at 18:58
  • @Wisatbff why don't you use an HTML parser, such as ``lxml.html``? You'll have a much more robust solution without having to get crazy with hyper-complex regexes. – redShadow Jul 31 '14 at 19:00
  • What distinguishes "GIRL FRIENDS" from all the other titles? Telling us "There're many titles in the same "pattern"" is basically saying regex won't work for you at all – skamazin Jul 31 '14 at 19:00
  • @skamazin The date below is supposed to be matched. –  Jul 31 '14 at 19:01
  • You need to use a parser for this. Look at BeautifulSoup or lxml – Adam Smith Jul 31 '14 at 19:01
  • Ok. I'll try those tools later. Anyway, thanks again. –  Jul 31 '14 at 19:02
  • @Wisatbff Yeah, if you're looking to do this for a very long file or for many files, I would go with a parser and not a regex. But if it's only this one instance, I can find a regex that'll work for you. – skamazin Jul 31 '14 at 19:04
  • Alrighty, try my regex in my answer. Tell me if something goes wrong – skamazin Jul 31 '14 at 19:08

3 Answers


You can try:

 (>([A-Z ]+?)<|title="([\d-]+))


The more specific and less predictable the input gets, the more complicated and unreadable the regex is going to be. I don't suggest using regex for this; try an HTML parser instead.
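For reference, here is a minimal sketch of how the pattern above could be applied with Python's `re` module. The `extract` helper and the shortened `html` string are assumptions made for illustration; the real page is longer. Note that whitespace between two tags also matches `[A-Z ]+?`, so blank matches have to be filtered out, which hints at how fragile this approach is.

```python
import re

# Pattern from this answer: group 2 is an upper-case title between > and <,
# group 3 is the date inside a title="..." attribute.
PATTERN = re.compile(r'(>([A-Z ]+?)<|title="([\d-]+))')

# Shortened stand-in for the HTML in the question (assumption, for brevity).
html = (
    '<a href="forum.php?tid=662128" class="s xst">GIRL FRIENDS</a> '
    '<a href="forum.php?goto=lastpost" class="xi1">New</a> '
    '<em><span><span title="2014-7-31">23:55</span></span></em>'
)


def extract(text):
    """Return (titles, dates) found by PATTERN (hypothetical helper)."""
    titles, dates = [], []
    for m in PATTERN.finditer(text):
        # A lone space between two tags also matches group 2, so skip blanks.
        if m.group(2) and m.group(2).strip():
            titles.append(m.group(2).strip())
        elif m.group(3):
            dates.append(m.group(3))
    return titles, dates


print(extract(html))  # (['GIRL FRIENDS'], ['2014-7-31'])
```

"New" is skipped only because it contains lower-case letters; a title in mixed case would be missed in the same way.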

skamazin
  • +1 the most frequently asked question gets the most frequent answer. http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – msw Jul 31 '14 at 19:12
  • Ok, forget it. You are right. I'll use an HTML parser instead. Thank you all:) –  Jul 31 '14 at 19:12

I think the best answer is not to try to parse HTML with a regex. There are lots of HTML parsing libraries available. Using a regex is only going to cause headaches.
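As a rough illustration, the standard library's `html.parser` is enough for the markup shown in the question. This is only a sketch: `ThreadParser`, `titles_on`, and the shortened `sample` string are hypothetical names, and it assumes each title link carries `class="s xst"` and that the first `<span title="...">` after a title holds that thread's date.

```python
from html.parser import HTMLParser


class ThreadParser(HTMLParser):
    """Collect (title, date) pairs from forum rows (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []           # finished (title, date) pairs
        self._in_title = False   # True while inside <a class="s xst">
        self._title = None       # most recent title still waiting for a date

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "s xst":
            self._in_title = True
            self._title = ""
        elif tag == "span" and "title" in attrs and self._title is not None:
            # Assumption: the first dated <span> after a title belongs to it.
            self.rows.append((self._title, attrs["title"]))
            self._title = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and self._title is not None:
            self._title += data


def titles_on(html, date):
    """Return the titles whose date attribute equals `date`."""
    parser = ThreadParser()
    parser.feed(html)
    parser.close()
    return [title for title, d in parser.rows if d == date]


# Shortened stand-in for the HTML in the question (assumption, for brevity).
sample = (
    '<a href="forum.php?tid=662128" class="s xst">GIRL FRIENDS</a> '
    '<span class="tps"><a href="forum.php?page=2">2</a></span> '
    '<a href="forum.php?goto=lastpost" class="xi1">New</a> '
    '<em><span><span title="2014-7-31">23:55</span></span></em>'
)

print(titles_on(sample, "2014-7-31"))  # ['GIRL FRIENDS']
```

Third-party parsers such as BeautifulSoup or lxml, mentioned in the comments, would make the same lookup shorter, at the cost of an extra dependency.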

Bradley Kaiser

This should do the trick:

pre.*!pre(.*)<pre(.*)<

Explanation:

pre.*! skips the first part (the 'abc'): it starts with pre, has a body of any characters of any length (the .* part matches anything), and ends with a !.

pre(.*)< takes the cde. It does the same as the above, but stores whatever is in the body in capture group 1; the () are capture groups.

pre(.*)< takes the efg. Same as above, but it stores the body in capture group 2.

Note that the ! and both < are what divide the string.
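A quick way to check this in Python, using the sample string from the question (the variable names are just for illustration). The question's first pattern is included for comparison: its lazy `.*?` starts at the first "pre", so the first match also swallows "abc!pre".

```python
import re

text = "preabc!precde<preefg<"

# Pattern from this answer: one capture group per wanted piece.
match = re.search(r"pre.*!pre(.*)<pre(.*)<", text)
groups = match.groups() if match else ()
print(groups)     # ('cde', 'efg')

# First attempt from the question, for comparison.
first_try = re.findall(r"pre(.*?)<", text)
print(first_try)  # ['abc!precde', 'efg']
```

This only works when the text has exactly this shape (one `!`, then two `pre...<` pieces), so it would not carry over to the real HTML in the edited question.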

f.rodrigues
  • nope. [Beware ZA̡͊͠͝LGΌ!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – msw Jul 31 '14 at 19:14