
So here is my RegEx:

re.findall(r'(?<=data-author=")(.*)(?=" data-author-fullname)', string)

I am trying to extract the username, in this case "zCourge_iDX", but for some reason my regex picks up everything until the next instance of "data-author-fullname". I can include more info if necessary.

"zCourge_iDX" data-author-fullname="t2_6ups9" ><p class="parent"><a name="d4gqsup"></a></p><div class="midcol unvoted" ><div class="arrow up login-required access-required" data-event-action="upvote" role="button" aria-label="upvote" tabindex="0" ></div><div class="arrow down login-required access-required" data-event-action="downvote" role="button" aria-label="downvote" tabindex="0" ></div></div><div class="entry unvoted"><p class="tagline"><a href="javascript:void(0)" class="expand" onclick="return togglecomment(this)">[–]</a><a href="https://www.reddit.com/user/zCourge_iDX" class="author may-blank id-t2_6ups9" >zCourge_iDX</a><span class="userattrs"></span>&#32;<span class="score dislikes">0 points</span><span class="score unvoted">1 point</span><span class="score likes">2 points</span>&#32;<time title="Mon Jun 20 15:50:56 2016 UTC" datetime="2016-06-20T15:50:56+00:00" class="live-timestamp">12 minutes ago</time>&nbsp;<a href="javascript:void(0)" class="numchildren" onclick="return togglecomment(this)">(2 children)</a></p><form action="#" class="usertext warn-on-unload" onsubmit="return post_form(this, 'editusertext')" id="form-t1_d4gqsupk6x"><input type="hidden" name="thing_id" value="t1_d4gqsup"/><div class="usertext-body may-blank-within md-container "><div class="md"><p>Have you seen the box office reports?</p>
</div>
</div></form><ul class="flat-list buttons"><li class="first"><a href="https://www.reddit.com/r/movies/comments/4oygzu/warcraft_is_now_the_biggest_video_game_movie_of/d4gqsup" data-event-action="permalink" class="bylink" rel="nofollow" >permalink</a></li><li><a href="javascript:void(0)" data-comment="/r/movies/comments/4oygzu/warcraft_is_now_the_biggest_video_game_movie_of/d4gqsup" data-media="www.redditmedia.com" data-link="/r/movies/comments/4oygzu/warcraft_is_now_the_biggest_video_game_movie_of/" data-root="false" data-title="Warcraft is now the biggest video game movie of all-time" class="embed-comment" >embed</a></li><li class="comment-save-button save-button"><a href="javascript:void(0)">save</a></li><li><a href="#d4gqo0n" data-event-action="parent" class="bylink" rel="nofollow" >parent</a></li><li class="report-button"><a href="javascript:void(0)" class="reportbtn access-required" data-event-action="report">report</a></li><li class="give-gold-button"><a href="/gold?goldtype=gift&months=1&thing=t1_d4gqsup" title="give reddit gold in appreciation of this post." class="give-gold login-required access-required" data-event-action="gild" >give gold</a></li><li class="reply-button"><a class="access-required" href="javascript:void(0)" data-event-action="comment" onclick="return reply(this)">reply</a></li></ul><div class="reportform report-t1_d4gqsup"></div></div><div class="child"><div id="siteTable_t1_d4gqsup" class="sitetable listing"><div class=" thing id-t1_d4gqxdy noncollapsed &#32; comment " id="thing_t1_d4gqxdy" onclick="click_thing(this)" data-fullname="t1_d4gqxdy" data-type="comment" data-subreddit="movies" data-subreddit-fullname="t5_2qh3s" data-author="Serialdan </b>
melpomene
gseelig
  • [This answer](http://stackoverflow.com/a/1732454/4179728) may be relevant. – puzzlepalace Jun 20 '16 at 19:03
  • Go to [regex101.com](https://regex101.com/) and try out some regexes. Perhaps simplify your regex above first, get that simple version working, and then add to it. – Bulrush Jun 20 '16 at 19:15
  • I do not understand why you used the lookbehind. Try this `(".*?") data-author-fullname` – Oren Jun 20 '16 at 19:16
  • Why are you trying to parse HTML with a regex? – melpomene Jun 20 '16 at 19:24
  • It looks like the author name comes from `data-author-fullname="-->here"`, not the other way around. –  Jun 20 '16 at 19:27

1 Answer


As mentioned by Oren in the comments, it is unclear why you are using a lookbehind in the regex.

Try

 re.findall(r'.*"(.+?)"\s+data-author-fullname', string)

The non-greedy match will pick up the usernames, but I would still recommend using something other than regex for parsing HTML, such as mechanize or BeautifulSoup.
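To make the greedy versus non-greedy difference concrete, here is a minimal sketch. It assumes a shortened, made-up sample modelled on the markup in the question (the second `data-author-fullname` value is invented). It compares the original greedy pattern, a lazy variant, the pattern above with its leading `.*`, and a negated character class. Since the libraries mentioned above are third-party, the last part uses the standard library's `html.parser` as a stand-in; the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical, shortened sample modelled on the question's markup.
# The second data-author-fullname value is invented for illustration.
SAMPLE = (
    '<div data-type="comment" data-author="zCourge_iDX" '
    'data-author-fullname="t2_6ups9"><p>Have you seen the box office reports?</p>'
    '<div data-type="comment" data-author="Serialdan" '
    'data-author-fullname="t2_abc12"></div></div>'
)

# Greedy (.*) runs to the LAST '" data-author-fullname' on the line,
# so both comments are swallowed into a single oversized match.
greedy = re.findall(r'(?<=data-author=")(.*)(?=" data-author-fullname)', SAMPLE)

# Lazy (.*?) stops at the first place the lookahead succeeds.
lazy = re.findall(r'(?<=data-author=")(.*?)(?=" data-author-fullname)', SAMPLE)

# A leading greedy .* consumes earlier matches on the same line,
# so only the last username on each line is returned.
leading_dot_star = re.findall(r'.*"(.+?)"\s+data-author-fullname', SAMPLE)

# A negated character class cannot cross the closing quote,
# so no lookarounds or lazy quantifiers are needed.
negated = re.findall(r'data-author="([^"]*)"', SAMPLE)


class AuthorCollector(HTMLParser):
    """Collects the value of every data-author attribute."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        for name, value in attrs:
            if name == "data-author" and value is not None:
                self.authors.append(value)


def authors_via_parser(html):
    collector = AuthorCollector()
    collector.feed(html)
    collector.close()
    return collector.authors


parsed = authors_via_parser(SAMPLE)

print(len(greedy))        # 1 (one match spanning both comments)
print(lazy)               # ['zCourge_iDX', 'Serialdan']
print(leading_dot_star)   # ['Serialdan']
print(negated)            # ['zCourge_iDX', 'Serialdan']
print(parsed)             # ['zCourge_iDX', 'Serialdan']
```

The parser-based version does not depend on attribute order or spacing, which is the main reason to prefer it over a regex if the page layout changes.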

minocha