I am trying to create a Python regular expression that can match any name. I am scraping a web page, looking for the <h1> tag, and grabbing the name inside it. The names can look like James Dean, James-Dean, Brian O'Quin, Jame Joe-Harden, etc.

This is the first regular expression I have been working with, but it does not catch all the names:

<h1>[A-Z]{1}[a-z]+\s[A-Z]{1}[']?[A-Z]?[-]?[A-Z]?[a-z]+
Ethan Collins
  • `{1}` is unneeded – depperm Dec 21 '18 at 21:05
  • Don’t use regexp for HTML or [He Comes](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Eb946207 Dec 21 '18 at 21:05
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Eb946207 Dec 21 '18 at 21:05
  • [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Eb946207 Dec 21 '18 at 21:13
  • why not [parse with an html library](https://stackoverflow.com/q/11709079/1358308) then use an [xpath selector](https://stackoverflow.com/a/11466033/1358308) to match the appropriate tags – Sam Mason Dec 21 '18 at 21:17
  • maybe you should explain better what you are looking for. You gave some examples, but didn't say *exactly* which characters you are trying to match – Leonardo Maffei Dec 21 '18 at 21:36
  • @LeonardoMaffei I am looking inside the HTML for something like `<h1>name</h1>`, example [link](http://www.espn.com/college-football/player/_/id/4360076/dylan-oquinn). I am trying to grab the player's name at the top of the page. – Ethan Collins Dec 21 '18 at 21:41

1 Answer

Maybe this:

<h1>(([-'\w]+\s?)+)<h1>

Explaining:

The - and ' match themselves, \w matches letters, digits and underscores, and the + inside the group captures one or more of these characters. A whitespace character after them is optional, to support compound names.
Finally, the last + lets the structure I've just described repeat.
Hope this helps.
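
A minimal sketch of how a pattern like this could be tried against the names from the question. It assumes the closing tag is written as `</h1>` and that the heading is a bare `<h1>` with no attributes or nested tags. The pattern is a slightly restructured variant, not the exact regex above: words are separated by a single whitespace character so the two repetitions cannot overlap. The names `H1_NAME` and `extract_name` are made up for illustration.

```python
import re

# Variant of the answer's pattern (an assumption, not the original regex):
# words built from letters, digits, underscores, hyphens or apostrophes,
# separated by single whitespace characters.
# "/" has no special meaning in Python's re module, so "</h1>" needs no escaping.
H1_NAME = re.compile(r"<h1>([-'\w]+(?:\s[-'\w]+)*)</h1>")


def extract_name(html):
    """Return the name inside the first bare <h1>...</h1>, or None if absent."""
    match = H1_NAME.search(html)
    return match.group(1) if match else None


# Names taken from the question.
samples = ["James Dean", "James-Dean", "Brian O'Quin", "Jame Joe-Harden"]
for name in samples:
    print(extract_name(f"<div><h1>{name}</h1></div>"))
```

A heading such as `<h1 class="name">` or one containing nested tags would not match this pattern, which is one reason an HTML parser is usually the more robust choice.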

Leonardo Maffei
  • This is kind of working. After some testing, the regular expression I have now is `([-\'\w]+\s+\w+)`, and when I try to extract the name **Dylan O'Quin** it returns **Dylan O**. Any suggestions? – Ethan Collins Dec 22 '18 at 19:13
  • Just add a *+* after the last parenthesis. Compare the result on [regex101.com](https://regex101.com) and you will understand your mistake. Basically, this plus will *keep* looking for the pattern `([-\'\w]+\s+\w+)` over and over again. – Leonardo Maffei Dec 22 '18 at 23:15
  • @LeonardoMaffei Thank you for your help! I ended up having too much trouble with regex and found a different solution with an HTML parser and BeautifulSoup. – Ethan Collins Dec 23 '18 at 17:29
  • The last `<h1>` in the answer needs to be `</h1>` though. – Mr Lister Dec 23 '18 at 18:17
  • @MrLister In Python regex the (/) in the closing header tag is an escape character. This is because any unescaped delimiter must be escaped with a backslash (\), otherwise it will break the pattern matching. As before, I found that using an HTML parser with Beautiful Soup was much, much easier lol – Ethan Collins Dec 24 '18 at 07:57
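
The parser-based code the asker ended up with is not shown in the thread. As a rough sketch of the idea, the example below uses the standard library's `html.parser` as a stand-in for BeautifulSoup so it runs without third-party packages; `FirstH1Parser` and `first_h1_text` are hypothetical names. As an aside, `/` is an ordinary character in Python's `re` syntax (regex delimiters belong to other languages and tools), so the closing tag can be written literally in a Python pattern.

```python
from html.parser import HTMLParser


class FirstH1Parser(HTMLParser):
    """Collect the text inside the first <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False   # True while we are between <h1> and </h1>
        self._done = False    # True once the first <h1> has been closed
        self._parts = []      # text fragments found inside the heading

    def handle_starttag(self, tag, attrs):
        # Attributes such as class="..." are ignored; only the tag name matters.
        if tag == "h1" and not self._done:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._done = True

    def handle_data(self, data):
        # Text inside nested tags (e.g. <span>) also arrives here.
        if self._in_h1:
            self._parts.append(data)

    @property
    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join("".join(self._parts).split())


def first_h1_text(html):
    """Return the text of the first <h1> in the document, or None."""
    parser = FirstH1Parser()
    parser.feed(html)
    parser.close()
    return parser.text or None


page = '<html><body><h1 class="name">Dylan <span>O\'Quinn</span></h1></body></html>'
print(first_h1_text(page))
```

With BeautifulSoup installed, roughly the same result could be obtained by calling `find("h1")` on the parsed document and then `get_text(strip=True)` on the returned tag, after checking that the tag was found.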