I have an HTML file with different team names written throughout it, and I just want to grab the team names. They always occur after certain text and end before certain text, so I've used the `split` function to find them. I'm a beginner, and I'm sure I'm making this harder than it is. `data` is the contents of the file.

teams = data.split('team-away">')[1].split("</sp")[0]
for team in teams:
    print team

This prints each individual character of the first team that it finds. For example, if teams = San Francisco 49ers, it prints "S", then "a", and so on, instead of what I need: print "San Francisco 49ers", then on the next line the next team, "Carolina Panthers", etc.

Thank you!
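
For reference, a minimal sketch of why the loop prints characters and how the same `split` idea can collect every team. The expression in the question picks out one string (the first team), and iterating over a string yields its characters. The sample `data` and the helper name `extract_teams` are made up for illustration, and the `<span class="team-away">` markup is only an assumption based on the `team-away">` and `</sp` markers above.

```python
# Hypothetical sample standing in for the real file contents.
data = (
    '<span class="team-away">San Francisco 49ers</span>'
    '<span class="team-away">Carolina Panthers</span>'
    '<span class="team-away">Denver Broncos</span>'
)

# The expression from the question yields ONE string (the first team).
first = data.split('team-away">')[1].split("</sp")[0]
# Iterating over a string yields its characters, hence "S", "a", "n", ...
chars = list(first)[:3]


def extract_teams(data):
    """Collect the text between each 'team-away">' marker and the next '</sp'."""
    teams = []
    # Element 0 of the split is the text before the first marker,
    # so every later element starts with a team name.
    for chunk in data.split('team-away">')[1:]:
        teams.append(chunk.split("</sp")[0])
    return teams


teams = extract_teams(data)
for team in teams:
    print(team)  # one full team name per line
```

The only real change is looping over the pieces produced by the first `split` instead of indexing `[1]`, so `teams` ends up as a list of names rather than a single string.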

sdeep27
  • Parse the HTML. Don't work with it as a string. – Blender Nov 18 '13 at 05:52
  • [Déjà vu](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – shx2 Nov 18 '13 at 06:02
  • Yeah, I wondered how long before someone links to "Tony the Pony". The OP does state "I have a html file...The team names always occur after certain text and end before certain text" Assuming @sdeep27 is correctly describing the problem (and who would know but himself), then plain text search works 100% (but of course, not best practice). – Paul Draper Nov 18 '13 at 07:15

2 Answers

"I'm a beginner, and I'm sure I'm making this harder than it is."

Well, kind of.

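import re
# Non-greedy (.*?) stops at the first "</sp", so several teams on one line are matched separately.
teams = re.findall(r'team-away">(.*?)</sp', data)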

(with credit to Kurtis, for a simpler regular expression than I originally had)

Though an actual HTML parser would be best practice.
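
As a rough illustration of the parser route, here is a sketch using Python 3's standard-library `html.parser`. The class name `TeamParser`, the sample markup, and the assumption that each name sits in an element whose `class` includes `team-away` are all hypothetical, since the real file isn't shown.

```python
from html.parser import HTMLParser


class TeamParser(HTMLParser):
    """Collect the text of every element whose class list includes 'team-away'."""

    def __init__(self):
        super().__init__()
        self.teams = []
        self._team_tag = None  # tag name of the team element being read, if any

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "team-away" in classes:
            self._team_tag = tag
            self.teams.append("")

    def handle_endtag(self, tag):
        # Stop collecting once the team element closes.
        if tag == self._team_tag:
            self._team_tag = None

    def handle_data(self, data):
        if self._team_tag is not None:
            self.teams[-1] += data


# Hypothetical sample markup; the real page structure is an assumption.
html = (
    '<div><span class="team-away">San Francisco 49ers</span> at '
    '<span class="team-home">Carolina Panthers</span></div>'
    '<div><span class="team-away">Denver Broncos</span></div>'
)

parser = TeamParser()
parser.feed(html)
parser.close()
teams = parser.teams
print(teams)
```

Unlike the string search, this keys on the element's attributes rather than the exact characters around the name, so it should keep working if attribute order or whitespace in the file changes.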

Paul Draper
  • Out of curiosity, why is this regex superior to re.findall('team-away">(.*?) – Curt Nov 18 '13 at 07:05
  • @Kurtis, you are right. I had assumed that `findall` would match the *entire* regex, so I used lookbehind and lookahead. But if exactly one capture group exists, the returned match is only that group. Improving answer now. – Paul Draper Nov 18 '13 at 07:10
  • so I went and looked over some re to understand your answer, and this does tackle it, thank you. Quick question though, as I only need the letters returned and not numbers, I tried (\w*) instead of (.*) and it returned an empty list. Do you know why this could be, or is \w the wrong expression to use? – sdeep27 Nov 18 '13 at 08:07
  • I'm not sure exactly what you mean. Perhaps `([A-Za-z]*).*`? – Paul Draper Nov 18 '13 at 16:29
  • isn't \w the regex identifier for all letters? How come we use the format [A-Za-z] instead? – sdeep27 Nov 20 '13 at 18:59
  • @sdeep27, `\w` is all "word" characters: letters, digits, and underscore (`[A-Za-z0-9_]` in ASCII). So it also matches numbers, but not spaces. – Paul Draper Nov 21 '13 at 08:17
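
To make the `\w` discussion in the comments concrete, here is a small sketch, assuming sample data of the same shape as in the question (the markup is hypothetical). `(\w*)` most likely returned an empty list because team names contain spaces, which `\w` does not match, so the pattern never reaches `</sp`.

```python
import re

# Hypothetical sample; the real file's markup is an assumption.
data = (
    '<span class="team-away">San Francisco 49ers</span>'
    '<span class="team-away">Carolina Panthers</span>'
)

# \w matches letters, digits and underscore, but not spaces.
word_chars = re.findall(r'\w+', '49ers_x y')

# \w* stops at the first space, "</sp" does not follow, so nothing matches.
word_only = re.findall(r'team-away">(\w*)</sp', data)

# Non-greedy "any character" captures the whole name, digits included.
any_chars = re.findall(r'team-away">(.*?)</sp', data)

# Letters and spaces only: the capture stops before the first digit.
letters_only = [
    name.strip() for name in re.findall(r'team-away">([A-Za-z ]*)', data)
]

print(word_only)
print(any_chars)
print(letters_only)
```

The last pattern drops the trailing `</sp` on purpose: once digits are excluded from the class, the capture for "San Francisco 49ers" ends at "San Francisco ", which is why the result is stripped.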
Don't reinvent the wheel! Look into BeautifulSoup; it'll do the job for you.

Steinar Lima