So I have all these HTML documents that have strings of capital letters in various places: alt attributes, title attributes, link text, etc.

<li><a title='BUY FOOD' href="http://www.example.com/food.html">BUY FOOD</a></li>

What I need to do is lowercase every letter except the first letter of each word, like so:

<li><a title='Buy Food' href="http://www.example.com/food.html">Buy Food</a></li>

Now, how can I do this, either in Python or with some form of regex? I was told that my editor, Coda, could do something like this, but I can't find any documentation on how.

hackthisjay
  • After seeing HTML and regex mentioned in the same question, I have to link to this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – GaretJax Aug 01 '11 at 23:08
  • http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – Paul Aug 01 '11 at 23:14

3 Answers

I think you need an HTML parser like BeautifulSoup; the rest is details.

BrainStorm

There may be noteworthy exceptions for which fully automatic editing is not a good idea, but if you have a regex-capable editor you might search for /[A-Z][A-Z]+/ and replace by hand.
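If your editor's search is awkward to use, a small script can produce the same list of candidates for manual review. This is only a sketch: the helper name `find_caps` is made up for illustration, and it simply reports every run of two or more capitals with its line number.

```python
import re

# The same pattern as above: two or more consecutive capital letters.
CAPS_RUN = re.compile(r"[A-Z][A-Z]+")


def find_caps(text):
    """Return (line_number, matched_text) pairs for every run of capitals."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in CAPS_RUN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "<a title='BUY FOOD'>\nok Here\nSALE now"
    for lineno, word in find_caps(sample):
        print(lineno, word)
```

Nothing is changed in the file; you still decide case by case what to edit.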

Paul

I suggest you use Beautiful Soup to parse your HTML into a tree of tags, then write Python code to walk the tags and body text and change the all-caps strings to title case. You could use a regexp to do the case change, but Python has a built-in string method that will do it:

"BUY FOOD".title()  # returns "Buy Food"
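To show the shape of the parse-then-rewrite idea, here is a rough sketch. Note that it uses the standard library's `html.parser` rather than Beautiful Soup, only so that it runs with no extra installs; with Beautiful Soup you would walk the tree instead of re-emitting events. The names `TitleCaser`, `fix_caps`, `title_case_html` and the `TEXT_ATTRS` set are my own assumptions, not anything from a library.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these are the attributes worth fixing (taken from the question).
TEXT_ATTRS = {"title", "alt"}


def fix_caps(text):
    # Only touch text whose cased letters are all upper-case.
    return text.title() if text.isupper() else text


class TitleCaser(HTMLParser):
    """Re-emit the HTML, title-casing all-caps text and selected attributes."""

    def __init__(self):
        # Leave entities such as &amp; in body text undecoded.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        rendered = []
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                rendered.append(f" {name}")
                continue
            if name in TEXT_ATTRS:
                value = fix_caps(value)
            # Attribute values arrive unescaped, so escape them again.
            rendered.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(rendered)}>")

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        self.out[-1] = self.out[-1][:-1] + " />"

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(fix_caps(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def title_case_html(html):
    parser = TitleCaser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Be aware that this normalizes the markup a little (tag names are lower-cased, attributes come out double-quoted) and that it will also title-case an all-caps `<script>` or `<style>` body, so check the output before overwriting your files.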

If you need a pattern to match strings that are all caps, I suggest you use: "[^a-z]*[A-Z][^a-z]*"

This means "match zero or more of anything except a lower-case character, then a single upper-case character, then zero or more of anything except a lower-case character".

This pattern will correctly match "BUY 99 BEERS", for example. It would not match "so very quiet" because that does not have even a single upper-case letter.
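One detail worth spelling out: the pattern has no anchors, so it only tests "all caps" if it is required to cover the whole string. A quick sketch using `re.fullmatch` (the helper name `is_all_caps` is just for illustration):

```python
import re

# No lower-case letters anywhere, and at least one upper-case letter.
ALL_CAPS = re.compile(r"[^a-z]*[A-Z][^a-z]*")


def is_all_caps(text):
    """True if the pattern covers the entire string."""
    return ALL_CAPS.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("BUY 99 BEERS", "so very quiet", "Buy Food"):
        print(repr(sample), is_all_caps(sample))
```

With a plain `search()` the pattern would still find the "B" in "Buy Food", which is why the whole-string match matters here.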

P.S. You can actually pass a function to re.sub() so you could potentially do crazy powerful processing if you needed it. In your case I think Python's .title() method will do it for you, but here is another answer I posted with information about passing in a function.

How to capitalize the first letter of each word in a string (Python)?
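For reference, passing a function to re.sub() looks roughly like this. It is a sketch, not a drop-in fix: it reuses the /[A-Z][A-Z]+/ pattern suggested in another answer, and the names `lower_tail` and `soften_caps` are made up for the example.

```python
import re

# Runs of two or more capital letters.
CAPS_RUN = re.compile(r"[A-Z][A-Z]+")


def lower_tail(match):
    """Keep the first letter of the matched run and lower-case the rest."""
    word = match.group(0)
    return word[0] + word[1:].lower()


def soften_caps(text):
    # re.sub calls lower_tail once per match and inserts its return value.
    return CAPS_RUN.sub(lower_tail, text)


if __name__ == "__main__":
    print(soften_caps("BUY FOOD now"))
```

Run it on text you have already pulled out of the HTML rather than on the raw markup, otherwise upper-case tag or attribute names would be rewritten too.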

steveha