1

I am trying to parse the Yahoo Answers feed at http://answers.yahoo.com/rss/allq. The problem is that every title starts with a prefix I do not want:

[ Category ] : Open Question :

I want to write a regexp that removes it. Anything that strips everything from the opening [ through the first : should do.

There is also a space after the :, which needs to be removed too.

Thanks in advance; I will keep looking for a solution myself as well.

Dylan Corriveau
foxybagga

2 Answers

1

The following regex should do the job:

^\[.*?: 

Usage sample in C#:

string resultString = Regex.Replace(subjectString, @"^\[.*?: ", "");

It anchors at the start of the title, matches the opening [ bracket, takes as few characters as possible up to the first :, and then takes the following space.
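Since the question does not name a language, here is a rough Python sketch of the same replacement using the standard `re` module. The function names `strip_prefix` and `strip_both`, the sample title, and the second pattern are assumptions made up for illustration; only the first pattern comes from this answer.

```python
import re

# Pattern from the answer: a leading "[", then as few characters as possible
# up to the first ":", plus the single space that follows it.
PREFIX = re.compile(r"^\[.*?: ")

# Assumed variant (not from the answer): also consume the status label
# between the first and second colon, e.g. "Open Question : ".
BOTH = re.compile(r"^\[[^\]]*\]\s*:\s*[^:]*:\s*")


def strip_prefix(title):
    """Remove text from the leading "[" through the first ": ", if present."""
    return PREFIX.sub("", title, count=1)


def strip_both(title):
    """Remove "[ Category ] : Status : " in one go (hypothetical helper)."""
    return BOTH.sub("", title, count=1)


# Hypothetical title shaped like the sample in the question.
sample = "[ Pets ] : Open Question : Why does my cat sneeze?"
print(strip_prefix(sample))
print(strip_both(sample))
```

Note that if the titles really contain two colons, as in the sample from the question, the answer's pattern stops at the first one and leaves "Open Question : " in place. The second pattern is one possible way to take that label out as well, assuming the label itself never contains a colon.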

Hope this helps, Tom.

Thanks to @cmptrgeekken for pointing out the non-greedy issue!

RoXX
    Might want to make that `.*?` so it's a non-greedy match. Otherwise, if the title itself has a colon in it, this regex would remove everything up to the second colon – cmptrgeekken Sep 11 '10 at 15:35
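To illustrate the point in the comment above, this small Python sketch compares the greedy and non-greedy forms of the pattern. The title is a made-up example that contains a colon of its own; it is not taken from the real feed.

```python
import re

# Hypothetical title whose own text contains a colon.
title = "[ Travel ] : Paris: where to stay?"

# Greedy: .* grabs as much as it can, so the match runs to the LAST ": ".
greedy = re.sub(r"^\[.*: ", "", title)

# Non-greedy: .*? grabs as little as it can, so the match ends at the FIRST ": ".
lazy = re.sub(r"^\[.*?: ", "", title)

print(greedy)
print(lazy)
```

The greedy version eats part of the actual title ("Paris: "), while the non-greedy version removes only the bracketed category.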
1

Have you considered using Yahoo's YQL service to parse this feed (or other web pages)?

They already have sample queries for you to get at Yahoo Answers data:

(Just an FYI in case you weren't aware of this convenient service. I use it instead of screen scraping with regexes.)

JohnB