Regular expression to strip everything but words

Question

I'm hopeless with regular expressions, so please help me with this problem.

Basically, I am downloading web pages and RSS feeds and want to strip everything except plain words: no periods, commas, ifs, ands, or buts. I also have a list of the most common English words that I want to strip too, but I think I know how to do that without a regular expression, which would be far too long.

How do I strip everything from a chunk of text except words that are delimited by spaces? Everything else goes in the trash.

This works quite well, thanks to Pavel: .split(/[^[:alpha:]]/).uniq!

Nokogiri is probably the best solution here because it is an HTML parser, and I guess one shouldn't be using regex to do this. – thenengah, Aug 23 '10 at 16:32

score 3 · Accepted Answer · edited May 23 '17 at 11:48


I think what fits you best is splitting the string into words. For that, String#split is the better option. It accepts a regexp that matches the separators, and returns the pieces of the source string between those separators as array elements.

In your case, the separator should be "one or more non-alphabetic characters". The alphabetic character class is denoted by [:alpha:], so here is an example of what you need:

irb(main):001:0> "asd, < er >w , we., wZr,fq.".split(/[^[:alpha:]]+/)
=> ["asd", "er", "w", "we", "wZr", "fq"]

You may further filter the result by intersecting the resulting array with an array that contains only English words:

irb(main):001:0> ["asd", "er", "w", "we", "wZr", "fq"] & ["we","you","me"]
=> ["we"]
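The question asks for the opposite filter as well: dropping the most common words rather than keeping only listed ones. A minimal sketch of that, assuming array difference (`Array#-`) is acceptable and using a made-up `STOP_WORDS` list and a hypothetical helper name `content_words` (neither appears in the original answer):

```ruby
# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = %w[the a an and or but if we].freeze

# Hypothetical helper: split into lowercase words, then remove common words.
def content_words(text, stop_words = STOP_WORDS)
  words = text.downcase.split(/[^[:alpha:]]+/).reject(&:empty?)
  (words - stop_words).uniq
end

p content_words("We saw the cat, and the cat saw us.")
# prints ["saw", "cat", "us"]
```

Downcasing first means "We" and "we" are treated as the same word; whether that is wanted depends on the word list being used.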


answered Aug 21 '10 at 19:53

P Shved


OK, that worked quite well, but I'm getting a ton of empty strings in the array – thenengah Aug 21 '10 at 19:56
@Sam, perhaps you could find helpful information in the `split` documentation? It should contain tips about the situations in which empty strings appear. – P Shved Aug 21 '10 at 19:59

If you want to retrieve words in other languages (with characters like 'ä', 'ß', 'ǹ', etc.), use this regex in your split call: /[^\p{Alpha}]+/ – Ragmaanir Aug 21 '10 at 22:13
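The empty strings mentioned in the comments can be reproduced when the text starts with a separator. The sketch below shows one such case and two possible ways around it; the sample strings and variable names are made up for illustration, and `reject` and `scan` are suggestions rather than something the answer's author proposed:

```ruby
text = ", asd, < er >w"

# A separator at the very start leaves an empty first element.
with_empty = text.split(/[^[:alpha:]]+/)
p with_empty                      # ["", "asd", "er", "w"]

# Option 1: drop the empty strings after splitting.
cleaned = with_empty.reject(&:empty?)
p cleaned                         # ["asd", "er", "w"]

# Option 2: match the words themselves instead of the separators.
scanned = text.scan(/[[:alpha:]]+/)
p scanned                         # ["asd", "er", "w"]

# The \p{Alpha} property from the comment, applied to non-ASCII letters.
unicode = "Straße, Maß!".split(/[^\p{Alpha}]+/)
p unicode                         # ["Straße", "Maß"]
```

Leaving out the `+` quantifier is another source of empty strings, since each separator character then splits on its own: `"a, b".split(/[^[:alpha:]]/)` gives `["a", "", "b"]`.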

score 0 · Answer 2 · answered Aug 21 '10 at 19:48


Try \b\w+\b to match whole words, for example with String#scan. Note that \w also matches digits and underscores.
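A minimal sketch of this approach, assuming `String#scan` is used to collect the matches and that `\w+` is used so that empty matches are skipped. The sample text and variable names are made up; the second call is included only to show how `\w` differs from `[[:alpha:]]`:

```ruby
text = "asd, er_2 >w"

# \w matches letters, digits and underscores, so "er_2" counts as one word.
word_chars = text.scan(/\b\w+\b/)
p word_chars                      # ["asd", "er_2", "w"]

# Restricting the match to letters drops the underscore and the digit.
letters_only = text.scan(/[[:alpha:]]+/)
p letters_only                    # ["asd", "er", "w"]
```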


ennuikiller

