Ruby remove all substrings that begin with specific character

Question

I would like to remove all substrings from a string that begin with a pound sign and end in a space or are at the end of the string. I have a working solution, but I'm wondering if there's a more efficient (or equally efficient but less wordy) approach.

For example, I want to take "leo is #confused about #ruby #gsub" and turn it into "#confused #ruby #gsub".

Here is my solution for now, which involves arrays and subtraction.

strip_spaces = str.gsub(/\s+/, ' ').strip()
  => "leo is #confused about #ruby #gsub"
all_strings = strip_spaces.split(" ").to_a
  => ["leo", "is", "#confused", "about", "#ruby", "#gsub"]
non_hashtag_strings = strip_spaces.gsub(/(?:#(\w+))/) {""}.split(" ").to_a
  => ["leo", "is", "about"]
hashtag_strings = (all_strings - non_hashtag_strings).join(" ")
  => "#confused #ruby #gsub"

To be honest, now that I'm done writing this question, I've learned a few things through research/experimentation and become more comfortable with this array approach. But I still wonder if anyone could recommend an improvement.

There is no need to create the non_hashtag_strings array. Just use `select` on all_strings: `all_strings.select { |s| s[0] == "#" }.join(" ")` – Slava.K, Jan 16 '17 at 16:26
You say "I would like to remove all substrings from a string that begin with a pound sign and end in a space." and "I want to take `"leo is #confused about #ruby #gsub"` and turn it into `#confused #ruby #gsub`". Those statements are contradictory unless you are assuming that the former is the first step in answering the latter. If so, that makes it a so-called [XY](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) question. You will learn more by asking how a desired result can be achieved without adding a constraint that a particular approach be taken. — Cary Swoveland, Jan 16 '17 at 19:33

score 3 · Accepted Answer · answered Jan 16 '17 at 16:26

I would do something like this:

string = "leo is #confused about #ruby #gsub"
#=> "leo is #confused about #ruby #gsub"
string.split.select { |word| word.start_with?('#') }.join(' ')
#=> "#confused #ruby #gsub"

spickermann

Thank you @spickermann, I gather `.split` splits on whitespace by default ( http://apidock.com/ruby/String/split ), do you think there's any reason to use `.split(' ')`, or is that redundant? – Leo Folsom Jan 16 '17 at 16:40
Writing `split(' ')` instead of `split` is just redundant IMO. – spickermann Jan 16 '17 at 16:47
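
To illustrate the point about `split`, here is a minimal sketch. It assumes the default field separator `$;` has not been changed, and the sample string `messy` is made up for illustration rather than taken from the question.

```ruby
# Hypothetical input with irregular whitespace (not from the question).
messy = "  leo   is\t#confused\nabout #ruby  "

default_split = messy.split       # no argument: splits on runs of whitespace
space_split   = messy.split(' ')  # a single-space string behaves the same way
regex_split   = "a  b".split(/ /) # a regexp for one space does NOT collapse runs

p default_split                 # => ["leo", "is", "#confused", "about", "#ruby"]
p default_split == space_split  # => true
p regex_split                   # => ["a", "", "b"]
```

So the two spellings should be interchangeable here; only a regexp delimiter such as `/ /` changes the result.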

score 3 · Answer 2 · answered Jan 16 '17 at 16:31

Regexp-only solution

string = "leo is #confused about #ruby #gsub"
string.scan(/#\w+/)
#  => ["#confused", "#ruby", "#gsub"]

If a # sign may also appear inside a word, the regexp becomes slightly more complex:

string = "leo is #confused ab#out #ruby #gsub"
string.scan(/(?<=\s)#\w+/)
#  => ["#confused", "#ruby", "#gsub"]
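
One caveat worth sketching: a lookbehind for whitespace cannot match a hashtag at the very start of the string, because nothing precedes it. A possible alternative (my assumption, not part of the answer above) is a negative lookbehind for a non-whitespace character. The sample string is hypothetical.

```ruby
# Hypothetical input: a hashtag at the very start and a '#' inside a word.
string = "#leo is #confused ab#out #ruby"

# Lookbehind for whitespace: skips "ab#out", but also misses the leading "#leo".
after_space = string.scan(/(?<=\s)#\w+/)

# Negative lookbehind for a non-whitespace character: also matches at the
# start of the string, and still skips "ab#out".
not_after_word = string.scan(/(?<!\S)#\w+/)

p after_space     # => ["#confused", "#ruby"]
p not_after_word  # => ["#leo", "#confused", "#ruby"]
```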

mikdiet

Is there an advantage to Regexp-only over the `start_with?` solutions offered by @spickermann and @Richard? With your solution, I would of course add the final `.join(" ")`. – Leo Folsom Jan 16 '17 at 16:34
Do you have a preference for Regexp only? If so why? If not ... what would you do? I realize we are splitting hairs but I'd like to understand the nuances. – Leo Folsom Jan 16 '17 at 16:40
@MikDiet: IMHO that is not correct. Regexps are usually slower than pure string operations. See: http://stackoverflow.com/a/14275592/2483313 Regexps have other advantages... – spickermann Jan 16 '17 at 16:41
When I talked about performance, I meant big-O complexity, which is the same for all of the solutions. – mikdiet Jan 16 '17 at 17:00
@spickermann not to be argumentative but is this still the case when your version requires an Array iteration? The example is a clean string to string comparison but you are creating an intermediary `Array` by using `String#split` first and then using `Array#join` which I am assuming internally iterates as well to accumulate. – engineersmnky Jan 16 '17 at 18:17
@spickermann according to basic benchmarking MikDiet's response appears to be correct in stating they are equal in performance. (I have added the benchmarks to my answer) – engineersmnky Jan 16 '17 at 18:48
@engineersmnky The benchmark proves that I was wrong. Sorry about that. You are right: creating the array seems to cost about the same as compiling the Regexp in this example. – spickermann Jan 16 '17 at 19:18
@spickermann the benchmarks also prove that I provided examples that are far less performant than yours :). That being said, I would still choose your implementation over the regex for human readability. – engineersmnky Jan 16 '17 at 19:22

engineersmnky · Answer 3 · 2017-01-23T01:02:24.813

Always more ways to skin a cat

s = "leo is #confused about #ruby #gsub"
#remove all the words that do not start with a #
s.gsub(/(?<=^|\s)[^#\s]\w*\s?/, '')
#=> "#confused #ruby #gsub"
#split to Array and grab all the strings that start with #
s.split.grep(/\A#/).join(' ')
#=> "#confused #ruby #gsub"
#split to Array and separate them into 2 groups
starts_with_hash,others = s.split.partition {|e| e.start_with?('#') }
#=>[["#confused", "#ruby", "#gsub"], ["leo", "is", "about"]]
starts_with_hash.join(' ') 
#=> "#confused #ruby #gsub"

Benchmarks of these and the other answers, run with the fruity gem:

require 'fruity'

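# keep only the words that start with "#"
def split_start_with(s)
  s.split.select { |e| e.start_with?("#") }.join(' ')
end

# collect every "#word" match
def with_scan(s)
  s.scan(/#\w+/).join(' ')
end

# delete every word that does not start with "#"
def with_gsub(s)
  s.gsub(/(?<=^|\s)[^#\s]\w*\s?/, '')
end

# keep the words that match /\A#/
def split_grep(s)
  s.split.grep(/\A#/).join(' ')
end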

str = "This is a reasonable string #withhashtags where I want to #test multiple #stringparsing #methods for separating and joinging #hastagstrings together for #speed"

compare do 
  split_start_with_test {split_start_with(str)}
  with_scan_test {with_scan(str)}
  with_gsub_test {with_gsub(str)}
  split_grep_test {split_grep(str)}
end

Results:

Running each test 262144 times. Test will take about 5 minutes.
split_start_with_test is similar to with_scan_test
with_scan_test is faster than with_gsub_test by 60.00000000000001% ± 1.0%
with_gsub_test is faster than split_grep_test by 30.000000000000004% ± 1.0%
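
For readers without the fruity gem, a rough equivalent can be sketched with Ruby's standard `Benchmark` module. This is only an approximation of the comparison above: the input string and the iteration count `n` are hypothetical, the gsub variant is left out for brevity, and absolute timings will differ by machine and Ruby version.

```ruby
require 'benchmark'

# Method names mirror the fruity script above.
def split_start_with(s)
  s.split.select { |e| e.start_with?('#') }.join(' ')
end

def with_scan(s)
  s.scan(/#\w+/).join(' ')
end

def split_grep(s)
  s.split.grep(/\A#/).join(' ')
end

# Hypothetical test input and iteration count, chosen only for illustration.
str = "a #short string with #some #hashtags in it"
n   = 10_000

# All approaches should agree before their timings are compared.
results = [split_start_with(str), with_scan(str), split_grep(str)]
p results.uniq  # => ["#short #some #hashtags"]

Benchmark.bm(18) do |x|
  x.report('split_start_with') { n.times { split_start_with(str) } }
  x.report('with_scan')        { n.times { with_scan(str) } }
  x.report('split_grep')       { n.times { split_grep(str) } }
end
```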

score 1 · Answer 4 · answered Jan 16 '17 at 16:25

You could try this

string.split(' ').select { |e| e.start_with?("#") }.join(' ')

Explanation

split(' ') - Breaks the string into an array of substrings on a delimiter, in this case whitespace

select - Filters the array, keeping only the elements for which the block returns true

|e| e.start_with?("#") - The block passed to select; it is true only for substrings that start with a pound sign

join(' ') - Joins the array back into a single string, with a space between elements
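
The same chain can be unrolled to show each intermediate value. This is a small sketch using the string from the question; the variable names `words`, `hashtags`, and `result` are introduced here only for illustration.

```ruby
string = "leo is #confused about #ruby #gsub"

words    = string.split(' ')                        # break into words
hashtags = words.select { |e| e.start_with?("#") }  # keep only hashtag words
result   = hashtags.join(' ')                       # glue them back together

p words     # => ["leo", "is", "#confused", "about", "#ruby", "#gsub"]
p hashtags  # => ["#confused", "#ruby", "#gsub"]
p result    # => "#confused #ruby #gsub"
```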

Richard Hamilton

Thank you @Richard. Would you be OK editing out the redundant `split(' ')` and just using default `split`? Unless you feel `(' ')` is needed, in which case please explain. – Leo Folsom Jan 16 '17 at 17:52
