Remove whitespace from an array of HTML strings

Question

Given the array:

["<p>&gt;&gt;&gt;Lorem ipsum dolor</p>",
"<p>Lorem ipsum dolor <strong>sit amet, consectetur adipisicing</strong> elit, sed do eiusmod</p>",
"<p>.....</p>",
"<p> ...</p>",
"<p>tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,</p>",
"<p>quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo</p>",
"<p>… </p>",
"<p>consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse</p>",
"<p>…</p>",
"<p>. . . </p>",
"<p> …</p>",
"<p>cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non</p>",
"<p>…</p>",
"<p>…</p>",
"<p>proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>",
"<p></p>",
"<p></p>",
"<p>proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"]

I want an array in which every paragraph that contains only … or ... (possibly with spaces at the beginning, at the end, or between the dots) is replaced with "<p>…</p>", and consecutive runs of such paragraphs are collapsed into a single one:

["<p>&gt;&gt;&gt;Lorem ipsum dolor</p>",
"<p>Lorem ipsum dolor <strong>sit amet, consectetur adipisicing</strong> elit, sed do eiusmod</p>",
"<p>…</p>",
"<p>tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,</p>",
"<p>quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo</p>",
"<p>…</p>",
"<p>consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse</p>",
"<p>…</p>",
"<p>cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non</p>",
"<p>…</p>",
"<p>proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>",
"<p></p>",
"<p></p>",
"<p>proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"]

Please read "[ask]" and the linked pages, and "[mcve]". We expect to see your attempt to solve the problem, either as the sites you searched to find solutions and an explanation why they didn't help, or some code attempt and why it fails to do what you want. Without that it looks like you want us to write code for you, which is off-topic. "[How much research effort is expected of Stack Overflow users?](http://meta.stackoverflow.com/a/261593/128421)" is also useful. — the Tin Man, Aug 22 '16 at 20:54
I guess the nominated exemplar does answer the OP's question, so I'll mark this as a duplicate. Note that all of the exemplar's answers use regexen to manipulate HTML, which is wicked (although sometimes you can get away with it). In the general case, an HTML parser such as Nokogiri is better. In specific cases, regexen can work. — Wayne Conrad, Aug 22 '16 at 21:02

davidhu · Accepted Answer · 2016-08-22T20:24:52.207

I would loop through the array and rewrite every element that contains only dots or an ellipsis character into the desired format.

array.map! do |el|
  # Normalize paragraphs holding only dots or a single "…" (plus optional whitespace).
  if el =~ /<p>(((\s?\.)+(\s+)?)|(\s+)?…(\s+)?)<\/p>/
    '<p>…</p>'
  else
    el
  end
end
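
The pattern above is fairly dense, so an annotated form may be easier to follow. This is only a sketch of an equivalent pattern written with the `x` (extended) flag; the constant name `ELLIPSIS_PARAGRAPH` and the `samples` array are made up for illustration, and `(\s+)?` has been written as `\s*`.

```ruby
# Extended-mode version of the pattern: whitespace and comments inside
# the literal are ignored by the regex engine.
ELLIPSIS_PARAGRAPH = %r{
  <p>
  (?:
    (?:\s?\.)+ \s*   # one or more dots, each optionally preceded by one whitespace character
    |
    \s* … \s*        # or a single ellipsis character with optional surrounding whitespace
  )
  </p>
}x

# Hypothetical sample inputs taken from the shapes that appear in the question.
samples = ['<p>.....</p>', '<p> ...</p>', '<p>… </p>', '<p>. . . </p>', '<p> …</p>',
           '<p></p>', '<p>Lorem ipsum dolor</p>']

matching = samples.select { |s| s.match?(ELLIPSIS_PARAGRAPH) }
```

Here `matching` should contain only the first five samples; the empty paragraph and the ordinary text paragraph are left alone.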

The `map!` call replaces every p tag that holds only a `. . .`-style ellipsis with `<p>…</p>`, resulting in:

["<p>&gt;&gt;&gt;Lorem ipsum dolor</p>",
"<p>Lorem ipsum dolor <strong>sit amet, consectetur adipisicing</strong> elit, sed do eiusmod</p>",
"<p>…</p>",
"<p>…</p>",
"<p>tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,</p>",
"<p>quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo</p>",
"<p>…</p>",
"<p>consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse</p>",
"<p>…</p>",
"<p>…</p>",
"<p>…</p>",
"<p>cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non</p>",
"<p>…</p>",
"<p>…</p>",
"<p>proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>",
"<p></p>",
"<p></p>",
"<p>proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"]

Then I would check each element against the previous one and delete it if both are equal to `<p>…</p>`:

# Walk backwards so that deleting an element does not shift the indices still to be visited.
idx = array.length - 1
while idx > 0
  if array[idx] == array[idx - 1] && array[idx] == '<p>…</p>'
    array.delete_at(idx)
  end
  idx -= 1
end
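
If mutating the array in place is not required, the same collapsing step can probably be written without index bookkeeping. The following is a minimal sketch, assuming Ruby 2.3 or later for `Enumerable#chunk_while`; the names `ELLIPSIS` and `squeeze_ellipses` are hypothetical and not part of the original answer.

```ruby
ELLIPSIS = '<p>…</p>'

# Returns a new array in which each run of consecutive ELLIPSIS
# elements is reduced to a single element. Other duplicates
# (for example two empty paragraphs) are kept as they are.
def squeeze_ellipses(paragraphs)
  paragraphs
    .chunk_while { |prev, curr| prev == ELLIPSIS && curr == ELLIPSIS }
    .map(&:first)
end

# Hypothetical input that has already been normalized by the first step.
normalized = ['<p>a</p>', ELLIPSIS, ELLIPSIS, '<p>b</p>', ELLIPSIS,
              '<p></p>', '<p></p>']
squeezed = squeeze_ellipses(normalized)
```

Because `chunk_while` only groups neighbours when both are the ellipsis paragraph, every other element ends up in a chunk of its own and passes through unchanged.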

While waiting for an answer I found this solution: `array.join('').gsub(/(<p>[… .]+<\/p>)+/, '<p>…</p>').gsub(/<\/p><p>/, "<\/p>\n<p>").split("\n")`. The regex may not be quite right, but I will adjust it to my needs. — Victor Borshchov, Aug 22 '16 at 20:26
