I'm writing a parser that gets data from hidden iframes.

In the text I need to replace \n (↵) characters with a space. I use text.replace(/\n/gi, " ") for this. However, it only works for visible elements (i.e. ones that don't have display: none). If the element is not visible (display: none), the newlines just disappear and don't get replaced.

HTML Example:

<div data-custom="languages">
    <div>
        <div>
            <h2>
                <span>Just a text that will be removed</span>
            </h2>
            <p>A - b</p>
            <p>c - d</p>
        </div>
    </div>
</div>

JS Example:

visibleIframe.style.display = "block";
invisibleIframe.style.display = "none";

const visibleDivWithNestedDivs = visibleIframe.querySelector(`[data-custom="languages"]`);
const invisibleDivWithNestedDivs = invisibleIframe.querySelector(`[data-custom="languages"]`);

const visibleText = visibleDivWithNestedDivs.innerText; // "A - b↵c - d"
const invisibleText = invisibleDivWithNestedDivs.innerText; // "A - b↵c - d"

console.log(visibleText.replace(/\n/gi, " ")); // "A - b c - d" (expected result)
console.log(invisibleText.replace(/\n/gi, " ")); // "A - bc - d" (unexpected result, no space between "b" and "c")

What I tried:

.replace(/\n/gi, " ")
.replace(/\r\n/gi, " ")
.replace(/↵/gi, " ")
.replace(/↵↵/gi, " ") // in some cases there were two of these.
.split("↵").join(" ") 
.split("\n").join(" ")
white-space: pre
white-space: pre-wrap

Did you test it?

I'm 99% sure it's because of display: none. I tested it, and different display values on the iframes give me different results.

textContent

I don't need textContent because it returns the text without \n characters. I use innerText.

Questions:

  1. Could the unexpected result be caused by something other than display: none?
  2. What should I do to achieve the expected result?
Amaimersion

2 Answers

First, let's clear up a few misunderstandings you seem to have based on the examples you've provided.

↵ is a Unicode character described as DOWNWARDS ARROW WITH CORNER LEFTWARDS. Sure, it makes a nice visual representation of a line break or the Return/Enter key, but it has no special meaning in code. If you use this symbol in a regular expression, the regular expression will try to match text that contains the arrow symbol itself.

In most programming languages, \n in a string represents a line break, and you don't have to be bothered by how it is represented under the hood, be it with a CR, an LF, or both. So I wouldn't use \r in JavaScript.

.replace(/\n/gi, " ") is a perfectly valid option, depending on what you want to do. You might want to replace any sequence of whitespace that includes newlines, however. In that case, I would use this instead: .replace(/\s+/g, " "). The \s special code in RegExp matches any kind of white space, including line breaks. Adding a + makes it match any sequence of white space, and the g flag makes it replace every such sequence rather than only the first one. Using this will ensure that a string like "a \n \n b" gets turned into "a b".
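To make the difference concrete, here is a small sketch comparing the patterns discussed above. The sample string and variable names are made up for illustration only.

```javascript
// Sample string with spaces around the line breaks (illustrative only).
const sample = "a \n \n b\nc";

// Replaces each line break on its own; the surrounding spaces stay,
// so runs of several spaces remain in the result.
const newlinesOnly = sample.replace(/\n/g, " ");

// Collapses every run of whitespace (spaces, tabs, line breaks) into one space.
const collapsed = sample.replace(/\s+/g, " ");

// Without the g flag only the first run of whitespace is replaced.
const firstOnly = sample.replace(/\s+/, " ");

// The arrow symbol is an ordinary character, so this pattern never
// matches a real line break and the string comes back unchanged.
const arrowUntouched = "a\nb".replace(/↵/g, " ");

console.log(JSON.stringify(newlinesOnly));
console.log(JSON.stringify(collapsed));      // "a b c"
console.log(JSON.stringify(firstOnly));      // "a b\nc"
console.log(JSON.stringify(arrowUntouched)); // "a\nb"
```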

Now that the regular expression issues have been dealt with, let's look at innerText. According to the HTML Living Standard which I found by looking at the MDN article for innerText, the innerText property is an approximation of what the user will get when copy-pasting the text from that element. It is defined like this:

If this element is not being rendered, or if the user agent is a non-CSS user agent, then return the same value as the textContent IDL attribute on this element. Note: This step can produce surprising results, as when the innerText attribute is accessed on an element not being rendered, its text contents are returned, but when accessed on an element that is being rendered, all of its children that are not being rendered have their text contents ignored.

This answers why there might be a difference between visible and hidden elements. As for the number of line breaks, the algorithm that determines how many line breaks are in the string is defined recursively on the standard page and it is quite confusing, which is why I would advise not to base your logic on the behavior of this function. innerText is meant to be an approximation.

I suggest taking a look at textContent, which isn't affected by CSS.

So to wrap up this long explanation:

  1. Yes, display: none does influence innerText
  2. I might use foo.textContent.replace(/\s+/g, " ") depending on what your goal is.
Domino
  • Thank you! That clears things up for me. 1) `.replace(/\s+/, " ")` – awesome, I will use it! 2) `textContent` – again, according to [MDN](https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent) "...textContent returns the concatenation of the textContent of every child node...". Unfortunately, it doesn't do the trick for me. It just returns a concatenation without any separator, but I need one space in between. _If only I could call it as a function and set a separator._ Looks like I need to write my own implementation of `textContent`, but with the ability to use a separator. – Amaimersion Sep 24 '18 at 17:31
  • Ah, I see what you mean. I'm afraid that the only way you'll be able to extract the content you want with 100% confidence that it works everywhere is to iterate through the children manually, yes. – Domino Sep 24 '18 at 18:26
  • I just created a [workaround](https://stackoverflow.com/questions/52480730/replace-n-in-non-render-non-display-element-text#52486712) that uses `innerHTML`. However, it can't be relied on with 100% confidence :D – Amaimersion Sep 24 '18 at 20:18
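As a rough sketch of the "iterate through the children manually" idea from the comments: the hypothetical helper below (`collectText` is not part of any API) walks a node tree, collects the text of every text node, and joins the pieces with a separator. It relies only on `nodeType`, `nodeValue` and `childNodes`, so it does not depend on whether the element is rendered. The plain objects at the end are stand-ins for DOM nodes so the snippet can run outside a browser.

```javascript
const TEXT_NODE = 3; // same numeric value as Node.TEXT_NODE in the DOM

// Hypothetical helper: joins the text of all descendant text nodes
// with `separator`, skipping text nodes that contain only whitespace.
function collectText(node, separator = " ") {
    const parts = [];

    const walk = (current) => {
        if (current.nodeType === TEXT_NODE) {
            // Normalise whitespace inside a single text node.
            const text = current.nodeValue.replace(/\s+/g, " ").trim();

            if (text) {
                parts.push(text);
            }

            return;
        }

        for (const child of current.childNodes || []) {
            walk(child);
        }
    };

    walk(node);

    return parts.join(separator);
}

// Stand-in tree that mimics <div><p>A - b</p><p>c - d</p></div>.
const text = (value) => ({ nodeType: TEXT_NODE, nodeValue: value });
const elem = (...children) => ({ nodeType: 1, childNodes: children });
const tree = elem(text("\n    "), elem(text("A - b")), text("\n    "), elem(text("c - d")), text("\n"));

console.log(collectText(tree));        // "A - b c - d"
console.log(collectText(tree, " | ")); // "A - b | c - d"
```

In a browser the same function should accept a real element, for example the result of `querySelector`. Note that this sketch also picks up the text of `script` and `style` elements if they are inside the subtree.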

So, following Jacque Goupil's awesome answer, I created my own workaround. It uses innerHTML.

Algorithm:

  1. Get innerHTML of an element.
  2. Decode HTML entities.
  3. Remove HTML stuff (tags, etc.).
  4. Replace multiple spaces with a single space.
  5. Replace a space between words with a separator.

Warnings:

  • It's just a workaround.
  • It's pretty slow and not suitable for regular usage!
  • It parses HTML with regular expressions. That is really dangerous and can break things. Make sure the regular expression is appropriate for your HTML structure.

Code:

/**
 * Returns the text value of an element (and its children).
 *
 * @param {Document} dcmnt
 * The `document` in which the element will be searched for.
 *
 * @param {string} selector
 * The selector used to find the element.
 *
 * @param {string} separator
 * A separator that replaces each single space in the resulting text,
 * including the spaces between the text of different elements.
 * Defaults to `" "` (one space).
 *
 * @returns {string}
 * If the element was found, its text value; otherwise an empty string.
 *
 * Warning!
 *
 * This method is pretty slow, because it parses a slice of HTML
 * instead of just reading a text value. That is necessary because of
 * elements that are not rendered (i.e. that have `display: none`).
 * `innerText` and `textContent` return unsuitable results
 * for such elements.
 * For more, see:
 *
 * @see https://stackoverflow.com/questions/52480730/replace-n-in-non-render-non-display-element-text
 */
function getTextValue(dcmnt, selector, separator) {
    separator = separator || " ";
    const element = dcmnt.querySelector(selector);

    if (!element) {
        return "";
    }

    /**
     * @see https://stackoverflow.com/questions/7394748/whats-the-right-way-to-decode-a-string-that-has-special-html-entities-in-it#7394787
     */
    const _decodeEntities = (html) => {
        const textArea = document.createElement("textarea");
        textArea.innerHTML = html;

        return textArea.value;
    };

    let innerHTML = element.innerHTML;

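    // decode entities in the HTML, but keep tags and other stuff.
    // note: escaped markup in the text (e.g. `&lt;b&gt;`) becomes `<b>` here
    // and is then stripped as a tag by the next step.
    innerHTML = _decodeEntities(innerHTML);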

    // replace HTML stuff with a space.
    // @see https://stackoverflow.com/questions/6743912/get-the-pure-text-without-html-element-by-javascript#answer-6744068
    innerHTML = innerHTML.replace(/<[^>]*>/g, " ");

    // replace multiple spaces with a single space.
    innerHTML = innerHTML.replace(/\s+/g, " ");

    // remove space from beginning and ending.
    innerHTML = innerHTML.trim();

    // at this point there is only one space between words,
    // so we replace each space with the separator.
    innerHTML = innerHTML.replace(/ /g, separator);

    return innerHTML;
}
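To see what the string-processing steps do without a browser, here is a DOM-free sketch of steps 3 to 5 only. The function name `htmlToText` is hypothetical, and entity decoding is left out because the `textarea` trick above needs a DOM.

```javascript
// Hypothetical, DOM-free sketch of steps 3-5 of the workaround above.
// Entity decoding (step 2) is intentionally omitted.
function htmlToText(html, separator = " ") {
    return html
        .replace(/<[^>]*>/g, " ") // replace tags with a space
        .replace(/\s+/g, " ")     // collapse runs of whitespace
        .trim()                   // drop leading and trailing space
        .replace(/ /g, separator); // swap each remaining space for the separator
}

const html = `
<div>
    <p>A - b</p>
    <p>c - d</p>
</div>`;

console.log(htmlToText(html));      // "A - b c - d"
console.log(htmlToText(html, "_")); // "A_-_b_c_-_d"
```

The second call shows a consequence of the last step: the separator replaces every single space, including the ones inside "A - b", not only the boundaries between elements.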

Gist.

Amaimersion