
I'm trying to split the content of a file into chunks of a fixed size (say 40000 characters, including whitespace and everything else).

But my current code also splits the content at every line break, which is unwanted behavior.

        var files = $('#upload').get(0).files;
        if (files.length > 0) {
            var reader = new FileReader()
            reader.onloadend = function () {
                var content = reader.result
                var buffer = 40000
                var contentList = content.match(new RegExp('.{1,' + buffer + '}', 'gm'))
                console.info('list : ', contentList)
            }
            reader.readAsBinaryString(files[0])
        }

As an extra question: I can see no indication of new lines in the file being read, although the file clearly contains multiple lines. I'd expect to see something like \n once in a while.

Michael Tot Korsgaard
  • Possibly a duplicate of [Javascript regex multiline flag doesn't work](http://stackoverflow.com/questions/1068280/javascript-regex-multiline-flag-doesnt-work); what happens when you use `[\s\S]` instead of `.`? – apsillers Jan 23 '17 at 21:34
  • @apsillers: How would you translate `[/s/S]` into my RegExp? – Michael Tot Korsgaard Jan 23 '17 at 21:37
  • `new RegExp('[\\s\\S]{1,' + buffer + '}', 'gm')` I think. – apsillers Jan 23 '17 at 21:38
  • Is it possibly just a problem with the event listener firing whenever you change the line? How many times is the reader.onloadend running? Or am I misinterpreting this: "But what I have splits the array when there's a line change as well, which is unwanted behavior." – Zargold Jan 23 '17 at 21:38
  • @apsillers: now it returns an array with two objects, both containing nothing but the string 's' – Michael Tot Korsgaard Jan 23 '17 at 21:41
  • have you tried the `m` flag in regex..? Like `.match(/[\s\S]{40000}/gm);` – Redu Jan 23 '17 at 21:41
  • @Redu The code does use the `m` modifier already (with the `'gm'` argument). Using it in a literal should not have any different effect. – apsillers Jan 23 '17 at 21:42
  • @MichaelTotKorsgaard I mistyped the comment originally, sorry. How does the current comment do (with `[\\s\\S]` to escape the slashes within the string)? When I do `"foo\nbar".match(new RegExp("[\\s\\S]{1,400}", "gm"))` I see a complete match with a newline char in the middle – apsillers Jan 23 '17 at 21:44

1 Answer

UPDATE: I've just looked at what the XRegExp library does to support capturing newline characters, and it is very simple: it replaces every . (which matches any character except line terminators) with the character class [\s\S], which matches any character at all. This works because \s matches a specific set of whitespace characters, and \S (capital S) matches the exact opposite of \s. Take the union of the two, and there's no character that won't match. So, @apsillers' suggestion is exactly correct: replace your dot with [\s\S] to match any character.
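To make the difference concrete, here is a minimal sketch comparing the two patterns on a short string. The helper name `chunkWithRegExp` and the sample data are made up for illustration; only the `[\s\S]` character class comes from the suggestion above. It also shows why the newlines seem to disappear with the dot: they are never part of any match, so they are simply dropped between chunks.

```javascript
// Hypothetical helper (not from the original post): split `text` into
// chunks of at most `size` characters, line breaks included.
function chunkWithRegExp(text, size) {
    // [\s\S] matches any character, including \n and \r, unlike the dot.
    var pattern = new RegExp('[\\s\\S]{1,' + size + '}', 'g');
    // match() returns null when nothing matches (e.g. an empty string).
    return text.match(pattern) || [];
}

var sample = 'foo\nbar\nbaz';

// The dot stops at every newline, and the newlines themselves are lost.
var dotChunks = sample.match(new RegExp('.{1,5}', 'g'));

// The character class ignores line structure and keeps every character.
var classChunks = chunkWithRegExp(sample, 5);

console.info(dotChunks);   // [ 'foo', 'bar', 'baz' ]
console.info(classChunks); // [ 'foo\nb', 'ar\nba', 'z' ]
```

Joining `classChunks` back together gives the original string, which is a quick way to check that nothing was dropped.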


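What you're looking for is called "single line mode" (also known as "dotall"), and unfortunately, JavaScript didn't support it when this answer was written (ECMAScript 2018 later added the `s` flag for it):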

A couple of options:

  1. As suggested in that blog, you could use the XRegExp library.

  2. You could try replacing newlines with a Unicode code point you're certain won't show up in your data, and then replacing it back after doing the RegExp match:

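    var input = ...;
    var buffer = 40000; // chunk size, as in the question

    // Swap each newline for a placeholder the dot can match. This assumes
    // U+27BF never occurs in the data. The dot also skips \r, so content
    // with \r\n line endings would need a second placeholder for \r.
    var inputSingleLine = input.replace(/\n/g, "\u27BF");

    // match() returns null when there is nothing to match, hence the fallback.
    var contentList = inputSingleLine.match(new RegExp('.{1,' + buffer + '}', 'g')) || [];

    // Put the newlines back in every chunk.
    for (var index = 0; index < contentList.length; index++)
        contentList[index] = contentList[index].replace(/\u27BF/g, "\n");

    console.info('list : ', contentList);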
    

    (this assumes you can get the entire file, including all line breaks, into a single variable before you start matching)
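If the goal is only fixed-size chunks, a regular expression may not be needed at all. The following is a sketch of a different technique from the two options above: plain `String.prototype.slice` in a loop. The helper name `chunkBySlice` and the sample string are assumptions for illustration, and sizes are counted in UTF-16 code units, the same unit `.length` uses.

```javascript
// Hypothetical helper (not part of the original answer): cut `text` into
// pieces of at most `size` UTF-16 code units, leaving line breaks untouched.
function chunkBySlice(text, size) {
    var chunks = [];
    for (var start = 0; start < text.length; start += size) {
        // slice() clamps the end index, so the last chunk may be shorter.
        chunks.push(text.slice(start, start + size));
    }
    return chunks;
}

// Sample data mixing \r\n and \n line endings.
var sample = 'line one\r\nline two\nline three';
var contentList = chunkBySlice(sample, 10);

console.info('list : ', contentList);
```

Because no pattern matching is involved, `\n`, `\r`, and any other character are treated alike, and no placeholder character has to be reserved.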

Jonathan Gilbert