Finding text strings in JavaScript

Question

I have a large valid JavaScript file (utf-8), from which I need to extract all text strings automatically.

For simplicity, the file doesn't contain any comment blocks in it, only valid ES6 JavaScript code.

Once I find an occurrence of ' or " or `, I need to scan for the end of the text block, and that is where I got stuck, given all the possible variations, like "'", '"', "\'", '\"', `\``, etc.

Is there a known and/or reusable algorithm for detecting the end of a valid ES6 JavaScript text block?

UPDATE-1: My JavaScript file isn't just large; I also have to process it as a stream, in chunks, so Regex is absolutely not usable. I didn't want to complicate my question by mentioning code split across chunks. I will figure that out myself if I have an algorithm that works for a single piece of code that's in memory.
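
For what it's worth, carrying the scanner state across chunk boundaries could look roughly like the sketch below. The class name `ChunkedStringScanner` and its fields are hypothetical, and the sketch assumes there are no regular expression literals, comments or `${...}` expressions in the input, so it is an illustration of the state handling only, not a complete solution.

```javascript
// Hypothetical class: keeps scanner state between chunks so that a literal
// may start in one chunk and end in another. Regular expression literals,
// comments and template ${...} expressions are deliberately ignored.
class ChunkedStringScanner {
  constructor() {
    this.quote = null;    // opening quote character while inside a literal
    this.escaped = false; // true if the previous character was a backslash
    this.current = '';    // raw text of the literal collected so far
    this.found = [];      // completed literals, quotes included
  }

  write(chunk) {
    for (const ch of chunk) {
      if (this.quote === null) {
        // Outside a literal: only a quote character changes the state.
        if (ch === "'" || ch === '"' || ch === '`') {
          this.quote = ch;
          this.current = ch;
        }
        continue;
      }
      this.current += ch;
      if (this.escaped) {
        this.escaped = false;          // this character was escaped
      } else if (ch === '\\') {
        this.escaped = true;           // next character is escaped
      } else if (ch === this.quote) {
        this.found.push(this.current); // matching close quote
        this.quote = null;
      }
    }
  }
}

const scanner = new ChunkedStringScanner();
// The literal 'it\'s' is split across two chunks, right after the backslash.
scanner.write("var a = 'it\\");
scanner.write("'s'; var b = \"x\";");
console.log(scanner.found);
```

Because the `escaped` flag lives on the object rather than in a local variable, a backslash at the very end of one chunk still protects the first character of the next one.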

UPDATE-2: I got this working initially, thanks to the many pieces of advice given here, but then I got stuck again because of Regular Expressions.

Examples of Regular Expressions that break any of the text detection techniques suggested so far:

/'/
/"/
/\`/
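
To make the failure concrete, here is a minimal sketch of a quote-matching scan that honours backslash escapes but knows nothing about regular expression literals. The function name `naiveStrings` is hypothetical and is not taken from any of the suggestions here.

```javascript
// Hypothetical helper: collects everything between a quote character and the
// next unescaped occurrence of the same character.
function naiveStrings(src) {
  const found = [];
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    if (ch === "'" || ch === '"' || ch === '`') {
      let j = i + 1;
      while (j < src.length && src[j] !== ch) {
        if (src[j] === '\\') j++; // skip the escaped character
        j++;
      }
      found.push(src.slice(i, j + 1));
      i = j + 1;
    } else {
      i++;
    }
  }
  return found;
}

const source = "var re = /'/; var s = 'abc';";
const result = naiveStrings(source);
console.log(result); // the literal 'abc' is not among the results
```

The quote inside `/'/` is taken as an opening quote, so the scan pairs it with the quote that really opens `'abc'`, and everything after that is out of step.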

Having studied the matter more closely, by reading How does JavaScript detect regular expressions?, I'm afraid that detecting regular expressions in JavaScript is a whole new ball game, worth a separate question, or else it gets too complicated. But I would very much appreciate it if somebody could point me in the right direction with this issue...

UPDATE-3: After much research I found, with regret, that I cannot come up with an algorithm that would work in my case, because the presence of Regular Expressions makes the task far more complicated than I initially thought. According to When parsing Javascript, what determines the meaning of a slash?, determining the beginning and end of regular expressions in JavaScript is one of the most complex and convoluted tasks. And without it we cannot tell whether the symbols ', " and ` open a text block or sit inside a regular expression.

@Xotic750 that looks good for some serious parsing, but for my case it is probably an overkill. — vitaly-t, Dec 25 '15 at 09:59
@connexo, as per the update in my question - because the input file is too large to use Regular expressions. — vitaly-t, Dec 25 '15 at 10:04
And what would you expect with something like `var x={ a:1, 'b':2};` String/s or not? — Xotic750, Dec 25 '15 at 10:05
@Xotic750, I would expect the code to detect `'b'` as a text string, i.e. I need all text strings, regardless of the context. — vitaly-t, Dec 25 '15 at 10:07
`a` is also a string as it is an object key, just not quoted. — Xotic750, Dec 25 '15 at 10:08
@Xotic750 I don't care about the meaning, only about the declaration syntax. — vitaly-t, Dec 25 '15 at 10:09
I would still go with something like Esprima and search through the raw values for quoted values. I see no point in reinventing the wheel. (it may even have some clever options to allow you to do what you want directly) — Xotic750, Dec 25 '15 at 10:20
@Xotic750 I tried to keep my question simple and precise. For me this task is only a part of a larger, much more complex parsing algorithm. So it is not reinventing the wheel for me, rather fixing a small piece of something much larger. — vitaly-t, Dec 25 '15 at 10:25
Did you already try regex on your files, or are you assuming it won't work or will be too slow? Also, what do you mean by "large"? Few MBs? Few GBs? — Shanoor, Dec 25 '15 at 10:26
Have a look through their source code and see how they deal with it. — Xotic750, Dec 25 '15 at 10:27
@ShanShan do you think if I run Regex against a JavaScript file of some 10MByte+ then its slow speed would be a far-fetched assumption? — vitaly-t, Dec 25 '15 at 10:28
I don't know if it's easily usable in your case (you're reading chunk by chunk) but using `event-stream` to read an 8MB file line by line (170k lines), it takes half a second to get the lines containing a specific string. Not so slow IMO. — Shanoor, Dec 25 '15 at 10:52

score 4 · Accepted Answer · answered Dec 26 '15 at 12:50


The only way to parse JavaScript is with a JavaScript parser. Even if you were able to use regular expressions, at the end of the day they are not powerful enough to do what you are trying to do here.

You could either use one of several existing parsers, which are very easy to use, or you could write your own, simplified to focus on the string extraction problem. I can hardly imagine you want to write your own parser, even a simplified one. You will spend much more time writing and maintaining it than you might think.

For instance, an existing parser will handle something like the following without breaking a sweat.

`foo${"bar"+`baz`}`
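
To give a sense of what a parser has to track for this input, here is a minimal sketch that finds the end of a template literal with nested `${...}` expressions. The function names `findTemplateEnd`, `skipExpression` and `findQuoteEnd` are hypothetical, and the sketch assumes the expressions contain no regular expression literals or comments, which is exactly the part a real parser handles for you.

```javascript
// Returns the index of the backtick closing the template literal that opens
// at src[start], or -1 if it is unterminated.
function findTemplateEnd(src, start) {
  let i = start + 1;
  while (i < src.length) {
    const ch = src[i];
    if (ch === '\\') { i += 2; continue; } // escaped character
    if (ch === '`') return i;              // end of this template
    if (ch === '$' && src[i + 1] === '{') {
      i = skipExpression(src, i + 2);      // index of the matching }
      if (i === -1) return -1;
    }
    i++;
  }
  return -1;
}

// Returns the index of the } that closes a ${ expression whose body starts
// at src[i], descending into nested strings and templates on the way.
function skipExpression(src, i) {
  let depth = 0;
  while (i < src.length) {
    const ch = src[i];
    if (ch === '`') {
      i = findTemplateEnd(src, i);
      if (i === -1) return -1;
    } else if (ch === '"' || ch === "'") {
      i = findQuoteEnd(src, i);
      if (i === -1) return -1;
    } else if (ch === '{') {
      depth++;
    } else if (ch === '}') {
      if (depth === 0) return i;
      depth--;
    }
    i++;
  }
  return -1;
}

// Returns the index of the quote closing the literal that opens at src[start].
function findQuoteEnd(src, start) {
  for (let i = start + 1; i < src.length; i++) {
    if (src[i] === '\\') i++;                 // skip the escaped character
    else if (src[i] === src[start]) return i; // matching close quote
  }
  return -1;
}

const source = 'x = `foo${"bar"+`baz`}`;';
const start = source.indexOf('`');
const end = findTemplateEnd(source, start);
console.log(source.slice(start, end + 1));
```

Even this restricted case needs two mutually recursive scanners, which is part of why reaching for an existing parser is usually the cheaper route.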

The obvious candidates for parsers to use are esprima and babel.

By the way, what are you planning to do with these strings once you extract them?


It would have been more useful if you had been more specific in your suggestions, for example by starting with this: https://astexplorer.net/. Understand that for any newcomer, AST parsers can be difficult to get to grips with, and it is hard to figure out which one to use and why. – vitaly-t Jan 01 '16 at 13:04
I hardly see how I could have been more specific than suggesting esprima and babel. Esprima has an easily-findable online sandbox. – Jan 01 '16 at 13:06
When I was asking the question I had never even heard of AST parsers, much less understood how they could help me. I do now though, after very extensive research on the subject. That's what I meant when I said your answer wasn't specific enough for someone without any experience with AST parsers. – vitaly-t Jan 01 '16 at 13:09
And I would dispute the statement `The obvious candidates for parsers to use are esprima and babel.`. The choice is far from obvious, if you look at https://astexplorer.net/. They all have pros and cons. – vitaly-t Jan 01 '16 at 13:14
@vitaly-t Sorry for not being sensitive to issues faced by those not acquainted with parsers and for the cavalier attitude implied by my "just use a parser" answer. In my defense, I doubt if the SO Q&A format is the right forum for an introduction to the notion of JS parsers, but on the other hand I could have provided a simple example. – Jan 01 '16 at 13:26
I'm accepting the answer, mostly because in my case even the question is no longer relevant, as I've moved way past it and finished everything I wanted with the help of esprima. The other question - http://stackoverflow.com/questions/34524618/enumerate-regular-expressions-via-uglifyjs was way more interesting and a very practical example on the subject. – vitaly-t Jan 01 '16 at 13:34

Roland Illig · Answer 2 · 2015-12-25T10:09:56.303

0

If you only need an approximate answer, or if you want to get the string literals exactly as they appear in the source code, then a regular expression can do the job.

Given the string literal "\n", do you expect a single-character string containing a newline or the two characters backslash and n?

In the former case you need to interpret escape sequences exactly like a JavaScript interpreter does. What you need is a lexer for JavaScript, and many people have already programmed this piece of code.
In the latter case the regular expression has to recognize escape sequences like \x40 and \u2026, so even in that case you should copy the code from an existing JavaScript lexer.
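
To illustrate the difference between the two cases, here is a minimal sketch that turns the raw body of a literal into the string value it denotes. The helper `decodeEscapes` is hypothetical and covers only a few common escapes; it is not a complete implementation of the ES6 rules (no `\u{...}`, octal escapes or line continuations), which is why copying from an existing lexer is the safer route.

```javascript
// Hypothetical helper: decodes a few common escape sequences found in the
// raw body of a quoted literal (the text between the quotes).
function decodeEscapes(rawBody) {
  const simple = { n: '\n', t: '\t', r: '\r', b: '\b', f: '\f', v: '\v', 0: '\0' };
  return rawBody.replace(
    /\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|([\s\S]))/g,
    (match, hex2, hex4, ch) => {
      if (hex2 !== undefined) return String.fromCharCode(parseInt(hex2, 16));
      if (hex4 !== undefined) return String.fromCharCode(parseInt(hex4, 16));
      // Single-character escape: known ones map to control characters,
      // anything else (such as \' or \") stands for itself.
      return Object.prototype.hasOwnProperty.call(simple, ch) ? simple[ch] : ch;
    }
  );
}

// String.raw keeps the backslashes, so this is the text as it appears in source.
const rawBody = String.raw`\x40\u2026\n\'`;
const decoded = decodeEscapes(rawBody);
console.log(rawBody.length); // 14 characters as written in the source
console.log(decoded.length); // 4 characters once interpreted
```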

See https://github.com/douglascrockford/JSLint/blob/master/jslint.js, function tokenize.

edited Dec 25 '15 at 10:09

answered Dec 25 '15 at 10:00


To phrase it simply, once I find the index of a text-opening symbol, like `'` or `"` or \`, I need to find the index of the corresponding text-closing symbol. And I'm not sure that RegEx would be a good solution for large JavaScript files. – vitaly-t Dec 25 '15 at 10:03
That method `tokenize` seems to make up most of the library, it is huge. I was hoping for something simpler, if possible. – vitaly-t Dec 26 '15 at 13:05

Andriy Ivaneyko · Answer 3 · 2015-12-25T11:55:47.180

-1

Try the code below:

    var txt = "var z,b \n;z=10;\n b='321`1123`321321';\n c='321`321`312`3123`';";

    // Naive scan: does not handle escaped quotes or double-quoted strings.
    function fetchStrings(txt, breaker) {
      var result = [];
      for (var i = 0; i < txt.length; i++) {
        // Possible string start characters
        if (txt[i] == "'" || txt[i] == "`") {
          // Text up to the next occurrence of the same character
          var textString = txt.slice(i + 1, i + 1 + txt.slice(i + 1).indexOf(txt[i]));
          result.push(textString);
          // Jump to the end of the fetched string
          i = i + textString.length + 1;
        }
      }
      return result;
    }

    console.log(fetchStrings(txt));

edited Dec 25 '15 at 11:55

answered Dec 25 '15 at 10:51


If you remove `;` in the end of the input string, the algorithm no longer works... – vitaly-t Dec 25 '15 at 11:28
@vitaly-t Thanks, code updated, just `txt.slice(i+1,-1)` replaced with `txt.slice(i+1)`. Hope it will be useful for you. – Andriy Ivaneyko Dec 25 '15 at 11:56
You define `fetchStrings` to take a `breaker` argument, but never use it. Also, is this going to work with strings like `"foo\"bar"`? – Dec 25 '15 at 18:11
Please see my Update-2 in the question. – vitaly-t Dec 26 '15 at 08:22
