
I have a text retrieved from a search result that contains some words that match the string that's been searched.

I need to truncate the text in a way similar to what Google does in its search results.

The keywords are highlighted, most of the text that does not contain the keywords is truncated, and ellipses are added. If the keywords appear more than once in the text, each of those parts is still included. How would you structure a regex in JavaScript that does something like this?

Thanks

Giovanni
  • So basically that's a *help me find a library* question? If yes it's off-topic on SO. – Roko C. Buljan Aug 05 '15 at 11:18
  • see the source of google in developer tools you will know – Raghavendra Aug 05 '15 at 11:18
  • @RokoC.Buljan Some help with a regex that does this would be good :) – Giovanni Aug 05 '15 at 11:21
  • @raghavendra Could you link something more specific? I wasn't aware that they published source code. Thank you for your help :D – Giovanni Aug 05 '15 at 11:21
  • Open developer tools and inspect the words manually. The matched words come directly from the server, so they are generating them on the server itself. – Raghavendra Aug 05 '15 at 11:23
  • @Giovanni (strange I have to say that...) but simply google for: `javascript jquery highlight words :stackoverflow`, browse all the results, mash up some code - post the best you've tried so far - and then ask a specific question. – Roko C. Buljan Aug 05 '15 at 11:23
  • If you want to do it with JS, you have to read the content, match the words, and replace them with some tags with a class, or use some third-party libs. – Raghavendra Aug 05 '15 at 11:24
  • @RokoC.Buljan thanks for the insight, but I did search that beforehand. I'm not just trying to accomplish text highlighting, but something that truncates the text as well. – Giovanni Aug 05 '15 at 11:27
  • @Giovanni You really should post what you have tried so far... – Przemysław Jan Wróbel Aug 05 '15 at 11:28
  • @Giovanni I realize that. Do you have any code to share so far? I mean it would be of help for anyone to have at least a "working" example code to improve/fix... – Roko C. Buljan Aug 05 '15 at 11:29

1 Answer


Javascript Truncate words like Google

const regEsc = (str) => str.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&");

const string = "Lorem Ipsum is simply dummy book text of the printing and text book typesetting industry. Dummy Lorem Ipsum has been the industry's standard dummy Ipsum text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.";
const queryString = "lorem";

const rgxp = new RegExp("(\\S*.{0,10})?("+ regEsc(queryString) +")(.{0,10}\\S*)?", "ig");
const results = [];

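// Collect one "…before<b>match</b>after…" fragment per match.
// .replace() is used here only to iterate; its return value is ignored.
string.replace(rgxp, function(m, $1, $2, $3){
  results.push(`${$1?"…"+$1:""}<b>${$2}</b>${$3?$3+"…":""}`);
});

// Demo: highlight every matched fragment inside the full text.
document.body.innerHTML = string.replace(rgxp, "<span>$1<b>$2</b>$3</span>");

/* CSS for the demo */
span{background:yellow;}
b{color:red}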
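
The snippet above collects the fragments in `results` but only renders the highlighted full text. Here is a minimal sketch of how those fragments might be joined into a single Google-style excerpt. The `excerpt` function and its `pad` parameter are hypothetical names introduced for illustration, not part of the original demo, and the text is assumed to contain no HTML.

```javascript
// Escape characters that have a special meaning inside a regular expression.
const regEsc = (str) => str.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&");

// Hypothetical helper: builds one excerpt string from all matches of `query`.
// `pad` is the maximum number of context characters kept on each side.
function excerpt(text, query, pad = 10) {
  const rgxp = new RegExp(
    "(\\S*.{0," + pad + "})?(" + regEsc(query) + ")(.{0," + pad + "}\\S*)?",
    "ig"
  );
  const fragments = [];
  for (const [, before, hit, after] of text.matchAll(rgxp)) {
    // Add an ellipsis only on the side where some context was captured.
    fragments.push(`${before ? "…" + before : ""}<b>${hit}</b>${after ? after + "…" : ""}`);
  }
  return fragments.join(" ");
}

const sample = "Book an apartment and read a book of nice little fluffy animals";
console.log(excerpt(sample, "book"));
// <b>Book</b> an apartment… …and read a <b>book</b> of nice little…
```

Joining with a space is just one choice; adjacent ellipses could also be collapsed into one if that reads better.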

The RegExp:

Let's say we have a long string and want to match all occurrences of the word book or Book. This regex would do it:

/book/ig  

(i and g are the case-insensitive and global flags)

but we need not only to get book but also some truncated portions of text before and after that match. Let's say 10 characters before and 10 characters after:

/.{0,10}book.{0,10}/ig

. means any character except linebreak, and {minN, maxN} is the quantifier of how many of such characters we want to match.
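
A small sketch of how the `{0,10}` quantifier and the dot behave here, using two made-up input strings (they are assumptions for illustration only): the context is simply shorter when fewer than 10 characters are available, and it never crosses a line break.

```javascript
const rgxp = /.{0,10}book.{0,10}/ig;

// Fewer than 10 characters on either side: the quantifier takes what is there.
const shortContext = "A short book.".match(rgxp);
console.log(shortContext); // [ 'A short book.' ]

// The dot does not match "\n", so the context stops at the line breaks.
const multiLine = "first line\nbook\nlast line".match(rgxp);
console.log(multiLine); // [ 'book' ]
```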

To be able to tell apart the prefix chunk, the match and the suffix chunk, so we can use them separately (e.g. for wrapping the match in <b> bold tags), let's use capturing groups ()

/(.{0,10})(book)(.{0,10})/ig

The above will match both Book and book in

"Book an apartment and read a book of nice little fluffy animals"

To know when to add an ellipsis, we need to make those chunks optional. Let's apply the optional quantifier ? (zero or one) to both context groups:

/(.{0,10})?(book)(.{0,10})?/ig

Now a context group might come back empty (or undefined). Used as a boolean with the conditional operator ?:, it tells you whether to add an ellipsis: ($1 ? "…"+$1 : "")

now what we captured would look like:

Book an apartm
nd read a book of nice l

(I've bolded the queryString just for visuals)

To fix those awkwardly cut words, let's prepend (and append) any number * of non-whitespace characters \S

/(\S*.{0,10})?(book)(.{0,10}\S*)?/ig

The result is now:

Book an apartment
and read a book of nice little

(See the details of the regex above at regex101)
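
To compare the two patterns side by side, here is a short sketch on the same sample sentence. The variable names `cut` and `whole` are made up for this illustration.

```javascript
const sentence = "Book an apartment and read a book of nice little fluffy animals";

// Fixed 10-character context: words at the edges get cut in half.
const cut = sentence.match(/(.{0,10})?(book)(.{0,10})?/ig);
console.log(cut);   // [ 'Book an apartm', 'nd read a book of nice l' ]

// With \S* on both sides the context is extended to whole words.
const whole = sentence.match(/(\S*.{0,10})?(book)(.{0,10}\S*)?/ig);
console.log(whole); // [ 'Book an apartment', 'and read a book of nice little' ]
```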

Let's now convert the regex literal to a RegExp string (escaping the backslash characters and passing our ig flags as the second argument).

new RegExp("(\\S*.{0,10})?(book)(.{0,10}\\S*)?", "ig");

Thanks to the new RegExp constructor, we can now pass variables into the pattern:

var queryString = "book";
var rgxp = new RegExp("(\\S*.{0,10})?("+ queryString +")(.{0,10}\\S*)?", "ig");
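
One caveat when the query comes from user input: characters such as + or ( have a meaning in a regex, which is why the demo at the top runs the query through `regEsc` first. The sketch below illustrates the difference. The sample text and the "c++" query are assumptions picked for the example.

```javascript
// Same escaping helper as in the demo at the top of this answer.
const regEsc = (str) => str.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&");

// Hypothetical sample text and query; "+" is a regex quantifier.
const text = "Is C++ faster than C# for this task? Some say C++ wins.";
const queryString = "c++";

// Unescaped, "c++" is not a valid pattern and new RegExp throws a SyntaxError.
let unescapedThrows = false;
try {
  new RegExp("(" + queryString + ")", "ig");
} catch (err) {
  unescapedThrows = err instanceof SyntaxError;
}

// Escaped, the query is matched literally.
const rgxp = new RegExp("(\\S*.{0,10})?(" + regEsc(queryString) + ")(.{0,10}\\S*)?", "ig");
const hits = text.match(rgxp);

console.log(unescapedThrows); // true
console.log(hits.length);     // 2
```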

Finally, to retrieve and use our three captured groups, we can reference them inside the .replace() replacement string as "$1", "$2" and "$3" (see demos).
Or, for more freedom, we can pass a callback function instead of a replacement string and receive the groups as arguments: .replace(rgxp, function(match, $1, $2, $3){
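
Both forms side by side, as a small sketch on the sample sentence. Uppercasing the match in the callback is only there to show something a replacement string cannot do. The names `viaString` and `viaCallback` are made up for the example.

```javascript
const sentence = "Book an apartment and read a book of nice little fluffy animals";
const rgxp = /(\S*.{0,10})?(book)(.{0,10}\S*)?/ig;

// Replacement string: $1, $2 and $3 refer to the three groups.
// A group that did not participate in the match is inserted as an empty string.
const viaString = sentence.replace(rgxp, "$1<b>$2</b>$3");
console.log(viaString);
// <b>Book</b> an apartment and read a <b>book</b> of nice little fluffy animals

// Callback: unmatched optional groups arrive as undefined, so default them to "".
const viaCallback = sentence.replace(rgxp, (match, $1 = "", $2, $3 = "") =>
  `${$1}<b>${$2.toUpperCase()}</b>${$3}`);
console.log(viaCallback);
// <b>BOOK</b> an apartment and read a <b>BOOK</b> of nice little fluffy animals
```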

Note:

This code will not return overlapping matches. If a second occurrence of the query starts inside the context already consumed by the previous match (the up-to-10 characters of .{0,10} plus the rest of the word), it is not reported as a separate match. For example, searching for "book" in "read a book, then book a table" returns a single match, "read a book, then book", because the second "book" is too close to the first one.
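
A quick sketch of that behaviour, with two made-up sentences (assumptions for illustration): in the first, the second occurrence falls inside the trailing context of the first match; in the second, it is far enough away to be matched on its own.

```javascript
const rgxp = /(\S*.{0,10})?(book)(.{0,10}\S*)?/ig;

// Second "book" starts within the 10 characters after the first one.
const close = "read a book, then book a table".match(rgxp);
console.log(close); // [ 'read a book, then book' ]

// Second "book" is outside the consumed context, so it is matched separately.
const far = "read a book, then much later book a table".match(rgxp);
console.log(far.length); // 2
```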

If the source string has HTML tags in it, make sure (for simplicity's sake) to search through the text content only (not the HTML string) - otherwise a more complicated approach would be necessary.

Useful resources:

https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/RegExp
https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/String/replace
http://www.rexegg.com/regex-quickstart.html

Roko C. Buljan