I’ve indexed the text of PDF files in my database, but sometimes the extracted text is not clean and there are stray spaces inside words:

var text = 'C or P ora te go V ernan C e report M ANA g EMENT bO A r D AND s u PE r V is O r y bO A r D C OMM i TTEE s The Management Board has not currently established any committees.';

I want to make a front-end search engine for my users, but I need to know the START and END position of each match (based on the original text, with the spaces).

I can do that with a regex, for example if I do:

text.toLowerCase().search(/m ? a ? n ? a ? g ? e ? m ? e ? n ? t/);

This finds the word “Management” starting at position 36. Now I want to know the end position of the match (I don’t know how many spaces are inside the word, so I don’t know how many characters it spans), and I want the search to return multiple matches (the start/end position of every result).

Can you help me with that? Again, it’s very important for me to have the start/end position of each word based on the original text; removing the spaces and then searching is not a good solution for me.

I’m also curious to know if I can do that without a regex.

Thank you!

Zlitus
  • 1
    Possible duplicate of [Return positions of a regex match() in Javascript?](https://stackoverflow.com/questions/2295657/return-positions-of-a-regex-match-in-javascript) – Zenoo Mar 15 '18 at 12:50
  • 3
    Looks like you'd be much better off figuring out why whatever code you're using to index the PDFs is mangling the text. – Jim Mischel Mar 15 '18 at 13:21

1 Answer

This demo might help:

> text.toLowerCase().match(/m *a *n *a *g *e *m *e *n *t/)
[ 'm ana g ement',
  index: 36,
  input: 'c or p ora te go v ernan c e report m ana g ement bo a r d and s u pe r v is o r y bo a r d c omm i ttee s the management board has not currently established any committees.' ]

(I modified your regex to use ' *' between each letter, to match any number of spaces including 0. Your ' ? ' example would only match exactly 1 or 2 spaces between each letter.)
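If the search term comes from user input, the same ' *' pattern could be generated rather than typed by hand. This is only a sketch: `buildSpacedRegex` is a hypothetical helper name, and it assumes you want a case-insensitive match (the `i` flag) so that the search runs on the original text instead of a lowercased copy.

```javascript
// Hypothetical helper (not part of any library): build a regex that
// tolerates any number of spaces, including none, between the letters.
function buildSpacedRegex(term, flags = 'i') {
  // Drop whitespace typed by the user, then split into characters.
  const letters = Array.from(term.replace(/\s+/g, ''));
  // Escape regex metacharacters so input such as "a.b" is matched literally.
  const escaped = letters.map((ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  return new RegExp(escaped.join(' *'), flags);
}

const sample = 'C or P ora te go V ernan C e report M ANA g EMENT bO A r D';
const re = buildSpacedRegex('management');

console.log(re.source);         // m *a *n *a *g *e *m *e *n *t
console.log(sample.search(re)); // 36
```

Because the `i` flag does the case folding, the reported index always refers to the original string.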

When the regex matches, the .match method returns the matched text together with its index (as seen above); otherwise it returns null. You should be able to use this to do something along these lines:

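// For plain ASCII text, lowercasing keeps the indexes aligned with the original string.
const matches = text.toLowerCase().match(/m *a *n *a *g *e *m *e *n *t/);
if (matches) {
    const start = matches.index;                       // index of the first matched character
    const end = matches.index + matches[0].length - 1; // index of the last matched character (inclusive)
}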
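
For the multi-match part of the question, something like the following may work. It is a sketch under a few assumptions: `findAllSpaced` is a hypothetical helper name, the runtime supports `String.prototype.matchAll` (ES2020), and the `end` position is inclusive, as in the snippet above.

```javascript
// Hypothetical helper: return { start, end, text } for every occurrence of
// `term` in `text`, allowing any number of spaces between the letters.
function findAllSpaced(text, term) {
  const letters = Array.from(term.replace(/\s+/g, ''));
  if (letters.length === 0) return []; // nothing to search for

  // Escape regex metacharacters, then allow ' *' between each letter.
  const pattern = letters
    .map((ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join(' *');

  // 'g' finds every match; 'i' ignores case without altering the text.
  const re = new RegExp(pattern, 'gi');

  const results = [];
  for (const m of text.matchAll(re)) {
    results.push({
      start: m.index,                  // first matched character
      end: m.index + m[0].length - 1,  // last matched character (inclusive)
      text: m[0],                      // matched slice of the original text
    });
  }
  return results;
}

const text = 'C or P ora te go V ernan C e report M ANA g EMENT bO A r D AND s u PE r V is O r y bO A r D C OMM i TTEE s The Management Board has not currently established any committees.';

const hits = findAllSpaced(text, 'management');
console.log(hits.length); // 2
console.log(hits[0]);     // { start: 36, end: 48, text: 'M ANA g EMENT' }
```

The positions refer to the original string, so `text.slice(hit.start, hit.end + 1)` gives back the exact span to highlight.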
Joe Lafiosca