I have a large text file (more than 852,000 lines) of song verses. Each verse is preceded by a number such as 1., 134-20. or 1231., but not by one that starts with a hash sign, like #45345. Each verse may contain four or more lines. There are also line variants, which should be ignored for now.

This is the code I have been struggling with, with no good results so far:

$.ajax({url:"LD.txt",dataType:'text',success:function(data){
//var lines=data.match(/(.*)\r\n(^[A-Z].*)+/mg);
var lines=data.match(/(.*)(^[A-Z].*)+/mg);
for(var i=0;i<50/*lines.length*/;i++){
var line=lines[i].replace("\r\n","");console.log(i+" "+line);
}}});

This is part of the UTF-8 text file:

/* 1970  #1.#  PAR DZIESMĀM UN DZIEDAŠANU
#1. Dziesmas un dziedašana vispāriga tautas manta un cilvēka mūža pavadoņi.
1.Dziesmas visai Latvijai kopeja manta. */

15.
Dziesmiņ' mana, kā dziedama,
Ne ta mana pamanita;
Vecā māte pamācija,
Aizkrāsnē tupedama.
#279a.

16.
Māci, māte, man' dziedāt,
Māc' ar vienu Dieva dziesmu,
Ko dziedās dvēselite,
Pie Dieviņa aizgājuse.
#15b,27b.
16-1.
Māci mani, māmuliņa,
Jele vienu Dieva dziesmu,
Ko dziedās dvēselite,
Dieva duru dagājuse.
#33d.
4:[Dieva durvim pieiedama].
#137e.
4:[Debes durvim piegājusi].
#111d.
4:[Pie Dieviņa debesīs].
#137c.
16-2.
Māci mani, māmulite,
Māci kādu Dieva dziesmu,
Ko dziedašu, nogājuse
Pie Dieviņa nama durvu.
Ne Dieviņis iekšā laida,
Ne eņģeļi vārtus vēra.
#218.
16-3.
Tēvis, tēvis, māte, māte,
Mācat mani pātaros;
Es āziešu pie Dieviņa
Bez neviena pātariņa.
#172i.

17.
Dzied', māsiņ, dzied', māsiņ,
Vedīs tautas šoruden;
Atstāj savas skaistas dziesmas
Jaunajām māsiņām.
#324a.

18.
Es dziedāt nedziedaju (nevareju),
Sacīt vien pasaciju,
Sacīìt vien pasaciju
Jaunajām māsiņām.
#137d,324a.

19.
Saki dziesmas, bāleliņ,
Jaunakām (Mazajām) māsiņām,
Tu staigaji tālu zemi,
Tu dziesmiņu daudz dzirdeji (zinaji).
#126c,141.
3:[Tu bij' tālu izstaigajis],
#354,367.
19-1.
Stāsti dziesmas, bāleliņ,
Jaunakai māsiņai,
Tu bij' tālu izstaigajis,
Tu bij' daudzi izredzejis.
#137d.

20.
Dziedat, meitas, ar manim,
Man bij (ir) daudz skaistu dziesmu;
Pa vienai salasiju,
Svešu zemi (Svešas zemes) staigadams.
#28d,68a,335 N% 107,325c,400.
1:[Dziedait], meitas ar manim,
#137b N% 2209,391 N% 33.
1:[Dziedait], meitas, ar manim,
2:[Ar ir] daudz skaistu dziesmu;
#41c N% 3 pag 196.
1:Dziedat, meitas [ar maniem],
#6c,68d,88b,95d,104,121n.
2:Man bij (ir) daudz [svešu dziesmu];
#111c,325e,379.
4:Svešu zemi (Svešas zemes) [staigajot].
#98a,226f,314,407b.
4:[Tāļas zemes] staigadams.
#91c,122b.
4:[ļaužu zemi] staigadams.
#3b.
20-1.
Dziedat, meitas, ar manim,
Man ir gan greznu dziesmu;
Svešas zemes izstaigaju,
Pa vienai lasidams.
#293a.
20-2.
Palīdziet man dziedāt,
Man bij daudz skaistu dziesmu;
Pa vienai salasiju
Pa pasauli staigadams.
#91c.
20-3.
Mīlit mani jūs, māsiņas,
Man bij daudz svešu dziesmu,
Es atnācu pār jūriņu
Svešā dziesmu laiviņā.
#146a.
20-4.
Es staigaju tāļu zemi
Man dziesmišu vācelite;
Es savām māsiņām
Svešas dziesmas skandinašu.
#126c.

21.
Skaisti dziedu dziedadama,
Gauži raudu raudadama:
Bāliņš skaisti dziedenaja,
Tautiets gauži raudenaja.
#28f.

22.
Kad es dziedu, koši (grazni) dziedu,
Kad es raudu, žēli raudu.
Kā es koši (grazni) nedziedašu,
Māŗa dziesmu teicejiņa;
Kā es žēli neraudašu,
Kad es biju (augu) sērdienite.
#156,161c,164c,173,335 N% 208.
/*Bitnera L. ļaužu dz.kā motto.*/
2:[Kad raudaju], žēli raudu.
#188.
1:Kad es [dziežu], koši (grazni) [dziežu],
2:Kad es [raužu], žēli [raužu].
4:[Māres dziesmu] teicejiņa;
#185c.
4:[Māras dziesmu] teicejiņa;
#391 N% .

22.
Mīļa Laima, Dieva meit',
Nāc dziesmiņas darināt:
Teic dziesmiņas, dziedi pati
Par jauniem, par veciem.
#335 N% 66.

23.
Teic dziesmiņu (dziesmiņas), sērdienite,
Tu dziesmiņu daudz zinaji;
Ne tev tēva, ne māmiņas,
Dziesmiņās (Dziesmās vien) remdejies.
#10,28d,41c N% 41 pag 196, 68d, 119a, 174c, 184, 224d N% 41, 229a, 276c, 281c, 295a, 319c, 364, 379, 391 N% 8, 403.
4:[Dziesmiņām] remdejies.
#215b.
1:[Dzied' dziesmiņu], sērdienite,
#73b,377.
1:Teic dziesmiņu (dziesmiņas), [bārenite],
#98a,157,401.
2:Tu dziesmiņu [dievszingan];
#156,267b.
2:Tu dziesmiņu [dievsungan];
#88b.
23-1.
Teic man dziesmas, sērdienite,
Tu dziesmiņu daudz zinaji;
Tev nav tēva, māmuliņas,
Dziesmiņās remdejies.
#89b,161c,171a.
4:[Dziesmiņām] remdejies.
#176a N% 3706.
4:[Tu dziesmās] remdejies.
#278c.
3:[Tev nomira tēvs, māmiņa],
#152a.
23-2.
Saki dziesmas, bārenite,
Tev ir daudzi skaistu dziesmu:
Tev nav tēva, māmuliņas,
Dziesmas vien darinaji.
#311a.

24.
Teic man dziesmas, meža meita,
Tu dziesmiņu daudz zinaji,
Tev pateica lakstigala,
Krūmiņā sēdedama.
#214c,325c,348.
1:Teic man dziesmas, [meža māte],
#22h,111d.

25.
Visas manas labas dziesmas
Ceļa vidu aizgājušas;
Gan es citas salasišu
Svešu zemi staigadams.
#73b.

26.
Visas dziesmas izdziedatas,
Kur mēs citas dabusim?
Iesim dziesmu kambarī,
Tur mēs citas dabusim.
#6c,18,379,410.
3:Iesim dziesmu [krodziņā],
#172i,190f.
1:[Ta dziesmiņa pagalam],
#41c.
5:[Tur sēd divas jaunas meitas,]
6:[Dziesmas vien rakstidamas;]
7:[Ko ta viena izrakstija,]
8:[To ta otra izdziedaja.]
#379.
5:[Tur stāveja jaunas meitas],
7:Ko ta viena [nodziedaja],
8:To ta otra [pierakstija],
#68a,335 N% 124.
5:Tur sēd [divi] jaunas meitas,
7:Ko ta viena [sarakstija],
#41c N% 1 pag 195.
26-1.
Ta ziņģite pagalam,
Kur mēs citu dabusim?
Iesim ziņģu krodziņā,
Tur mēs citu dabusim;
Tur bij divas jumpraviņas,
Ziņģes vien darinaja.
#158.
2:Kur mēs citu [dabuisim]?
4:Tur mēs citu [dabuisim];
5:Tur bij [viena jumpraviņa],
#149a N% 1479.
1:[To ziņģiti pabeidzām];
7:[Ko ta viena sadomaja],
8:[To ta otra uzrakstija].
#401.
5:[Tur sēdeja trīs jumpravas,]
6:[Ziņģes vien darinaja;]
7:[Ko tās divas izziņģeja,]
8:[To trešā sarakstija.]
#325e.

/* 4.Dziesmu pūra pielocišana un bagatiba. */

27.
Bāliņos dzīvodama
Dziesmas tinu kamolī (kamolā);
Kad aizgāju (izgāju, nogāju) tautiņās,
Pa vienai šķetinaju.
#401.
4:[Tad vaļā] šķetinaju.
#137d.
1:[Kad es augu bāliņos],
#379,207r.
1:[Kad es augu pie māmiņas],
4:Pa vienai [risinaju].
#230.
1:[Pie māmiņas dzīvodama]
#19d,264c.
1:[Jauns būdams ganos gāju]
4:[No kamoļa] šķetinaju.
#319b N% 309,391 N% 29,109b.
1:[Kad es maza ganos gāju],
#19c.
1:[Pie bāliņa ganos gāju]
4:Pa vienai [risinaju].
#223b.
1:[Kad es biju jauna meita],
#41a,352,389.
4:Pa vienai [ritinaju].
#241f.

27-1.
Es satinu savas dziesmas
Baltā diega kamolī (kamolā);
Kad aizgāju tautiņās,
Pa vienai ritinaju.
#264g,363.
1:Es [ietinu] savas dziesmas
4:Pa vienai [risinaju].
#207m.
3:[Brāļam jāju vedibās],
#232a.
27-2.
Visas savas skaistas dziesmas
Es satinu kamolā,
Kad izgāju tautiņās,
Pa vienai šķetinaju.
#112a.
27-3.
Visas dziesmas izdziedatas,
Nu satinu kamolī;
Kad es iešu tautiņās,
Pa vienai ritinašu.
#219b.
27-4.
Man bij dziesmu kamolits
Smalkā lagzdu krūmiņā;
Kad izgāju (aizgāju) tautiņās,
Pa vienai ritinaju.
#22h.
4:Pa vienai [tecinaju].
#379,412,408 N% 758.
27-5.
Kad es biju jauna meita,
Man bij dziesmu vācelite,
Kad es gāju (aizgāju) tautiņās,
Pa vienai ritinaju.
#172q,196a.
1:[Man iedeva māmulite]
2:[Mazu dziesmu vāceliti],
#181c.
27-6.
Man bij dziesmu vācelite
Smalku nātru krūmiņā;
Kad aizgāju tautiņās,
Pa vienai izdziedaju.
#298d.
2:[Skaidienā glabajama];
4:Pa vienai [darinaju].
#379.
27-7.
Man dziesmiņu trīs pūriņi
Brāļ' apeņu dārziņā;
Kad aiziešu tautiņās,
Pa vienai izdziedašu.
#196i.

I am looking for a JavaScript solution where, when the exact word dziedama is entered in the text input, the result is returned as:

15. Dziesmiņ' mana, kā <b>dziedama</b>, Ne ta mana pamanita; Vecā māte pamācija, Aizkrāsnē tupedama.

That is: the number preceding the verse that contains the search term (the number can be several lines above the matching line, e.g. 23-2. when searching for darinaji), then a tab, then the verse with the matched word in bold.

If part of a word is given with an asterisk, e.g. dzie*, the results should look like this, with the full matching word in bold:

15. <b>Dziesmiņ'</b> mana, kā <b>dziedama</b>, Ne ta mana pamanita; Vecā māte pamācija, Aizkrāsnē tupedama.
16. Māci, māte, man' <b>dziedāt</b>, Māc' ar vienu Dieva <b>dziesmu</b>, Ko <b>dziedās</b> dvēselite, Pie Dieviņa aizgājuse.
16-1.   Māci mani, māmuliņa, Jele vienu Dieva <b>dziesmu</b>, Ko <b>dziedās</b> dvēselite, Dieva duru dagājuse.
...

In a similar way, an asterisk at the beginning of the word, e.g. *esmu, should return verses that contain dziesmu, iesmu, Dievadziesmu etc. The asterisk * stands for a variable number of characters and can appear in any part of the word.

If the query contains a ? alongside letters, e.g. dzied?, it should return verses that contain dziedu, dziedi or the like, i.e. the ? stands for exactly one UTF-8 character and can appear in any part of the word.

If the search query is within double quotes, like "vienu Dieva", it should look for that exact word sequence in the verses.

It should also be possible to search the text, which is full of diacritics, by typing normalized characters without diacritics.

Thank you in advance!

ugisu

1 Answer

Ok, the regex to match an entire verse starting with a number on a line by itself that contains the complete word xxxxx should be:

^[0-9]+\.$(?:.(?!^[0-9]+\.$))+\b(xxxxx)\b.*?(?=^[0-9]+\.$) with the flags gmsu

Where:

  • ^[0-9]+\.$ matches a line with a number
  • (?:.(?!^[0-9]+\.$))+ matches characters that aren't followed by a line with a number
  • \b(xxxxx)\b matches xxxxx as a complete word
  • .*?(?=^[0-9]+\.$) matches the smallest amount that is followed by a line with a number

But there is a problem: \b asserts a position at a word boundary, (^\w|\w$|\W\w|\w\W), where \W and \w are defined as [^a-zA-Z0-9_] and [a-zA-Z0-9_] respectively. That definition is ASCII-only, so it doesn't support Unicode at all: letters such as ā or ņ count as non-word characters.
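A small check makes the problem visible. This is only an illustration; the sample line is taken from verse 16 in the question, and the variable names are made up here.

```javascript
// Sample line taken from verse 16 in the question.
const line = "Ko dziedās dvēselite,";

// \b sees "ā" as a non-word character, so it reports a word boundary
// between "dzied" and "ās" and the fragment matches as a "whole word".
const asciiBoundary = /\bdzied\b/.test(line);

// The u flag on its own does not change the definition of \b or \w.
const withUnicodeFlag = /\bdzied\b/u.test(line);

console.log(asciiBoundary, withUnicodeFlag); // true true
```

So a search for the complete word dzied would wrongly report dziedās as a hit.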

According to "What's the correct regex range for javascript's regexes to match all the non word characters in any script?", the \W equivalent for Unicode would be [^\p{L}\p{N}\p{M}\p{Pc}], so by the same logic \w would be [\p{L}\p{N}\p{M}\p{Pc}].

So if we use look-arounds with those Unicode patterns instead of \b the regex to match an entire verse starting with a number on a line by itself that contains xxxxx should be:

^[0-9]+\.$(?:.(?!^[0-9]+\.$))+(?<=^|[^\p{L}\p{N}\p{M}\p{Pc}])(xxxxx)(?=$|[^\p{L}\p{N}\p{M}\p{Pc}]).*?(?=^[0-9]+\.$) with the flags gmsu

Where:

  • ^[0-9]+\.$ matches a line with a number
  • (?:.(?!^[0-9]+\.$))+ matches characters that aren't followed by a line with a number
  • (?<=^|[^\p{L}\p{N}\p{M}\p{Pc}]) matches a left word boundary considering Unicode
  • (xxxxx) matches xxxxx as a complete word
  • (?=$|[^\p{L}\p{N}\p{M}\p{Pc}]) matches a right word boundary considering Unicode
  • .*?(?=^[0-9]+\.$) matches the smallest amount that is followed by a line with a number
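As a rough sketch, the two look-arounds can be wrapped in a small helper. The names WORD_CLASS, LEFT_BOUNDARY, RIGHT_BOUNDARY and wholeWord are hypothetical, and the sketch assumes a JavaScript engine that supports look-behind and \p{...} property escapes.

```javascript
// Character class body used in place of \w (from the linked answer).
const WORD_CLASS = "\\p{L}\\p{N}\\p{M}\\p{Pc}";
const LEFT_BOUNDARY = `(?<=^|[^${WORD_CLASS}])`;
const RIGHT_BOUNDARY = `(?=$|[^${WORD_CLASS}])`;

// Hypothetical helper: `body` must already be a valid regex fragment.
function wholeWord(body) {
  return new RegExp(LEFT_BOUNDARY + body + RIGHT_BOUNDARY, "gmsu");
}

const line = "Ko dziedās dvēselite,";
console.log(wholeWord("dzied").test(line));   // false
console.log(wholeWord("dziedās").test(line)); // true
```

Unlike \b, these boundaries treat ā as part of the word, so the fragment dzied no longer counts as a complete word.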

But there are problems: It would work if your xxxxx were the exact letters you were looking for but what about the * and ? you wanted to use?

Well, to handle * and ? we need to take the user input (potentially including * and ?) and make regex-y replacements for them:

  1. Acquire the user input
  2. Escape all characters special to regex with a backslash (\)
  3. Replace \? with [\p{L}\p{N}\p{M}\p{Pc}]
  4. Replace \* with [\p{L}\p{N}\p{M}\p{Pc}]+
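Steps 2 to 4 could be sketched in JavaScript roughly as follows. The function name wildcardToPattern and the constant WORD_CHAR are made up for this illustration. Note that, as in step 4, * becomes "one or more" word characters, not "zero or more".

```javascript
// Unicode-aware stand-in for \w, as described above.
const WORD_CHAR = "[\\p{L}\\p{N}\\p{M}\\p{Pc}]";

// Hypothetical helper: turns user input such as "?zied*" into a regex fragment.
function wildcardToPattern(input) {
  // Step 2: escape every character that is special in a regex.
  const escaped = input.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return escaped
    // Step 3: an escaped "?" stands for exactly one word character.
    .replace(/\\\?/g, () => WORD_CHAR)
    // Step 4: an escaped "*" stands for one or more word characters.
    .replace(/\\\*/g, () => WORD_CHAR + "+");
}

console.log(wildcardToPattern("?zied*"));
// [\p{L}\p{N}\p{M}\p{Pc}]zied[\p{L}\p{N}\p{M}\p{Pc}]+
```

Escaping first and replacing afterwards means any other punctuation the user types is matched literally.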

Now we could insert this adjusted input where the xxxxx is in this regex:

^[0-9]+\.$(?:.(?!^[0-9]+\.$))+(?<=^|[^\p{L}\p{N}\p{M}\p{Pc}])(xxxxx)(?=$|[^\p{L}\p{N}\p{M}\p{Pc}]).*?(?=^[0-9]+\.$) with the flags gmsu

Where:

  • ^[0-9]+\.$ matches a line with a number
  • (?:.(?!^[0-9]+\.$))+ matches characters that aren't followed by a line with a number
  • (?<=^|[^\p{L}\p{N}\p{M}\p{Pc}]) matches a left word boundary considering Unicode
  • (xxxxx) matches xxxxx as a complete word
  • (?=$|[^\p{L}\p{N}\p{M}\p{Pc}]) matches a right word boundary considering Unicode
  • .*?(?=^[0-9]+\.$) matches the smallest amount of stuff that is followed by a line with a number

But there are problems: The match groups only include the last occurrence of the input because there is only one capture group in the pattern.

Like https://stackoverflow.com/a/37004214/2193968 says:

With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.

That means that we will only match the last occurrence even if we adjust the regex to have a repeated capture group like this:

^[0-9]+\.$(?:(?:.(?!^[0-9]+\.$))+?(?<=^|[^\p{L}\p{N}\p{M}\p{Pc}])(xxxxx)(?=$|[^\p{L}\p{N}\p{M}\p{Pc}]))+.*?(?=^[0-9]+\.$) with the flags gmsu

Where:

  • ^[0-9]+\.$ matches a line with a number
  • (?: ... )+ matches a repeated group of:
    • (?:.(?!^[0-9]+\.$))+? matches as few characters as possible that aren't followed by a line with a number
    • (?<=^|[^\p{L}\p{N}\p{M}\p{Pc}]) matches a left word boundary considering Unicode
    • (xxxxx) matches xxxxx as a complete word
    • (?=$|[^\p{L}\p{N}\p{M}\p{Pc}]) matches a right word boundary considering Unicode
  • .*?(?=^[0-9]+\.$) matches the smallest amount of stuff that is followed by a line with a number

There is no way to have a single capture group return all the occurrences from one match.
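The behaviour is easy to reproduce on a tiny input. The following is only a demonstration with a made-up sample string; it also shows that a separate global pass with matchAll does return every occurrence, which is the idea behind the recommendation below.

```javascript
const text = "dziedu un dziedi";

// A capture group inside a repeated group keeps only its last value.
const repeated = /(?:.*?(dzied\p{L}*))+/su.exec(text);
console.log(repeated[1]); // dziedi

// A second, global pass over the same text yields each occurrence.
const all = [...text.matchAll(/dzied\p{L}*/gu)].map((m) => m[0]);
console.log(all); // [ 'dziedu', 'dziedi' ]
```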

So what I recommend is:

  1. Acquire the user input

  2. Escape all characters special to regex with a backslash (\)

  3. Replace \? with [\p{L}\p{N}\p{M}\p{Pc}]

  4. Replace \* with [\p{L}\p{N}\p{M}\p{Pc}]+

  5. Insert that adjusted input where the xxxxx is in this regex:

    ^[0-9]+\.$(?:.(?!^[0-9]+\.$))+(?<=^|[^\p{L}\p{N}\p{M}\p{Pc}])xxxxx(?=$|[^\p{L}\p{N}\p{M}\p{Pc}]).*?(?=^[0-9]+\.$) with the flags gmsu

    To see it for the word dziedās, have a look at https://regex101.com/r/eEuk38/1

  6. Store the resulting matches as the output.

  7. Find the bold-able matches with this regex using that output:

    (?<=^|[^\p{L}\p{N}\p{M}\p{Pc}])xxxxx(?=$|[^\p{L}\p{N}\p{M}\p{Pc}]) with the flags gmsu

    To see it for the word dziedās, have a look at https://regex101.com/r/I0Psxp/1

So for example:

  • The input ?zied* would become:

    [\p{L}\p{N}\p{M}\p{Pc}]zied[\p{L}\p{N}\p{M}\p{Pc}]+

  • Find the verses that match with:

    ^[0-9]+\.$(?:.(?!^[0-9]+\.$))+(?<=^|[^\p{L}\p{N}\p{M}\p{Pc}])[\p{L}\p{N}\p{M}\p{Pc}]zied[\p{L}\p{N}\p{M}\p{Pc}]+(?=$|[^\p{L}\p{N}\p{M}\p{Pc}]).*?(?=^[0-9]+\.$) with the flags gmsu

  • Take the matches and bold the words with:

    (?<=^|[^\p{L}\p{N}\p{M}\p{Pc}])[\p{L}\p{N}\p{M}\p{Pc}]zied[\p{L}\p{N}\p{M}\p{Pc}]+(?=$|[^\p{L}\p{N}\p{M}\p{Pc}]) with the flags gmsu
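Put together, the recommended steps might look like the sketch below. Everything named here (searchVerses, WORD, LEFT, RIGHT, NUM, the sample text) is an assumption made for illustration, not a tested solution for the full 852,000-line file.

```javascript
const WORD = "[\\p{L}\\p{N}\\p{M}\\p{Pc}]";
const LEFT = "(?<=^|[^\\p{L}\\p{N}\\p{M}\\p{Pc}])";
const RIGHT = "(?=$|[^\\p{L}\\p{N}\\p{M}\\p{Pc}])";
const NUM = "^[0-9]+\\.$"; // a line holding only a verse number

// Hypothetical helper following steps 1 to 7 above.
function searchVerses(text, query) {
  // Steps 2-4: escape the input, then expand ? and *.
  const body = query
    .replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\\\?/g, () => WORD)
    .replace(/\\\*/g, () => WORD + "+");
  const word = LEFT + body + RIGHT;
  // Steps 5-6: collect every verse that contains the word.
  const verseRe = new RegExp(
    `${NUM}(?:.(?!${NUM}))+${word}.*?(?=${NUM})`, "gmsu");
  const verses = text.match(verseRe) || [];
  // Step 7: bold each occurrence inside the matched verses.
  const wordRe = new RegExp(word, "gmsu");
  return verses.map((v) => v.replace(wordRe, "<b>$&</b>"));
}

// Shortened sample built from the text in the question.
const sample = [
  "15.", "Dziesmiņ' mana, kā dziedama,", "Ne ta mana pamanita;", "#279a.", "",
  "16.", "Māci, māte, man' dziedāt,", "#15b,27b.", "",
  "17.", "Dzied', māsiņ, dzied', māsiņ,", "",
].join("\n");

const results = searchVerses(sample, "dzied*");
console.log(results.length); // 2
console.log(results[0].includes("<b>dziedama</b>")); // true
```

Two limitations follow directly from the regex as written above: a number line such as 16-1. is not recognised by ^[0-9]+\.$, and the last verse in the file is never returned because no number line follows it. The search is also case-sensitive. Adapting the number pattern or adding the i flag is left open here.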

Jerry Jeremiah