
I want to extract the words that don't repeat in the text below, using only regex. I have seen similar questions, such as "Only extract those words from a list that include no repeating letters, using regex" (which is about non-repeating letters) and "Regular Expression :match string containing only non repeating words". The result should be a list of the words that do not repeat, in the natural order in which they occur in the text.

My text as plain text:

Teaching psychology is the part of educational psychology that refers to school education. As will be seen later, both have the same objective: to study, explain and understand the processes of behavioral change that are produce in people as a consequence of their participation in activities educational What gives an entity proper to teaching psychology is the nature and the characteristics of the educational activities that exist at the base of the of behavioral change studied.

My text as a vertical list, one word per line (in case that is easier to work with), produced using the answer to this question.

7beggars_nnnnm
  • You could just pipe the output from the previous question into `sort -u` if you don't mind having the list sorted. – Nick Oct 24 '19 at 23:29
  • If you are still on Arch Linux, you can only consider doing it with Python and its PyPI regex module, because if you want a single regex for this, you will need an infinite-width lookbehind. I'd consider other approaches, like using `uniq`. – Wiktor Stribiżew Oct 24 '19 at 23:29
  • @Nick I tested `sort -u` with the text in vertical list format and it worked, but it didn't work with the plain text. Thanks! – 7beggars_nnnnm Oct 25 '19 at 00:21
  • @WiktorStribiżew I have Python and have used it a few times. I will try to do as you advised, although I have never done it this way before. Grateful :) – 7beggars_nnnnm Oct 25 '19 at 00:22
  • `fmt -1|tr -cd 'A-Za-z\n'|awk '{h[w[++i]=$1]++}END{for(i in w)if(h[w[i]]==1)print w[i]}'` – jhnc Oct 25 '19 at 04:02
  • @DiegoBneiNoah `sort` won't work with the plain text as it operates on a line by line basis. Hopefully you can use it with the vertical list format. – Nick Oct 25 '19 at 05:13
  • Thanks :) @jhnc. `fmt -1|tr -cd 'A-Za-z\n'|awk '{h[w[++i]=$1]++}END{for(i in w)if(h[w[i]]==1)print w[i]}'` worked perfectly with the vertically oriented word list. – 7beggars_nnnnm Oct 25 '19 at 21:53

1 Answer


If you need a pure regex solution, you can only do that with .NET or the Python PyPI regex module, because you need two things regex libraries do not usually feature: 1) right-to-left input string parsing and 2) infinite-width lookbehind.
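
To illustrate the second point, here is a minimal sketch that checks whether Python's built-in `re` module accepts a variable-width lookbehind. The helper name `compiles_with_re` and the two toy patterns are assumptions made only for this illustration.

```python
import re

def compiles_with_re(pattern):
    """Return True if the built-in re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-width lookbehind: accepted by the built-in re module.
print(compiles_with_re(r'(?<!ab)c'))
# Variable-width lookbehind (.*? has no fixed length): rejected by re.
print(compiles_with_re(r'(?<!a.*?)c'))
```

The second pattern has the same shape as the lookbehind needed here, which is why the built-in module is not an option for a single-regex approach.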

Here is a Python solution:

    import regex  # third-party PyPI module: pip install regex

    text = "Teaching psychology is the part of educational psychology that refers to school education. As will be seen later, both have the same objective: to study, explain and understand the processes of behavioral change that are produce in people as a consequence of their participation in activities educational What gives an entity proper to teaching psychology is the nature and the characteristics of the educational activities that exist at the base of the of behavioral change studied."
    # (?r) scans right to left; the lookbehind rejects a word that already occurs earlier in the text
    rx = r'(?rus)(?<!\b\1\b.*?)\b(\w+)\b'
    # Matches are found from end to start, so reverse them to restore the original order
    print(list(reversed(regex.findall(rx, text))))

See an online demo.

Details

  • (?rus) - inline flags: r enables right-to-left input string parsing (each pattern in the regular expression still matches left to right as usual, so the matched texts are not reversed); u makes \w Unicode-aware in Python 2 and is the default in Python 3; s is the DOTALL modifier, which makes . match line breaks
  • (?<!\b\1\b.*?) - no match if, anywhere to the left of the current location, the same text as captured in Group 1 (see later in the expression) occurs as a whole word, followed by any 0+ chars up to the current location
  • \b(\w+)\b - a whole word: 1+ word chars within word boundaries, captured in Group 1.

`reversed` is used to print the words in the original order, because the right-to-left regex matches them from end to start.
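
If a single regex is not a hard requirement, a similar list can be sketched with the standard library alone. This swaps the lookbehind for a plain `\w+` tokenizer plus a dictionary, so it is an alternative and not the pure-regex method above. The function names `first_occurrences` and `occurring_once` and the short sample string are assumptions for illustration. The second function covers the other reading of the question, where a word is kept only if it appears exactly once.

```python
import re
from collections import Counter

def first_occurrences(text):
    """Each distinct word once, in order of first appearance."""
    words = re.findall(r'\w+', text)
    # dict keeps insertion order (Python 3.7+), so this deduplicates in order.
    return list(dict.fromkeys(words))

def occurring_once(text):
    """Only the words that appear exactly once, in original order."""
    words = re.findall(r'\w+', text)
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]

sample = "the part of the base of the change"
print(first_occurrences(sample))  # ['the', 'part', 'of', 'base', 'change']
print(occurring_once(sample))     # ['part', 'base', 'change']
```

Both functions are case-sensitive, so "Teaching" and "teaching" count as different words; lowercasing the text first would merge them.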

Wiktor Stribiżew
  • @WiktorStribiżew I ran `chmod +x list.py` and then ran list.py with exactly your source code, but I received the following error: `Traceback (most recent call last): File "list.py", line 1, in import regex ModuleNotFoundError: No module named 'regex'`. Do I have to install the regex library from PyPI with the `pip` command? – 7beggars_nnnnm Oct 25 '19 at 20:09
  • @DiegoBneiNoah `pip install regex`. – Wiktor Stribiżew Oct 26 '19 at 13:07
  • @WiktorStribiżew Thanks. I was able to install it using this command. When I have time I will test your code again and mark it as resolved if applicable. – 7beggars_nnnnm Oct 26 '19 at 19:50