
I have the text below (in this format), and I want its words separated and listed one per line, in the order they occur, as in this example: https://stackoverflow.com/a/21672824/10824251. I tried `egrep -vi "'?[^\\p{L}']+'?|^'|'$" mytext.txt > output.txt`, but it gave no result: output.txt was empty.

My text:

Teaching psychology is the part of education psychology that refers to school education. As will be seen later, both have the same goal: study, explain and understand the processes of behavioral change that are produce in people as a consequence of their participation in activities educational What gives an entity proper to teaching psychology is the nature and the characteristics of the educational activities that exist at the base of the of behavioral change studied.

My text in Portuguese:

A psicologia do ensino é a parte da psicologia da educacão que se refere à educacão escolar. Como se verá mais adiante, ambas têm um mesmo objetivo: estudar, explicar e compreender os processos de mudanca comportamental que se produzem nas pessoas como uma conseqüência da sua participacão em atividades educativas. O que confere uma entidade própria à psicologia do ensino é a natureza e as caracterís- ticas das atividades educativas que existem na base dos processos de mudanca comportamental estudados.

7beggars_nnnnm
  • @WiktorStribiżew the regex code in the example in Java is the element I wanted to point out in the post. I usually use egrep -vi before, so it would be `egrep -vi "'?[^\\p{L}']+'?|^'|'$" mytext.txt > output.txt` – 7beggars_nnnnm Oct 24 '19 at 21:50
  • Note that regex is to be used in a *`split`* command. `grep` is not splitting, it is extracting. You want something like `grep -oE '[[:alnum:]]+'` or `grep -oE '[[:alpha:]]+' mytext.txt > output.txt` – Wiktor Stribiżew Oct 24 '19 at 21:51
  • And how can I do this without using `split`, so that the code understands where a word starts and ends? – 7beggars_nnnnm Oct 24 '19 at 21:53
  • See https://ideone.com/eB5FFB. Do you need to also get numbers? What way do you want to tokenize the text? – Wiktor Stribiżew Oct 24 '19 at 21:54
  • Thanks, it really works. But I need to work with texts containing diacritics such as the accents `^`, `~`, etc. I ran a test and it breaks the line wherever it finds these marks. I may have to open a new question for this. – 7beggars_nnnnm Oct 24 '19 at 22:06
  • Do you mean you need to split with whitespace? `grep -oE '[^[:space:]]+' mytext.txt > output.txt` Do not post the same question, just explain what criteria the extracting pattern should meet. – Wiktor Stribiżew Oct 24 '19 at 22:09
  • @WiktorStribiżew I have added the text in Portuguese having the accents I mentioned. Please try to execute the same code for this new text and you will see that it breaks the lines where there are accents in the resulting file. – 7beggars_nnnnm Oct 24 '19 at 22:16
  • @WiktorStribiżew `grep -oE '[^[:space:]]+'` solved it for the accents too, sorry for not realizing it before. – 7beggars_nnnnm Oct 24 '19 at 22:18
  • `grep -oP '[\p{L}\p{M}\p{N}]+'` may work, too. – Wiktor Stribiżew Oct 24 '19 at 22:18
  • ThankX:) @WiktorStribiżew. `grep -oP '[\p{L}\p{M}\p{N}]+'` worked correctly too. – 7beggars_nnnnm Oct 24 '19 at 22:25

1 Answer


You may want to tokenize texts by whitespace:

grep -o '[^[:space:]][^[:space:]]*' mytext.txt > output.txt
grep -o '[^[:space:]]\{1,\}' mytext.txt > output.txt
grep -oE '[^[:space:]]+' mytext.txt > output.txt
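
The three commands above express the same pattern in basic, interval, and extended regex syntax. As a minimal sketch of what they produce, the snippet below wraps the extended form in a helper; the name tokenize_ws and the sample sentence are only illustrative and are not part of the original answer.

```shell
# Hypothetical helper (the name is an assumption): read text on stdin and
# print every run of non-whitespace characters on its own line.
tokenize_ws() {
  grep -oE '[^[:space:]]+'
}

# -o makes grep print each match instead of the whole matching line.
printf 'study, explain and understand\n' | tokenize_ws
# study,
# explain
# and
# understand
```

Punctuation stays attached to the neighbouring word (study,) because only whitespace acts as a delimiter here. By contrast, the -v flag used in the question suppresses every line that contains a match rather than printing the matches.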

Or, you may extract all chunks of one or more letters (\p{L}), diacritics (\p{M}) and numbers (\p{N}) with a PCRE regex such as:

grep -oP '[\p{L}\p{M}\p{N}]+' mytext.txt > output.txt

See the online demo. You will need pcregrep on macOS for this to work, since the stock grep there does not support -P.
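
To illustrate how the letter/mark/number class behaves on punctuation, here is a small sketch. It assumes a grep built with PCRE support (the -P flag) and, for non-ASCII input such as the Portuguese text, a UTF-8 locale; tokenize_words is a hypothetical name, not something from the original answer.

```shell
# Hypothetical helper (the name is an assumption): print each run of
# letters, combining marks and numbers found on stdin, one per line.
# Assumes grep supports -P (PCRE); stock BSD/macOS grep does not.
tokenize_words() {
  grep -oP '[\p{L}\p{M}\p{N}]+'
}

printf 'study, explain: 42 times\n' | tokenize_words
# study
# explain
# 42
# times
```

Unlike splitting on whitespace, this drops commas, colons and other punctuation, so it suits word lists where only the words themselves matter.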

Wiktor Stribiżew
  • I use arch linux. Thanks!! All the suggested commands worked, I tested it here. Grateful for suggestions and ideas for testing. – 7beggars_nnnnm Oct 24 '19 at 22:26