
I am using a regular expression to search through a set of files from a corpus. The goal is to find the titles of newspaper articles.

This is what I use:

cat *.txt | grep -P '(^[A-ZÖÄÜÕŠŽ].*[^\.]$)' --colour 

It finds lines that begin with a capital letter, followed by any characters, and that do not end with a dot. That works for these specific files.

The problem is that two files interfere with each other: the dot at the very end of one file ends up directly in front of the first line of the next, and I get this:

Kõik Kataria jüngrid kinnitavad , et nende elu on pärast naeruklubiga liitumist oluliselt paranenud .Kosmosepall teeb maailmareisi 39 kilomeetri kõrgusel.

Is there any way to prevent that interference without modifying the files, or a way to change the regular expression so that the dot at the beginning is excluded? I am a beginner; I looked for solutions, but none of them were specific to my case.

Y. Gf
  • The files probably do not have a newline at the end, so the last line of the first file is merged with the first line of the second. You can try to append a newline on the fly: `find *.txt | xargs -I{} sh -c "cat {}; echo ''" | grep ... ` https://stackoverflow.com/a/44675414/580346 – mrzasa Mar 05 '18 at 11:38
  • Thank you very much for the quick answer, I tried that immediately and it solved my problem like magic. The results now show up as they should, I haven't yet got to learning these commands though, so I will research what exactly every part does. Thanks again. – Y. Gf Mar 05 '18 at 11:48
  • Good, I moved it to an answer; you can accept it (green tick next to the arrows) if it helps you :) – mrzasa Mar 05 '18 at 12:11

1 Answer


The files probably do not have a newline at the end, so the last line of the first file is merged with the first line of the second.
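To see the effect in isolation, here is a small sketch. The directory, file names and contents are made up for illustration, and the helper `ends_without_newline` is a hypothetical name, not something from the question:

```shell
# Hypothetical demo files; names and contents are invented for illustration.
demo_dir=$(mktemp -d)
printf 'Body text of the first file .' > "$demo_dir/a.txt"   # no trailing newline
printf 'Title of the second file\n'    > "$demo_dir/b.txt"   # ends with a newline

# cat copies bytes verbatim, so the two lines run together.
merged=$(cat "$demo_dir/a.txt" "$demo_dir/b.txt")
printf '%s\n' "$merged"

# tail -c 1 prints the last byte of the file. Command substitution strips
# a trailing newline, so a non-empty result means the file does not end
# with one.
ends_without_newline() {
    [ -n "$(tail -c 1 "$1")" ]
}

ends_without_newline "$demo_dir/a.txt" && echo "a.txt lacks a final newline"
```

Running the check over your own files should tell you which ones cause the merged lines.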

You can try to append a newline on the fly:

find *.txt | xargs -I{} sh -c "cat {}; echo ''" | grep -P '(^[A-ZÖÄÜÕŠŽ].*[^\.]$)' --colour

Source: https://stackoverflow.com/a/44675414/580346
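If you would rather not start a separate `sh` for every file, a plain loop should do the same job. This is only a sketch: the function name `cat_with_newlines` is my own invention, and like the command above it adds a newline after every file, including files that already end with one. The extra empty lines do no harm here because the pattern never matches an empty line.

```shell
# Print each file followed by a newline, so the last line of one file
# can never merge with the first line of the next.
# cat_with_newlines is a hypothetical helper name.
cat_with_newlines() {
    for f in "$@"; do
        cat "$f"
        echo
    done
}

# Usage with the pattern from the question:
#   cat_with_newlines *.txt | grep -P '(^[A-ZÖÄÜÕŠŽ].*[^\.]$)' --colour
```

Quoting `"$f"` also keeps file names with spaces intact.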

mrzasa