
I want to split one big text document (.txt) into several smaller ones. The document contains a series of debates in the Spanish parliament. The text is divided into policy initiatives (I'm not sure that is the idiomatic term), and I want one document per initiative. The useful thing is that each initiative has its own title in the following form:

- DEL GRUPO PARLAMENTARIO CATALÁN (CONVERGÈNCIA I UNIÓ), REGULADORA DE LOS HORARIOS COMERCIALES. (Número de expediente 122/000004.)

- DEL DIPUTADO DON MARIANO RAJOY BREY, DEL GRUPO PARLAMENTARIO POPULAR EN EL CONGRESO, QUE FORMULA AL SEÑOR PRESIDENTE DEL GOBIERNO: ¿CÓMO VALORA USTED LOS PRIMEROS DÍAS DE SU GOBIERNO? (Número de expediente 180/000021.)

As you can see, every title is in upper case, starts with a hyphen, and ends with "XXX/XXXXXX.)" (where X is a digit), that is, a file number followed by a dot and a closing parenthesis. No two titles are the same. I have thought of writing a regex that captures those characteristics, so that the titles can serve as delimiters between the debates.

Ideally, I would select each title and the debate below it, up to the next title, and write that to a new document, so that each output file holds one policy initiative with its title and its own debate. Thanks to this community I've got a script that should do this:

awk '/^-.+[0-9]{3}\/[0-9]{6}\.\)$/ {
        if (p) close (p)
        p = sprintf("split%05i.txt", ++i) }
    { if (p) print > "p" }' inputfile.txt

But when I run it (with Cygwin on Windows 10), nothing happens. I thought it was a Windows configuration problem or something similar, but I just tried it in an Ubuntu VM and the same thing happens, i.e., nothing:

$ ls -l
total 228
-rw-rw-r-- 1 ubuntu ubuntu 219166 Jan 30 11:28 tryme.txt
-rwxr-xr-x 1 ubuntu ubuntu   8259 Jan 30 11:24 ubiquity.desktop

$ awk '/^-.+[0-9]{3}\/[0-9]{6}\.\)$/ {
        if (p) close (p)
        p = sprintf("split%05i.txt", ++i) }
    { if (p) print > "p" }' tryme.txt

$ ls -l
total 228
-rw-rw-r-- 1 ubuntu ubuntu 219166 Jan 30 11:28 tryme.txt
-rwxr-xr-x 1 ubuntu ubuntu   8259 Jan 30 11:24 ubiquity.desktop

Any idea what is happening here? Thank you very much.

Descartes
  • 69
  • 9
  • 2
    Run `cat -v` on your input file. Now see https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it and do what it says to get rid of those trailing `^M`s if present. Did that solve your problem? If not - do you have other white space at the end of the title lines you're trying to match? print to `p`, not `"p"` btw. – Ed Morton Jan 30 '19 at 13:08
  • 1
    As you mention Windows, there might be `CRLF` line endings, in which case your regular expression does not match. There might also be extra blanks. You might attempt the following test: `/^-/ && /[0-9]{3}\/[0-9]{6}\.\)[[:blank:]]*\r?$/` – kvantour Jan 30 '19 at 16:21
  • Oh yeah, that was the problem, the Windows text format. Now it works flawlessly. Thank you very much guys, I really appreciate it! Cheers! – Descartes Feb 02 '19 at 12:15
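
For later readers, here is a minimal sketch of how the two suggestions from the comments might be combined: strip a trailing carriage return from every line before matching, and print to the variable `p` rather than to the literal file name `"p"`. The file `sample.txt` and its contents are made up for illustration. The digits are spelled out instead of using `{3}` and `{6}`, and `[ \t]*` stands in for `[[:blank:]]*`, on the assumption that some awk implementations (older mawk, for instance) do not support those forms.

```shell
# Build a tiny sample with Windows (CRLF) line endings.
# The titles and debate lines here are invented for the example.
printf '%s\r\n' \
  '- DEL GRUPO A, SOBRE X. (Número de expediente 122/000004.)' \
  'First debate line.' \
  '- DEL GRUPO B, SOBRE Y. (Número de expediente 180/000021.)' \
  'Second debate line.' > sample.txt

awk '
  # Drop a trailing carriage return so that $ anchors at the real end of text.
  { sub(/\r$/, "") }

  # A title line: starts with "-", ends with NNN/NNNNNN.) plus optional blanks.
  /^-.+[0-9][0-9][0-9]\/[0-9][0-9][0-9][0-9][0-9][0-9]\.\)[ \t]*$/ {
      if (p) close(p)                       # finish the previous output file
      p = sprintf("split%05d.txt", ++i)     # name of the next output file
  }

  # Once the first title has been seen, write every line to the current file.
  p { print > p }
' sample.txt
```

With this sample, the script should produce `split00001.txt` and `split00002.txt`, each holding one title and its debate line. Because the carriage returns are removed before printing, the output files end up with Unix line endings.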

0 Answers