I have .srt files which are in the following format:

0
1
00:00:01,830 --> 00:00:04,740
corresponding text
1

2
00:00:05,280 --> 00:00:10,280
corresponding text
2

3
00:00:10,740 --> 00:00:14,640
corresponding text
3

4
00:00:15,510 --> 00:00:19,260
corresponding text
4

That extra line repeating the subtitle number appears all the way through the file (5, 6, ... 540). I tried the command sed '/^[0-9]/ s/.//' and, as expected, it modifies every line that starts with a digit, but I don't know how to make it remove only the second occurrence of each number.

The expected result is:

0
1
00:00:01,830 --> 00:00:04,740
corresponding text

2
00:00:05,280 --> 00:00:10,280
corresponding text

3
00:00:10,740 --> 00:00:14,640
corresponding text

4
00:00:15,510 --> 00:00:19,260
corresponding text

How can I achieve this with sed, awk, or any other tool that can process files in batches? There are several files with the same problem.

Thanks!

3 Answers

$ awk 'BEGIN{FS=OFS=RS;RS=""} {$NF=""}1' file
0
1
00:00:01,830 --> 00:00:04,740
corresponding text

2
00:00:05,280 --> 00:00:10,280
corresponding text

3
00:00:10,740 --> 00:00:14,640
corresponding text

4
00:00:15,510 --> 00:00:19,260
corresponding text
Ed Morton
  • Huh??? WTF? Ed... ?? What on earth is happening here?? (clueless, but I gotta know...) Why default print everything with a rule `$NF=""`? – David C. Rankin Jan 28 '21 at 03:27
  • @RavinderSingh13 - yes, that makes sense. I had snapped to most of that but did not catch the effect of `$NF=""` -- thinking about it in terms of paragraph and nuking the last field makes perfect sense. – David C. Rankin Jan 28 '21 at 06:15
  • Right, it's just emptying the last line (field) of each block (record/paragraph) then printing the block. – Ed Morton Jan 28 '21 at 12:03
  • I don't know why, but it doesn't work for me. It only removes the last duplicate of the file – jota jota Lopez Jan 28 '21 at 22:34
  • See https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it for what's causing that and how to fix it. – Ed Morton Jan 29 '21 at 00:04
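
As a sketch of what the comments above describe, the one-liner can be spelled out with comments. The function name `strip_last_line_of_blocks` is hypothetical, and the `tr -d '\r'` step is an added assumption for files with CRLF line endings: with CRLF, the separator lines contain a carriage return instead of being empty, so paragraph mode would treat the whole file as one block and only the last number would be removed.

```shell
# Hypothetical helper: reads subtitles on stdin, writes the cleaned text to stdout.
strip_last_line_of_blocks() {
    # Assumption: the input may use CRLF line endings, so drop every CR first.
    tr -d '\r' | awk '
        BEGIN {
            FS = OFS = RS   # RS is still the newline here, so every line is one field
            RS = ""         # paragraph mode: blocks separated by blank lines are records
        }
        { $NF = "" }        # empty the last field, i.e. the last line of the block
        1                   # always-true pattern: print the rebuilt block
    '
}
```

For example, `strip_last_line_of_blocks < input.srt > output.srt` would write a cleaned copy with LF line endings (both file names are placeholders).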

Using awk, you can check whether a line contains exactly one field. If it does, compare it with a variable holding the previous single-field value: skip the line when they match, otherwise store the new value.

awk 'NF == 1 {if (num != "" && $0 == num) next; else num = $0} 1'
Barmar
  • Thanks! Works great. I would really appreciate an explanation of how it works. EDIT: For some reason 1 to 4 work fine, 5 remains doubled, and from 6 to the end it works fine again. – jota jota Lopez Jan 28 '21 at 22:37
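
Since an explanation was requested, here is the same logic expanded with comments, as a minimal sketch; the function name `drop_repeated_single_field` is hypothetical. Note that any one-word subtitle line also has `NF == 1` and overwrites `num`, so the duplicate number after it is kept. That could explain an isolated leftover such as the one reported for subtitle 5, but without the original file this is only a guess.

```shell
# Hypothetical helper: reads the files given as arguments, or stdin if there are none.
drop_repeated_single_field() {
    awk '
        NF == 1 {                            # the line has exactly one field
            if (num != "" && $0 == num) {
                next                         # same as the previous one-field line: skip it
            } else {
                num = $0                     # otherwise remember it for the next comparison
            }
        }
        1                                    # print every line that was not skipped
    ' "$@"
}
```

It could be run as `drop_repeated_single_field file`, matching the other answers.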

A direct translation of your description: remove the duplicate number that appears alone on a line. Print the line if it is not an integer; otherwise print only its first instance.

$ awk 'int($0)!=$0 || !a[$0]++' file

0
1
00:00:01,830 --> 00:00:04,740
corresponding text

2
00:00:05,280 --> 00:00:10,280
corresponding text

3
00:00:10,740 --> 00:00:14,640
corresponding text

4
00:00:15,510 --> 00:00:19,260
corresponding text
karakfa
  • I have the same problem as with the most voted answer: it only removes the last duplicate in the file and I don't know why. – jota jota Lopez Jan 28 '21 at 22:35
  • Can you test with the sample input file you posted in the question (please copy it from the question)? Then we'll know whether the scripts or your actual input file is the culprit. – karakfa Jan 28 '21 at 22:37
  • Apparently it's the file... when I check the file with `file` it says `ASCII text, with CRLF line terminators` – jota jota Lopez Jan 28 '21 at 22:45
  • Use `dos2unix` to remove the CR characters; Unix text files should use only LF as the line terminator. – Barmar Jan 28 '21 at 23:48
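
Putting the comments above together for several files, a minimal sketch could strip the CR characters with `tr` (used here in place of `dos2unix`, in case that is not installed) and then keep only the first occurrence of each digits-only line. The function name `fix_srt_files` and the `.bak` backup suffix are assumptions, and the regex test `/^[0-9]+$/` is a substitution for the `int($0)!=$0` check in the answer.

```shell
# Hypothetical helper: cleans each file named on the command line in place,
# keeping the untouched original next to it as <name>.bak.
fix_srt_files() {
    for f in "$@"; do
        [ -f "$f" ] || continue              # skip unmatched globs and non-files
        cp -- "$f" "$f.bak" || return 1      # keep a backup before overwriting
        # Drop CRs, then print non-numeric lines and the first copy of each number.
        # awk runs once per file, so the seen[] array starts empty every time.
        tr -d '\r' < "$f.bak" |
            awk '!/^[0-9]+$/ || !seen[$0]++' > "$f" || return 1
    done
}
```

It could be called as `fix_srt_files ./*.srt`. Like the one-liner it is based on, it would also drop a subtitle text line that consists solely of a number already seen, so checking the result against the `.bak` copies is advisable.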