I'm trying to replace every nth occurrence of a string in a text file.

Background: I have a huge BibTeX file (called in.bib) containing hundreds of entries, each beginning with "@", but every entry has a different number of lines. I want to write a string (e.g. "#") right before every (let's say) 6th occurrence of "@" so that, in a second step, I can use csplit to split the huge file at "#" into files containing 5 entries each.

The problem is to find and replace every fifth "@".

Since I need this repeatedly, the answer suggested in "printing with sed or awk a line following a matching pattern" won't do the job. Again, I am not looking for just one match but for many of them.

What I have so far:

awk '/^@/ && v++%5 {sub(/^@/, "\n#\n@")} {print > "out.bib"}' in.bib

replaces the 2nd through 5th occurrences (and no more). (By the way, I found and adapted this solution from "Sed replace every nth occurrence". Originally it was meant to replace every second occurrence, which it does.)

And, second:

awk -v p="@" -v n="5" '$0~p{i++}i==n{sub(/^@/, "\n#\n@")}{print > "out.bib"}' in.bib

replaces exactly the 5th occurrence and nothing else. (Adapted from "Display only the n'th match of grep".)

What I need (and am not able to write) is, imho, a loop. Would a for loop do the job? Something like:

for (i = 1; i <= 200; i * 5)
   <find "@"> and <replace with "\n#\n@">
then print

The material I have looks like this:

@article{karamanic_jedno_2007,
    title = {Jedno Kosova, Dva Srbije},
    journal = {Ulaznica: Journal for Culture, Art and Social Issues},
    author = {Karamanic, Slobodan},
    year = {2007}
}

@inproceedings{blome_eigene_2008,
    title = {Das Eigene, das Andere und ihre Vermischung. Zur Rolle von Sexualität und Reproduktion im Rassendiskurs des 19. Jahrhunderts},
    comment = {Rest of lines snippet off here for usability -- as in following entries. All original entries may have a different amount of lines.}
}

@book{doring_inter-agency_2008,
    title = {Inter-agency coordination in United Nations peacebuilding}
}

@book{reckwitz_subjekt_2008,
    address = {Bielefeld},
    title = {Subjekt}
}

What I want is every sixth entry looking like this:

#
@book{reckwitz_subjekt_2008,
    address = {Bielefeld},
    title = {Subjekt}
}

Thanks for your help.

jakr
  • Have you looked at http://stackoverflow.com/a/17914105/1745001? If that doesn't provide the answer, [edit] your question to include concise, testable, sample input and expected output and we can help you. – Ed Morton Jul 28 '16 at 14:54
  • Thanks, but the answer provided does not solve the problem above. Edited my question to make things clearer. – jakr Jul 29 '16 at 07:07

3 Answers

Your code is almost right; I modified it.

To replace every nth occurrence, you need a modulo expression.

Written with explicit parentheses for clarity, the condition is ((i % n) == 0): it is true whenever the match counter i is a multiple of n.

awk -v p="@" -v n="5" ' $0~p { i++ } ((i%n)==0) { sub(/^@/, "\n#\n@") }{ print }' in.bib > out.bib
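
To make the difference between the two conditions visible, here is a minimal sketch. The demo file, the variable names and the simplified action (printing a "#" line instead of calling sub) are assumptions made for illustration, not part of the script above.

```shell
# Hypothetical demo input: twelve one-line "entries", each starting with "@".
tmpdir=$(mktemp -d)
for k in 1 2 3 4 5 6 7 8 9 10 11 12; do
    printf '@entry%s\n' "$k"
done > "$tmpdir/demo.bib"

# i == n is true only while the counter equals 5, so a single entry gets a marker.
once=$(awk -v n=5 '/^@/ { i++ } /^@/ && i == n { print "#" } { print }' "$tmpdir/demo.bib" | grep -c '^#$')

# i % n == 0 is true whenever the counter is a multiple of 5 (5, 10, ...).
every=$(awk -v n=5 '/^@/ { i++ } /^@/ && i % n == 0 { print "#" } { print }' "$tmpdir/demo.bib" | grep -c '^#$')

echo "i == n marks $once entry; i % n == 0 marks $every entries"
```

With n=5 the markers land before the 5th and 10th entries, so the first chunk holds four entries. If every chunk should hold five, a condition such as i % n == 1 && i > 1 would place the marker before the 6th, 11th, ... entry instead.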
sozkul
  • yes, you are right, I updated the answer. – sozkul Jul 28 '16 at 19:23
  • This does exactly what I want, great, and thanks a lot, @sozkul! Could you explain the magic in it? As far as I can see, it differs from the earlier "i==n" only in the string "i%n==0". What exactly does that do? I did not really understand it. – jakr Jul 29 '16 at 07:12

You can easily do the splitting in awk in one step.

awk -v RS='@' 'NR==1{next} (NR-1)%5==1{c++} {print RT $0 > FILENAME"."c}' file

This will create file.1, file.2, etc. with 5 records each, where a record is delimited by @.

karakfa
  • You should mention that's gawk-specific due to `RT` and non-parenthesized right side of output redirection. – Ed Morton Jul 28 '16 at 14:56
  • Thanks. Your approach sounds even better, but I do not get any resulting file (no error message either). Using gawk 4.1.3. – jakr Jul 29 '16 at 07:27

Instead of doing this in multiple steps with multiple tools, just do something like:

awk '/@/ && (++v%5)==1{out="out"++c} {print > out}' file

Untested since you didn't provide any sample input/output.

If you don't have GNU awk and your input file is huge, you'll need to add a close(out) statement right before the out=... assignment to avoid having too many files open simultaneously.
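
Two details tend to trip up this kind of script: close(out) has to be a separate statement (ended by a semicolon or newline) before the assignment, and any lines before the first entry have no output file to go to. The following is a minimal sketch of one way to handle both. The sample file and its contents are hypothetical, and the names out1, out2, ... simply follow the one-liner above.

```shell
# Hypothetical sample: a leading comment line, then seven short entries.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{
    printf '%% exported bibliography\n\n'
    for k in 1 2 3 4 5 6 7; do
        printf '@book{key%s,\n    title = {Title %s}\n}\n\n' "$k" "$k"
    done
} > in.bib

awk '
    # At the 1st, 6th, 11th, ... entry, switch to a new output file.
    /^@/ && (++v % 5) == 1 {
        if (out != "") close(out)   # separate statement, ended by the newline
        out = "out" (++c)
    }
    # Lines before the first entry have no output file yet, so skip them.
    out != "" { print > out }
' in.bib

first=$(grep -c '^@' out1)
second=$(grep -c '^@' out2)
echo "out1: $first entries, out2: $second entries"
```

Anchoring the pattern with ^@ instead of matching @ anywhere is an assumption about the input; it avoids counting lines that merely contain an "@" somewhere in the middle, such as an e-mail address.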

Ed Morton
  • Thanks. Awk says: 1. syntax error at the equality sign in string `out="out"++c` when I prepend the suggested `close(out)`. 2. (FILENAME=in.bib FNR=1) Fatal: Der Ausdruck für eine Umlenkung mittels »>« ist ein leerer String Which reads like: The expression for a redirect with ">" is an empty string. ? I use gawk 4.1.3. – jakr Jul 29 '16 at 07:36
  • You made a mistake copy/pasting the script, as the script I posted **will not** produce a syntax error in any awk. If you edit your question to include the script you ran and the error it produced, we can help you debug it. The input you've now shared with us makes this a far simpler problem, since awk has a specific RS to use when records are separated by blank lines, but I don't understand why you're still focused on prepending `#` signs to the records with awk to prepare for calling split later when you've been shown that awk can simply split the file instead. – Ed Morton Jul 29 '16 at 13:26
  • Hej, thanks a lot, @ed-morton, for your constant effort on this problem! I really appreciate it. I did not say you did it wrong; I just can't get it to work. This is very likely because I am not as deep into awk programming as I would need to be to understand what you suggested. I understand it only on a very, very basic level. That is why I still cannot get your suggestion running, and why I stuck to the csplit solution. That works for me because I can handle it. This is my very first contact with awk, and I just wanted my problem solved, even if in a dirty way. Thumbs up and thanks! – jakr Jul 30 '16 at 21:36
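
For readers following the remark about blank-line-separated records: setting RS to the empty string puts awk into paragraph mode, so each entry becomes one record. Below is a minimal sketch under the assumption that entries never contain blank lines themselves. The sample data and the part1.bib, part2.bib, ... names are hypothetical.

```shell
# Hypothetical sample: a comment paragraph, then twelve blank-line-separated entries.
pdir=$(mktemp -d)
cd "$pdir" || exit 1
{
    printf '%% exported bibliography\n\n'
    for k in 1 2 3 4 5 6 7 8 9 10 11 12; do
        printf '@book{key%s,\n    title = {Title %s}\n}\n\n' "$k" "$k"
    done
} > in.bib

# RS="" makes every blank-line-separated paragraph one record;
# ORS="\n\n" writes the blank line back between entries.
awk -v RS='' -v ORS='\n\n' -v n=5 '
    /^@/ && (++v % n) == 1 {          # 1st, 6th, 11th, ... entry
        if (out != "") close(out)
        out = "part" (++c) ".bib"     # hypothetical output naming
    }
    out != "" { print > out }         # skip paragraphs before the first entry
' in.bib

for f in part*.bib; do
    printf '%s: %s entries\n' "$f" "$(grep -c '^@' "$f")"
done
```

Because whole entries are records here, no "#" marker or second csplit pass is needed; the chunk size is controlled by the n variable alone.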