Hi, I'm trying to split an XML file that contains item tags. As I have 250 items in a single file, I would like to divide the whole file into 5 smaller files containing 50 items (and their content) each.

This is what I got from the question "Linux script: how to split a text into different files with match pattern":

awk '{if ($0 ~ /<item>/) a++} { print > ("NewDirectory"a".xml") }'

However, this split the whole file into one file per item. I need help modifying this command so that it writes one file per 50 items.

Community
  • Mind giving a [mcve]? – sjsam Aug 04 '16 at 14:44
  • If you're trying to recreate properly formed `xml` files, you'll need a lot more code than this. And because `xml` and regular expressions can never "play together" without a problem (even if you can solve this particular problem), you're laying the groundwork for disappointment for your managers at a later date, when you're saddled with an XML problem so advanced that it must be solved with an XML-aware tool. And as @sjsam indicates, your Q needs small sample inputs, expected output, your current code, and error messages. – shellter Aug 04 '16 at 14:47
  • Why "small sample inputs"? If you solve your problem for 1 file with 4 lines creating 2 x 2 line files, you can work it out for your real problem, right? Good luck. – shellter Aug 04 '16 at 14:48
  • @shellter I know, just taking 'small' steps. Not doing this for anyone, just trying to learn awk. – Nikita Maximov Aug 04 '16 at 14:53
  • Well, if you're just trying to learn `awk`, you'll do better to find another learning project. The road to xml mastery via awk is impassable. Most (all) unix utilities are designed to process a line of data at a time. `xml` has a very different set of organizing principles, i.e., a million nested elements on one line, OR each "element" on a line by itself or separated by 2-100-n blank lines, are all perfectly legal. – shellter Aug 04 '16 at 15:04
  • Just click on the `awk` tag at the bottom of your Q, and look through some of those Qs. http://stackoverflow.com/questions/38765092/moving-average-with-successive-elements-using-awk is actually well defined and has a well-commented answer. In the future, use the linked example as a model for asking Qs. It always helps to include such meta-goals as "I'm just trying to understand X"; otherwise people will want to give you a more (unixly) efficient answer. :-) Kudos to you for learning such a great programming language ;-) Good luck. – shellter Aug 04 '16 at 15:08
  • As @shellter points out, `awk` one-liners cannot parse arbitrary valid xml. You should edit the OP to include an assumption such as, e.g., "each line will contain at most one `<item>` tag, and if it does, it also contains the string and the associated closing tag". – Matei David Aug 04 '16 at 15:43

2 Answers

Assuming your original command does what you say it does and you fully understand the issues around trying to parse XML with awk:

awk '/<item>/ && (++a%50 == 1) { ++c } { print > ("NewDirectory"c".xml") }'

You might need to add a close() in there if you have a lot of files open simultaneously and aren't using GNU awk. Just get gawk.
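For an awk that isn't gawk, the close() could go in roughly like this. This is only a sketch under assumptions: the sample input (120 three-line items in a hypothetical items.xml, written to a scratch directory) is made up for illustration, and the `out` variable is not part of the one-liner above. It closes each output file before starting the next, so only one file is open at a time.

```shell
# Work in a scratch directory so the sample files don't clutter anything.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample input: 120 items, each spread over three lines.
awk 'BEGIN { for (i = 1; i <= 120; i++) printf "<item>\n  <id>%d</id>\n</item>\n", i }' > items.xml

# Start a new output file at items 1, 51, 101, ... and close the previous one first.
awk '
BEGIN { out = "NewDirectory.xml" }   # where lines before the first <item> go, as in the one-liner
/<item>/ && (++a % 50 == 1) {
    close(out)                       # harmless if nothing was written to it yet
    out = "NewDirectory" (++c) ".xml"
}
{ print > out }
' items.xml

# Count the items that ended up in each output file.
grep -c "<item>" NewDirectory*.xml
```

With this sample, the last file holds the 20 leftover items, and no NewDirectory.xml is created because nothing precedes the first <item>.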

Also, to learn awk read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

Ed Morton
  • Thanks, no, not using GNU awk. What is the difference between awk and gawk? – Nikita Maximov Aug 04 '16 at 15:24
  • awk is to horse as gawk is to clydesdale. An "awk" is a tool that manipulates text with an implicit read loop and condition/action syntax. There are many awk tools out there (old awk, new awk, the one true awk, mawk, tawk, gawk, OSX awk, /usr/xpg4/bin/awk, etc.) of which GNU awk is the one with the most functionality that is currently supported/available. There is a POSIX standard for awk, so many of the awk variants will do what POSIX defines at a minimum but will also have additional functionality. Some awk variants don't even support POSIX and so should be avoided. Get GNU awk, gawk. – Ed Morton Aug 04 '16 at 15:32

Try:

awk '$0~/<item>/' | split -l50 -d - NewDirectory.

Explanations:

  • awk will extract only those lines that contain <item>

  • split will split stdin into files with 50 lines, named NewDirectory.00, NewDirectory.01, etc. See man split for more info.
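To see what this pipeline does, here is a small sketch on made-up input. It assumes every item sits entirely on one line, which the question does not guarantee; the file name oneline.xml and the `<channel>` wrapper are hypothetical. Note that `-d` (numeric suffixes) is a GNU split option.

```shell
# Work in a scratch directory so the sample files don't clutter anything.
cd "$(mktemp -d)" || exit 1

# Hypothetical input: a wrapper element plus 120 items, one complete item per line.
awk 'BEGIN {
    print "<channel>"
    for (i = 1; i <= 120; i++) printf "<item><id>%d</id></item>\n", i
    print "</channel>"
}' > oneline.xml

# Keep only the lines containing <item>, then cut the stream every 50 lines.
awk '/<item>/' oneline.xml | split -l50 -d - NewDirectory.

# Count the items in each piece.
grep -c "<item>" NewDirectory.*
```

The wrapper lines are dropped by the filter, so the pieces contain item lines only. If an item spans several lines, only its opening-tag line would survive.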

Matei David
  • I think the intent is to create files of 50 multi-line item records, not files of just the 50 lines that contain the item start tags. – Ed Morton Aug 04 '16 at 15:11
  • The `awk` filter will print the entire line that passes the test, not just `$1`. – Matei David Aug 04 '16 at 15:13
  • That's right, that's why I said lines that **contain** the start tags. There's no reason to think the whole item will be on a single line though. – Ed Morton Aug 04 '16 at 15:20
  • Oh, I see. I was just providing a quick-fix for the attempt in the OP. Anything that spans multiple lines will require more elaborate parsing, e.g. [xml parsing in python](http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python). – Matei David Aug 04 '16 at 15:24
  • Thanks! Yes David is correct, however it was my poorly written question (no examples of desirable outcome or input file), but thanks for the response anyway. – Nikita Maximov Aug 04 '16 at 15:26
  • @EdMorton: Any `awk` one-liner will rely on line structure which is not guaranteed to hold in an arbitrary xml file. E.g., the one you refer to will fail if multiple `<item>` entries appear on the same line. Now, _if_ the `<item>` appears only in `$1`, then either this or that answer works. – Matei David Aug 04 '16 at 15:32
  • Correct, but we're not talking about arbitrary XML files here. The OP told us in the question that the problem with what she has is that it generates one file per item; therefore **her specific input** is formatted such that the solution I wrote will solve her problem, whereas what you wrote would only solve her problem if the entire `<item>...</item>` is always all on one line, hence my original comment. – Ed Morton Aug 04 '16 at 15:55