Split large text file using AWK, given specific parameters

Question

Hi, I'm trying to split an XML file that contains item tags. I have 250 items in a single file, and I would like to divide it into 5 smaller files containing 50 items (and their content) each.

This is what I got from the linked question, "Linux script: how to split a text into different files with match pattern":

awk '{if ($0 ~ /<item>/) a++} { print > ("NewDirectory"a".xml") }'

However, this split the file into one file per item. I need help modifying this command so that it produces one file per 50 items.

If you're trying to recreate properly formed `xml` files, you'll need a lot more code than this. And because `xml` and regular expressions can never "play together" without a problem (even if you can solve this particular problem), you're laying the groundwork for disappointing your managers at a later date, when you're saddled with an XML problem so advanced that it must be solved with an XML-aware tool. And as @sjsam indicates, your Q needs small sample inputs, expected output, your current code and error messages. – shellter, Aug 04 '16 at 14:47
Why "small sample inputs"? If you solve your problem for 1 file with 4 lines, creating two 2-line files, you can work it out for your real problem, right? Good luck. – shellter, Aug 04 '16 at 14:48
@shellter I know, just taking 'small' steps. Not doing this for anyone, just trying to learn awk. – Nikita Maximov, Aug 04 '16 at 14:53
Well, if you're just trying to learn `awk`, you'd do better to find another learning project. The road to XML mastery via awk is impassable. Most (all) Unix utilities are designed to process one line of data at a time. `xml` has a very different set of organizing principles, i.e. ` `, a million nested elements on one line, OR each "element" on a line by itself or separated by 2-100-n blank lines, are all perfectly legal. – shellter, Aug 04 '16 at 15:04
Just click on the `awk` tag at the bottom of your Q and look through some of those Qs. http://stackoverflow.com/questions/38765092/moving-average-with-successive-elements-using-awk is actually well defined and has a well-commented answer. In the future, use the linked example as a model for asking Qs. It always helps to include such meta-goals as "I'm just trying to understand X", otherwise people will want to give you a more (unixly) efficient answer. :-) Kudos to you for learning such a great programming language ;-) Good luck. – shellter, Aug 04 '16 at 15:08
As @shellter points out, `awk` one-liners cannot parse arbitrary valid xml. You should edit the OP to include an assumption such as, e.g., "each line will contain at most one `<item>` tag, and if it does, it also contains the string and the associated closing tag". – Matei David, Aug 04 '16 at 15:43

Ed Morton · Accepted Answer · 2016-08-04T15:24:28.637

1

Assuming your original command does what you say it does and you fully understand the issues around trying to parse XML with awk:

awk '/<item>/ && (++a%50 == 1) { ++c } { print > ("NewDirectory"c".xml") }'

You might need to add a close() in there if you have a lot of files open simultaneously and aren't using GNU awk. Just get gawk.
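A minimal sketch of what adding close() could look like, not necessarily what the answer had in mind. The sample file `items.xml`, the variables `n` and `out`, and the group size of 3 (instead of 50) are assumptions made for the demo. The grouping condition is written as `a++ % n == 0`, which starts a new file on the same items as `++a % n == 1` but also works for `n=1`.

```shell
# Hypothetical sample: 7 items, 3 lines each, written to a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 1 2 3 4 5 6 7; do
    printf '<item>\n  <title>t%s</title>\n</item>\n' "$i"
done > items.xml

# Start a new output file on every n-th <item> line and close the previous
# one first, so only one output file is open at any time.
# Assumption: lines before the first <item> are skipped here, whereas the
# one-liner above would write them to NewDirectory.xml.
awk -v n=3 '
    /<item>/ && (a++ % n == 0) {
        if (out != "") close(out)
        out = "NewDirectory" (++c) ".xml"
    }
    out != "" { print > out }
' items.xml

ls NewDirectory*.xml
```

With `n=50` and the real input in place of `items.xml`, this should produce `NewDirectory1.xml` through `NewDirectory5.xml` for a 250-item file, under the same assumption as the answer: each `<item>` start tag sits on its own line.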

Also, to learn awk read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

edited Aug 04 '16 at 15:24

answered Aug 04 '16 at 15:07

Ed Morton


Thanks, no, not using GNU awk. What is the difference between awk and gawk? – Nikita Maximov Aug 04 '16 at 15:24
awk is to horse as gawk is to clydesdale. An "awk" is a tool that manipulates text with an implicit read loop and condition/action syntax. There are many awk tools out there (old awk, new awk, the one true awk, mawk, tawk, gawk, OSX awk, /usr/xpg4/bin/awk, etc.) of which GNU awk is the one with the most functionality that is currently supported/available. There is a POSIX standard for awk, so many of the awk variants will do what POSIX defines at a minimum but will also have additional functionality. Some awk variants don't even support POSIX and so should be avoided. Get GNU awk, gawk. – Ed Morton Aug 04 '16 at 15:32

Matei David · Answer 2 · 2016-08-04T15:36:09.480

0

Try:

awk '$0~/<item>/' | split -l50 -d - NewDirectory.

Explanations:

awk will extract only those lines that contain <item>
split will split stdin into files with 50 lines, named NewDirectory.00, NewDirectory.01, etc. See man split for more info.
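To make the behaviour of this pipeline concrete, here is a small sketch on made-up data. The file `items.xml`, its contents, and the chunk size of 2 (instead of 50) are assumptions for the demo, and `-d` (numeric suffixes) is assumed to be available, as it is in GNU split.

```shell
# Hypothetical sample: 5 items, each one entirely on a single line,
# preceded by a header line that does not contain <item>.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{
    printf '<channel>\n'
    for i in 1 2 3 4 5; do
        printf '<item><title>t%s</title></item>\n' "$i"
    done
} > items.xml

# Keep only the lines that contain <item>, then cut that stream into
# 2-line files named NewDirectory.00, NewDirectory.01, ...
awk '/<item>/' items.xml | split -l 2 -d - NewDirectory.

ls NewDirectory.*
```

Only the lines matching `<item>` survive the filter, so this gives complete items only when each item occupies exactly one line; any lines of a multi-line item that lack the start tag are dropped.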

edited Aug 04 '16 at 15:36

answered Aug 04 '16 at 14:57

Matei David


I think the intent is to create files of 50 multi-line item records, not files of just the 50 lines that contain the item start tags. – Ed Morton Aug 04 '16 at 15:11
The `awk` filter will print the entire line that passes the test, not just `$1`. – Matei David Aug 04 '16 at 15:13
That's right, that's why I said lines that **contain** the start tags. There's no reason to think the whole item will be on a single line though. – Ed Morton Aug 04 '16 at 15:20
Oh, I see. I was just providing a quick-fix for the attempt in the OP. Anything that spans multiple lines will require more elaborate parsing, e.g. [xml parsing in python](http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python). – Matei David Aug 04 '16 at 15:24
Thanks! Yes David is correct, however it was my poorly written question (no examples of desirable outcome or input file), but thanks for the response anyway. – Nikita Maximov Aug 04 '16 at 15:26
@EdMorton: Any `awk` one-liner will rely on line structure which is not guaranteed to hold in an arbitrary xml file. E.g., the one you refer to will fail if multiple `<item>` entries appear on the same line. Now, _if_ the `<item>` appears only in `$1`, then either this or that answer works. – Matei David Aug 04 '16 at 15:32
Correct, but we're not talking about arbitrary XML files here. The OP told us in the question that the problem with what she has is that it generates one file per item, therefore **her specific input** is formatted such that the solution I wrote will solve her problem, whereas what you wrote would only solve her problem if the entire `<item>...</item>` is always all on one line, hence my original comment. – Ed Morton Aug 04 '16 at 15:55
