2

This should be a simple task ... !

I have a directory with a number of html files. Each one has a div with a class called crumb. I want to split each file into two at crumb. Later, I'll concatenate the second part of the split file with a new beginning part.

So I tried this, to split all the html files - actually two files called news.html and about.html for the moment - on the pattern crumb:

find *.html -exec csplit - /crumb/ {} \;

But I have this response:

csplit: ‘about.html’: invalid pattern
csplit: ‘news.html’: invalid pattern

Why are the file names being interpreted as a pattern?

Benjamin W.
AndrewUK

2 Answers

1

You can get insight into the problem by adding 'echo' in front of the command:

find *.html -exec echo csplit - /crumb/ {} \;

Which will show

csplit - /crumb/ about.html
csplit - /crumb/ news.html

Running those commands interactively produces the error from the question: csplit: ‘about.html’: invalid pattern

The csplit man page shows the usage 'csplit [OPTION]... FILE PATTERN...', indicating that the first parameter should be the file name, followed by the pattern(s). The command generated by the script above puts the file name AFTER the pattern, so csplit takes '-' as the file and treats the file name as a second pattern.
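To make the argument order concrete, here is a minimal sketch run in a throwaway directory. The sample about.html and the stdin_ prefix are made up for illustration, and the long option assumes GNU csplit.

```shell
# Work in a temporary directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample page; only the line containing "crumb" matters here.
printf '%s\n' '<html>' '<body>' '<div class="crumb">Home</div>' \
    '<p>text</p>' '</body>' '</html>' > about.html

# File first, then the pattern (-s suppresses the byte counts).
# xx00 gets the lines before the first match, xx01 the match and the rest.
csplit -s about.html /crumb/
ls xx*

# "-" in the FILE position means standard input; the pattern still follows it.
csplit -s --prefix=stdin_ - /crumb/ < about.html
ls stdin_*
```

Both calls should produce the same two pieces; the only difference is where csplit reads its input from.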

Proposed fix:

find *.html -exec  csplit  {} /crumb/ \;

# OR, with a unique prefix for every file and a 3-digit suffix
find *.html -exec csplit --prefix {} --suffix-format='%03d' {} /crumb/ \;

Which will execute:

csplit about.html /crumb/
csplit news.html /crumb/

It is not possible to tell whether this generates the requested output (splitting the files as needed), as the input files were not provided.

dash-o
  • Thank-you. That's really helpful, especially the idea of using echo to provide further information. I don't fully understand the {} argument. I had assumed it was a terminator for the -exec but your answer suggests that it is something else? – AndrewUK Oct 03 '19 at 16:29
  • The find command will execute the named command for each input file, replacing the '{}' with the name of the file. You can use the '{}' to control the location of the input file in the generated command. – dash-o Oct 03 '19 at 16:41
  • Ahhh okay. Thanks. I had thought that the minus character '-' in my original query was standard in, and that this would provide the file contents. But now I understand {} - the filename produced by each iteration of find - and that's what's required. Thanks again. – AndrewUK Oct 03 '19 at 16:55
  • Happy to help, and welcome to Stack Overflow. If this answer or any other one solved your issue, please mark it as accepted. – dash-o Oct 03 '19 at 17:31
  • Hi @dash-o Indeed this fix works. Although it's not quite what I'd expected! It over-writes the files xx01 xx00 each iteration so that there's only one set of files. I'd expected that the split files would have unique names thus for each split file then there would be two files. So say I have two .html files in the working directory then I'd have four files resulting: xx00 xx01 xx03 xx04. But that's not what the code does. Have you any suggestions how I might work around this? Thank-you for your help and assistance. Andrew – AndrewUK Oct 04 '19 at 20:27
  • Okay, I think I need to pipe the output to xargs where I rename the xx00 to be the [filename]xx00.html for example. So... | xargs mv xx00 '{}xx00' /; Would that work? – AndrewUK Oct 04 '19 at 21:16
  • Based on csplit documentation, you have two options that can help you: '-f prefix' and '-b suffix'. You can replace the 'xx' prefix with distinct prefixes using 'cscript -b '{}' {} /crumb' – dash-o Oct 05 '19 at 04:43
  • When using find *.html -exec cscript -b '{}' {} /crumb/ \; then I have this error given: find: ‘cscript’: No such file or directory. But twice as there are two html files. – AndrewUK Oct 05 '19 at 09:53
  • By the way, if you think I should post this as a new question then do tell me as I am new here! Thanks. – AndrewUK Oct 05 '19 at 09:53
  • Oops - typo 'csplit -b '{}' {} /crumb' and not 'cscript -b ...' – dash-o Oct 05 '19 at 10:48
  • I think we're getting closer ... find *.html -exec csplit -b '{}' {} /crumb \; csplit: missing % conversion specification in suffix csplit: missing % conversion specification in suffix – AndrewUK Oct 06 '19 at 13:12
  • find *.html -exec csplit -b --suffix-format='%d03{}' {} /crumb/ \; this gives me some output but it's not quite right!! – AndrewUK Oct 06 '19 at 13:15
  • ls gives me: about.html , news.html , 'xx--suffix-format=003about.html' , 'xx--suffix-format=003news.html' , 'xx--suffix-format=103about.html' , 'xx--suffix-format=103news.html' – AndrewUK Oct 06 '19 at 13:16
  • The -b flag needs the name of the input ('{}') passed in: 'find *.html -exec csplit -b {} --suffix-format='%d03{}' {} /crumb/' – dash-o Oct 06 '19 at 16:30
  • This works: find *.html -exec csplit -b {} --suffix-format='%d03{}' {} /crumb/ \; (includes terminating characters) – AndrewUK Oct 06 '19 at 18:30
  • Now I have these files after executing the above: about.html , news.html , xx003about.html , xx103about.html , xx003news.html xx103news.html – AndrewUK Oct 06 '19 at 18:37
  • Happy to help, and welcome to Stack Overflow. If this answer or any other one solved your issue, please mark it as accepted. – dash-o Oct 06 '19 at 18:39
  • @AndrewUK I updated the csplit command to use proper flags (--suffix-format, --prefix). – dash-o Oct 07 '19 at 04:38
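For anyone following the flag discussion above: in GNU csplit, -f (--prefix) replaces the default 'xx' prefix, while -b (--suffix-format) takes a printf-style format for the piece number. A rough sketch of giving every input its own output names is below. It assumes GNU find and GNU csplit; the sample pages and the '.part%02d' suffix are made up for illustration.

```shell
# Work in a temporary directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample pages standing in for the real HTML files.
for name in about news; do
    printf '%s\n' "<html><title>$name</title>" \
        '<div class="crumb">Home</div>' '</html>' > "$name.html"
done

# find substitutes each file name for {}.
# --prefix (-f) uses that name instead of "xx";
# --suffix-format (-b) formats the piece number, here as .part00, .part01, ...
find . -maxdepth 1 -name '*.html' \
    -exec csplit -s --prefix {} --suffix-format='.part%02d' {} /crumb/ \;

ls
```

Because the pieces end in .part00 and .part01 rather than .html, a second run of the same find command would not pick them up as inputs.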
0

The synopsis of the csplit command is

csplit [OPTION]... FILE PATTERN...

but you use

csplit - PATTERN FILE

where - is "read from standard input" (instead of a file), and then FILE is interpreted as a pattern. Instead:

find -name '*.html' -exec csplit {} /crumb/ \;

Notice that *.html should be single quoted, or the shell expands it before find sees it.
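A small sketch of the difference, using made-up file names in a temporary directory. It writes `find .` with an explicit starting point, which also works with find implementations that require one.

```shell
# Work in a temporary directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir sub
touch about.html news.html sub/deep.html

# Unquoted: the shell expands *.html first, so find only receives
# about.html and news.html as starting points.
find *.html

# Quoted: find receives the literal pattern and matches it against every
# file name in the tree, including sub/deep.html.
find . -name '*.html'
```

The first command lists two files, the second lists three.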

Benjamin W.
  • Thank-you Benjamin. I hadn't understood that order of expansion, and the importance of the single quotes. – AndrewUK Oct 03 '19 at 16:33
  • Hi @Benjamin W. I tried this with the single quoted *.html and the results which I had was: No such file or directory. By the way, this was after I'd also applied the fix below. – AndrewUK Oct 04 '19 at 20:31
  • @AndrewUK Did you use `find '*.html'`? Notice the difference between the way you did it in your question, `find *.html` (must not be quoted!), and the way I did it, `find -name '*.html'` (should be quoted). – Benjamin W. Oct 04 '19 at 23:54
  • `find *.html` will only act on HTML files in the current directory, and you could just do something like `for f in ./*.html; do csplit "$f" /crumb/; done` instead. `find -name '*.html'` will run on any `.html` file in the directory tree. – Benjamin W. Oct 04 '19 at 23:56
  • Apologies, I should have said that all the html files are in the current directory. Thank-you for the suggestion re. recursive find though. Could you advise me (as a beginner on Stack Overflow) if I should raise another question in regard to this, or continue 'here' with an elaboration of my question? – AndrewUK Oct 05 '19 at 09:29
  • @andrewuk If the elaboration is complex enough to stand on its own, I'd ask a new (but self-contained) question. If it's just a detail, I'm happy to answer in a comment. Thanks for checking! – Benjamin W. Oct 06 '19 at 03:22
  • Hi Benjamin W., This works fine with your correction. I have two files as output - xx00 and xx01 with the first file being the text before the /crumb/ and the second the text after it. Thank-you. But I actually have many html files and the consequence of this command sequence is that only the last html file is split. What I thought I could do is to pipe the output to cat where I can combine the second part of the html file - the part after crumb with a new first part (a file called newfp.txt) and then write it to a subdirectory with the same name as the original file. Does that make any sense? – AndrewUK Oct 06 '19 at 13:48
  • @AndrewUK In that case, I'd loop instead of trying to contort `find -exec` into doing it; something like `for f in *.html; do csplit "$f" /crumb/; mkdir "${f%.html}"; cat newfp.txt xx01 > "${f%.html}/$f"; done` – Benjamin W. Oct 06 '19 at 21:08
  • Is that correct? The code seems to create a directory (mkdir) with the name of the file - ie lots of directories. Shouldn't creating the output directory be outside of the loop? Thanks. – AndrewUK Oct 08 '19 at 14:26
  • Oh yes, I see the confusion - my description above says 'and then write it to a subdirectory with the same name as the original file'. I meant the output file had the same name rather than the output directory having the same name as the file! For clarity, call the output directory outputdir or something similar! – AndrewUK Oct 08 '19 at 14:29
  • How about: mkdir outputdir; for f in *.html; do csplit "$f" /crumb/; cat newfp.txt xx01 > outputdir/"$f"; done – AndrewUK Oct 08 '19 at 14:34
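Putting the last few comments together, a complete version of that loop might look like the sketch below. The sample pages and the contents of newfp.txt are invented so that it runs on its own; outputdir and newfp.txt are the names used in the comments.

```shell
# Work in a temporary directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical inputs: two pages plus the replacement first part.
for name in about news; do
    printf '%s\n' "<html><title>old $name</title>" \
        '<div class="crumb">Home</div>' "<p>$name body</p>" '</html>' > "$name.html"
done
printf '%s\n' '<html><title>new header</title>' > newfp.txt

# Create the output directory once, outside the loop.
mkdir -p outputdir

for f in *.html; do
    # Split into xx00 (before the first "crumb" line) and xx01 (from it onward).
    csplit -s "$f" /crumb/
    # New first part + original second part, saved under the original name.
    cat newfp.txt xx01 > "outputdir/$f"
done

# Remove the leftover pieces from the last iteration.
rm -f xx00 xx01

cat outputdir/news.html
```

Each iteration overwrites xx00 and xx01, which is harmless here because xx01 is consumed by cat before the next file is split.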