
I want to run grep on HTML files to find lines longer than x characters and truncate what is displayed, using grep.

What I know

To find lines longer than 100 characters in HTML files:

find . -name '*.html' -print | xargs grep -on '.\{100\}'

To find lines matching title and limit the display to 40 characters on either side of the match:

find . -name '*.html' -print | xargs grep -onE '.{0,40}title.{0,40}'

What I don't know

How can I find lines that exceed 100 characters and then display those lines limited to 40 characters?


MCVE

I have a bunch of HTML files that look like this:

$ cat 1.html
abcdefghijklmnopqrstuv12345675689
12345675689abcdefghijklmnopqrstuv
abcd1234

Now, I'd like to find lines longer than 20 characters, and then cut the display to 15 characters only.

Expected output with favoretti's solution

$ find . -name '*.html' -print | xargs grep -on '.\{20\}' | cut -c -15
./1.html:1:abcd
./1.html:2:1234

./2.html:1:abcd
./2.html:2:1234
piroot
  • This is definitely possible with Awk, although I'm not sure what your exact requirement is - a [mcve] would help. – Tom Fenech Dec 21 '17 at 08:48
  • Almost, I'm not sure if the "Expected output" you posted is exactly what you want. – Tom Fenech Dec 21 '17 at 08:58
  • @TomFenech +++ Or, if the expected output is correct, then please edit the question, because the expected output is different from what you asked... – clt60 Dec 21 '17 at 08:59
  • Also, based on your most recent comment below the current answer, it seems like there are probably some line breaks in the HTML (which of course, is perfectly valid, and a suggestion that you might be better off using a tool that understands HTML). – Tom Fenech Dec 21 '17 at 09:00
  • I've updated the example; in the example, there are two identical HTML files. The expected output is right: the criterion is to find lines longer than 20 characters and then, when displaying them, limit each line to 15 characters. Although I'd have liked 15 characters without including filename and line number, but that would do. – piroot Dec 21 '17 at 09:02
  • @piroot wrt `truncate the display using grep` you're confused about what grep is for - grep is to find the string matching a regexp and print that string (`g/re/p`). It is not for modifying the found string before printing it - if you need to do that then you need sed or awk or some non-standard-UNIX tool. – Ed Morton Dec 21 '17 at 10:09
  • Can you [edit] your question to clarify what your expected output is? Is it 15 characters including a file name and line number or a file name plus 15 characters of the matching string? Do you even want line numbers to be output? I (and others) thought you wanted "filename:15 chars of string" but now your expected output is "15 chars of filename:line number:string". – Ed Morton Dec 21 '17 at 14:42
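
As a rough sketch of the sed route mentioned in the comments above (it was not posted in the thread itself), the following selects lines longer than 20 characters and prints only their first 15 characters, with no file name or line number. The `demo_dir` directory and the copy of the sample file are hypothetical stand-ins for the question's data.

```shell
# Hypothetical demo directory holding a copy of the question's sample file.
demo_dir=$(mktemp -d)
printf '%s\n' \
    'abcdefghijklmnopqrstuv12345675689' \
    '12345675689abcdefghijklmnopqrstuv' \
    'abcd1234' > "$demo_dir/1.html"

# For lines with at least 21 characters (i.e. longer than 20), replace the
# line with its first 15 characters and print it; -n suppresses everything else.
sed_out=$(sed -n '/.\{21\}/s/^\(.\{15\}\).*/\1/p' "$demo_dir/1.html")
printf '%s\n' "$sed_out"

rm -rf "$demo_dir"
```

On the sample file this should print `abcdefghijklmno` and `12345675689abcd`; the short third line is skipped.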

3 Answers

4

First of all, it's worth mentioning that unless you're very confident that you can treat your "HTML" files as a series of line-separated records, you should probably be using an HTML-aware tool (either standalone, or included in a scripting language).

Since you mentioned Awk in an earlier comment:

find . -name '*.html' -exec awk '
    length($0) > 20 { print FILENAME, substr($0, 1, 15) }' {} +

This matches lines with length greater than 20 and prints the first 15 characters. I put the file name at the start; you can remove that if you like.
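
Since it is unclear whether the 15-character limit should cover the `file:line:` prefix, here is a sketch of both readings. It assumes a grep-style `file:line:text` layout is wanted; `demo_dir`, `text_only` and `whole_record` are hypothetical names, and the sample file is a copy of the one in the question. `FNR` is Awk's per-file line number.

```shell
# Hypothetical demo directory holding a copy of the question's sample file.
demo_dir=$(mktemp -d)
printf '%s\n' \
    'abcdefghijklmnopqrstuv12345675689' \
    '12345675689abcdefghijklmnopqrstuv' \
    'abcd1234' > "$demo_dir/1.html"

# Variant 1: keep the full "file:line:" prefix and cut only the line text
# to 15 characters.
text_only=$(cd "$demo_dir" && find . -name '*.html' -exec awk '
    length($0) > 20 { print FILENAME ":" FNR ":" substr($0, 1, 15) }' {} +)

# Variant 2: cut the whole "file:line:text" record to 15 characters.
whole_record=$(cd "$demo_dir" && find . -name '*.html' -exec awk '
    length($0) > 20 { print substr(FILENAME ":" FNR ":" $0, 1, 15) }' {} +)

printf '%s\n' "$text_only" "$whole_record"
rm -rf "$demo_dir"
```

On the sample file, variant 1 should give `./1.html:1:abcdefghijklmno` and `./1.html:2:12345675689abcd`, while variant 2 should give `./1.html:1:abcd` and `./1.html:2:1234`, which is the layout shown in the question's expected output.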

It's not clear whether you need find for a recursive search or not - if not, then you might be fine with letting the shell generate the list of files:

awk 'length($0) > 20 { print FILENAME, substr($0, 1, 15) }' *.html

And with globstar enabled (shopt -s globstar), you can use **/*.html for recursive matching in Bash.

Tom Fenech
  • Hmm, looking at the OP's expected output - she seems to want the truncation to include the FILENAME, so maybe add a `print substr(FILENAME":"$0, 1, 15)` case. – Ed Morton Dec 21 '17 at 10:12
  • @Ed yes, although then in a comment we have "Although I'd have liked 15 characters without including filename and line number", so I don't know! – Tom Fenech Dec 21 '17 at 10:15
  • Yeah I don't know any more either so I added a comment asking for clarification from the OP. – Ed Morton Dec 21 '17 at 14:43
2

If for some reason you want to just use grep

find . -name '*.html' -exec grep -oP '^.{40}(?=.{60})' {} /dev/null \;
123
  • It's also possible it'll core dump or have other results which is why the man page describes `-P` as "highly experimental" and that option has been removed from some greps. – Ed Morton Dec 21 '17 at 14:44
  • @EdMorton Used it for years, never had or seen reported problems from its use (other than a few bugs which you get with anything). I can 100% guarantee no unexpected results from my command. Which greps has it been removed from? – 123 Dec 21 '17 at 15:16
  • OSX grep is one I'd heard about. There are various examples on SO of `grep -P` core dumping but it's hard to search for or I'd provide links. Personally I'd just avoid anything that the providers themselves describe as "highly experimental". Not just experimental - **highly** experimental! – Ed Morton Dec 21 '17 at 15:27
  • Here's someone talking about `-P` being removed from OSX grep: https://stackoverflow.com/q/16658333/1745001. Here's a core dump report about GNU grep -P: https://lists.gnu.org/archive/html/bug-grep/2006-01/msg00014.html. Here's another: https://bugzilla.redhat.com/show_bug.cgi?id=1167766. And another: https://cygwin.com/ml/cygwin/2014-01/msg00248.html. The points are: a) `-P` may or may not work today (with different results depending on your input file contents!) and b) `-P` may or may not be removed in the next version of the tool! – Ed Morton Dec 21 '17 at 15:36
  • @EdMorton Reading the linked question, it appears -P wasn't removed; they just switched the grep in OSX from GNU to an old BSD version, so I don't think the flag has been removed from the actual version of grep, and I find it unlikely it will ever be removed from GNU grep. Bugs happen with everything ¯\_(ツ)_/¯ – 123 Dec 21 '17 at 15:46
  • Right, people in that thread were guessing about why OSX replaced shiny new GNU grep with old BSD grep - I would not be surprised if the bug reports they were getting had something to do with it. Right again, bugs happen with everything but they're far more likely to happen with something the provider tells you is **highly experimental** than with anything else. And at the end of the day you simply never NEED to use `grep -P` so what's the point of using it? – Ed Morton Dec 21 '17 at 15:56
0

The first grep works ok I suppose, so if you want to print out just 40 chars, pipe it through cut?

find . -name '*.html' -print | xargs grep -on '.\{100\}' | cut -c 1-40
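
If the file name and line number should not count toward the limit, one option (a sketch, not part of the original answer) is to drop them before cutting. This assumes a grep that supports `-h`, which GNU and BSD grep do; `demo_dir` and the sample file are hypothetical stand-ins for the question's data. The pattern uses 21 rather than 20 so that only lines strictly longer than 20 characters match.

```shell
# Hypothetical demo directory holding a copy of the question's sample file.
demo_dir=$(mktemp -d)
printf '%s\n' \
    'abcdefghijklmnopqrstuv12345675689' \
    '12345675689abcdefghijklmnopqrstuv' \
    'abcd1234' > "$demo_dir/1.html"

# -h drops the file-name prefix and omitting -n drops the line number,
# so cut counts only characters of the line itself.
# Without -o, grep prints each matching line once, in full, before the cut.
truncated=$(find "$demo_dir" -name '*.html' -exec grep -h '.\{21\}' {} + | cut -c 1-15)
printf '%s\n' "$truncated"

rm -rf "$demo_dir"
```

On the sample file this should print `abcdefghijklmno` and `12345675689abcd`.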
favoretti
  • Thanks, that was helpful. I was wondering if there was a built-in grep way without needing to pipe to `cut`? – piroot Dec 21 '17 at 08:40
  • Don't think so, you'll either end up piping through another `grep` that will do `-o '.\{40\}'` or use `cut`. You could potentially use awk to do the processing without find, but that's yet another completely different solution. – favoretti Dec 21 '17 at 08:44
  • BTW, `grep -n` will give you the matching line number in the output, so the 40 chars will not be just the line itself, but will also include the line number and a `:` symbol. – favoretti Dec 21 '17 at 08:45