
I am trying to check the length of the second field of a TSV file (hundreds of thousands of lines), but my script runs very slowly. I suspect the problem is with "echo", but I'm not sure how to fix it.

Input file:

prob    name
1.0     Claire
1.0     Mark
...     ...
0.9     GFGKHJGJGHKGDFUFULFD

I need to print the lines where something went wrong in the name. I tested with a small example using "head -100" and it worked, but it just can't cope with the original file.

This is what I ran:

for line in `cat filename | cut -f2`;do
    length=`echo -n $line | wc -m`
    if [ "$length" -gt 10 ];then
        echo $line
    fi
done
Luca

3 Answers


Try this:

cat file.tsv | awk '{if (length($2) > 10) print $0;}'

This should be faster, since all of the processing is done by a single awk process, while your solution starts 2 processes per loop iteration just to make that comparison. Over hundreds of thousands of lines, that process startup cost dominates the run time.
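
If the loop has to stay in the shell, the per-iteration processes can be avoided by letting the shell measure the length itself. The following is only a sketch: the temporary file and its contents are made-up sample data, and the variable names (`tsv`, `tab`, `long_names`) are chosen for illustration. Note that `${#name}` counts characters in bash, but some other shells count bytes.

```shell
# Hypothetical sample data standing in for the real TSV file.
tsv=$(mktemp)
printf 'prob\tname\n1.0\tClaire\n1.0\tMark\n0.9\tGFGKHJGJGHKGDFUFULFD\n' > "$tsv"

# A literal tab character, used as the field separator for read.
tab=$(printf '\t')

# Read the file once, line by line. ${#name} is expanded by the shell
# itself, so no echo or wc process is started for each line.
long_names=$(
    while IFS="$tab" read -r prob name; do
        if [ "${#name}" -gt 10 ]; then
            printf '%s\t%s\n' "$prob" "$name"
        fi
    done < "$tsv"
)

printf '%s\n' "$long_names"
rm -f "$tsv"
```

Even in this form a shell loop is usually slower than awk on large files, so it is mainly useful for seeing where the time went in the original script.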

Igor S.K.

We can use awk if that helps.

awk '{if(length($2) > 10){print}}' filename

Here $2 is the second field of the current line of filename, and awk evaluates the condition once for every line. This is faster than the shell loop.

Shravan Yadav

awk to the rescue:

awk 'length($2)>10' file

This will print all lines whose second field is longer than 10 characters.

Note that it doesn't require any block statement {...} because if the condition is met, awk will by default print the line.
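
One caveat worth checking against the real data: by default awk splits fields on any run of spaces or tabs, so a name containing spaces would be cut at its first word. A hedged variant that splits on tabs only and skips the header row might look like the sketch below; the sample file, the extra "Anna Maria Smith" row, and the variable names are invented for the illustration.

```shell
# Hypothetical sample data; the second row has a name containing spaces.
tsv=$(mktemp)
printf 'prob\tname\n1.0\tClaire\n0.8\tAnna Maria Smith\n0.9\tGFGKHJGJGHKGDFUFULFD\n' > "$tsv"

# -F '\t' makes awk split on tabs only, so $2 is the whole name.
# NR > 1 skips the header line; a bare condition prints matching lines.
awk_long=$(awk -F '\t' 'NR > 1 && length($2) > 10' "$tsv")

printf '%s\n' "$awk_long"
rm -f "$tsv"
```

With the default separator, the "Anna Maria Smith" row would not be reported, because $2 would be just "Anna".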

oliv
  • Your script is correct but a few words to explain would be useful. Otherwise, your answer may as well be "`awk` to the rescue, and ask again on Stack Overflow next time you want to do anything". – Tom Fenech Mar 22 '18 at 11:08
  • Thanks! I don't really know about awk, but it seems quite useful. I'll learn about it. – Luca Mar 22 '18 at 16:24