I have a text file with the following data:

Name    mobile  url message text
test11  1234567890  www.google.com  "Data Test New
Date:27/02/2020
Items: 1
Total: 3
Regards
ABC DATa
Ph:091 : 123456789"
test12  1234567891  www.google.com  "Data Test New one
Date:17/02/2020
Items: 26
Total: 5
Regards
user test
Ph:091 : 433333333"

As you can see, the last column contains newline characters, so when I run the command below

awk 'END{print NR}' file.txt

it reports 15 lines, but the actual number of lines (records) is 3. Please suggest a command that counts them correctly.

Edit: The script below, based on the answer given, does not work if there is no newline at the end of the input file:

awk -v RS='"[^"]*"' '{gsub(/\n/, " ", RT); ORS=RT} END{print NR "\n"}' test.txt 

Also, my file may have 3-4 million records, so converting it to Unix format would take time and is not my preference. Please suggest an efficient solution that works in both cases.

head 5.csv | cat -A  
The above command gives me this output:

Name mobile url message text^M$
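
The `^M` before the `$` in that output is a carriage return, which means the file uses Windows-style CRLF line endings. As a minimal sketch, the number of such lines can be checked in a single streaming pass without converting the file; the function name `count_crlf_lines` is made up for illustration:

```shell
# Hypothetical helper: print how many lines of the file end with a carriage return.
count_crlf_lines() {
  awk '/\r$/ { c++ } END { print c + 0 }' "$1"
}
```

For example, `count_crlf_lines 5.csv` prints 0 for a file that has only Unix (LF) line endings.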

anubhava
user13000875
  • You cannot have a line with a newline character inside it: you automatically create a new line (which is the whole idea). – Dominique Nov 27 '20 at 09:43
  • Does this answer your question? [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) – rethab Nov 27 '20 at 09:44
  • So you want to know the number of newline characters outside `""`? – Daweo Nov 27 '20 at 09:47
  • @Daweo I want to count the number of lines in this file, and for the given data it should give 3 – user13000875 Nov 27 '20 at 09:48
  • @user13000875, we understand that you want to count the number of lines, but our question is: what is the logic for counting multiple lines (separated by newlines) as 1 line? Kindly make that clear in your question. – RavinderSingh13 Nov 27 '20 at 09:49
  • Are your columns separated by tabs? – Shawn Nov 27 '20 at 10:52
  • What you're calling a line isn't a line in POSIX terms because a line is a string that ends in a newline and therefore cannot contain a newline. That's what's confusing everyone. What you should be calling it instead is a record and then you can define what you mean by a record, e.g. a string of text ending in a newline that can contain newlines within quoted fields. – Ed Morton Nov 27 '20 at 15:19

3 Answers

Using gnu-awk you can do this using a custom RS:

awk -v RS='"[^"]*"' '{gsub(/(\r?\n){2,}/, "\n"); n+=gsub(/\n/, "&")}
END {print n}' <(sed '$s/$//' file)

15001

Here:

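  • -v RS='"[^"]*"': Uses this regex, which matches a double-quoted string, as the input record separator, so quoted text (including any newlines in it) is never part of a record
  • gsub(/(\r?\n){2,}/, "\n"): Collapses runs of two or more consecutive line breaks into a single \n, so blank lines are not counted
  • n+=gsub(/\n/, "&"): Replaces each \n with itself; gsub returns the number of replacements, which is added to n
  • END {print n}: Prints n at the end
  • sed '$s/$//' file: Intended to add a newline to the last line (in case it is missing)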

Code Demo
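
The same idea (newlines inside double quotes do not end a record) can also be sketched with any POSIX awk by tracking whether the current line is inside an open quote, so no regex RS or gawk extension is needed. This is only an illustration under a few assumptions: the function name `count_records` and the variable `in_quotes` are made up, every quoted field is properly closed, literal quotes inside a field are doubled (`""`), and blank lines outside quotes are not counted.

```shell
# Hypothetical helper: count records, treating newlines inside "..." as data.
count_records() {
  awk '
    {
      line = $0
      sub(/\r$/, "", line)                 # tolerate CRLF line endings
      if (!in_quotes && line == "") next   # skip blank lines outside quotes
      q = gsub(/"/, "&", line)             # number of double quotes on this line
      if (q % 2) in_quotes = !in_quotes    # an odd count opens or closes a quoted field
      if (!in_quotes) n++                  # the record ends on this line
    }
    END { print n + 0 }
  ' "$1"
}
```

Called as `count_records file.txt`, this reads the input line by line, so memory use does not grow with the file size. A missing newline at the end of the file does not change the count, because awk still reads the final unterminated line as a record. For the sample data in the question it prints 3 (the header plus two records).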

anubhava
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/225762/discussion-on-answer-by-anubhava-count-number-of-line-in-txt-file-when-new-line). – Samuel Liew Dec 10 '20 at 08:13

With perl, assuming the last line always ends with a newline character:

$ perl -0777 -nE 'say s/"[^"]+"(*SKIP)(*F)|\n//g' ip.txt
3
  • -0777 to slurp entire input file as a single string, so this isn't suitable if the input file is very large
  • the s command returns number of substitutions made, which is used here to get the count of newlines
  • "[^"]+"(*SKIP)(*F) will cause newlines within double quotes to be ignored

You can use the command below if you want to count the last line even if it doesn't end with a newline character:

perl -0777 -nE 'say scalar split /"[^"]+"(*SKIP)(*F)|\n/' ip.txt
Sundeep

Same idea as anubhava's answer, but with GNU sed:

<infile sed '/"/ { :a; N; /"$/!ba; s/\n/ /g; }' | wc -l

Output:

3
Thor