
There is a file:

Mary 
Mary 
Mary 
Mary 
John 
John 
John 
Lucy 
Lucy 
Mark

I need to get

Mary 
Mary 
Mary 
John 
John 
Lucy

I cannot figure out how to get the lines ordered according to how many times each line is repeated in the text, i.e. with the most frequently occurring lines listed first.

tes389

2 Answers


If your file is already sorted (most frequent words at the top, repeated words only on consecutive lines) – your question makes it look like that's the case – you could reformulate your problem as: "Skip a word when it is encountered for the first time". Then a possible (and efficient) awk solution would be:

awk 'prev==$0{print}{prev=$0}'

or, if you prefer an approach that looks more familiar when coming from other programming languages:

awk '{if(prev==$0)print;prev=$0}'
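
Run against the sample input (assuming it is saved as names.txt; the name is only for illustration):

awk 'prev==$0{print}{prev=$0}' names.txt

This prints every line one time fewer than it occurs, which is exactly the expected output.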

Partially working solutions follow below. I'll keep them for reference, as maybe they are helpful to somebody else.

If your file is not too big, you could use awk to count identical lines and then output each group the number of times it occurred, minus 1.

awk '
{ lines[$0]++ }
END {
  for (line in lines) {
    for (i = 1; i < lines[line]; ++i) {
      print line
    }
  }
}
'
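
The same program as a one-liner, again assuming the input file is called names.txt:

awk '{lines[$0]++} END {for (line in lines) for (i = 1; i < lines[line]; ++i) print line}' names.txt

Note that for (line in lines) visits the array keys in an unspecified order, which is why the next step sorts explicitly.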

Since you mentioned that the most frequent line must come first, you have to sort first:

sort | uniq -c | sort -nr | awk '{count=$1;for(i=1;i<count;++i){$1="";print}}' | cut -c2-

Note that the latter will reformat your lines (e.g. collapsing/squeezing repeated spaces). See "Is there a way to completely delete fields in awk, so that extra delimiters do not print?"
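
If preserving the original spacing matters, here is a sketch of an alternative that strips the count prefix with sub() instead of clearing $1, so the rest of the line is printed untouched:

sort | uniq -c | sort -nr | awk '{c=$1; sub(/^[[:space:]]*[0-9]+ /, ""); while (--c) print}'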

knittl

Don't sort for no reason:

nawk '_[$-__]--'

gawk '__[$_]++' 
mawk '__[$_]++'
Mary 
Mary 
Mary 
John 
John 
Lucy 
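
If the underscore-only names feel opaque, the same filter with a descriptive array name behaves identically – the name carries no meaning to awk; seen[$0]++ is 0 (false) the first time a line appears and positive afterwards:

awk 'seen[$0]++'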

For 1 GB+ files, you can speed things up a bit by preventing FS from splitting the record into unnecessary fields:

mawk2 '__[$_]++' FS='\n'

For 100 GB inputs, one idea would be to use parallel to create, say, 10 instances of awk, piping the full 100 GB to each instance but assigning each of them a particular range to partition on their end (e.g. instance 4 handles lines beginning with F-Q, etc.). But instead of outputting it all and THEN attempting to sort the monstrosity, what one could do is simply have each instance tally up and only print a frequency report of how many copies ("Nx") of each unique line ("Lx") have been recorded.

From there one could sort the much smaller report file on the column holding the Nx counts, THEN pipe it to one more awk that prints out the right number of copies of each line Lx; a minimal sketch follows below.

That is probably a lot faster than trying to sort 100 GB.
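
Here is a minimal sketch of that tally-then-expand idea for a single partition; the file name is illustrative, and the final awk prints count-minus-one copies of each line so the result matches the filter above:

awk '{n[$0]++} END {for (l in n) if (n[l] > 1) print n[l], l}' partition.txt |
  sort -k1,1nr |
  awk '{c=$1; sub(/^[0-9]+ /, ""); while (--c) print}'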

I created a test scenario by cloning 71 shuffled copies of a raw file with these stats:

 uniq rows = 8125950. | UTF8 chars = 160950688. | bytes = 160950688.

 i.e. 8.12 million unique rows spanning 154 MB

… resulting in a 10.6 GB test file:

  in0: 10.6GiB 0:00:30 [ 354MiB/s] [ 354MiB/s] [============>] 100%            
  rows = 576942450. | UTF8 chars = 11427498848. | bytes = 11427498848.

Even when using just a single instance of awk, it finished filtering the 10.6 GB in ~13.25 minutes, which is reasonable given that it's tracking 8.1 million unique hash keys.

  in0: 10.6GiB 0:13:12 [13.7MiB/s] [13.7MiB/s] [============>] 100%            
 out9: 10.5GiB 0:13:12 [13.6MiB/s] [13.6MiB/s] [<=> ]

 ( pvE 0.1 in0 < testfile.txt | mawk2 '__[$_]++' FS='\n' )

  783.31s user 15.51s system 100% cpu 13:12.78 total


  5e5f8bbee08c088c0c4a78384b3dd328  stdin
RARE Kpop Manifesto
  • Care to explain a bit more? What's the difference between `__` and `_`? What's a negative field number (or is `$-` a special variable)? How does it make the most frequent terms go to the top? What's the difference between nawk, gawk, mawk, and mawk2? What does the parallel call look like? Is it portable/POSIX? – knittl Jan 05 '23 at 07:51
  • @knittl : that's exactly why I used `__` in one and `_` in the other: to show that there's absolutely nothing special about either of them, and that neither is a special reserved variable like in `perl` or in a shell. The `posix` spec says `"An uninitialized value shall have both a numeric value of zero and a string value of the empty string."`, so as long as `_` remains uninitialized (i.e. never explicitly assigned a value), `$_`, `$+_`, and `$-_` all mean `$0`. I had to add the extra unary sign there for `nawk` because, unlike the others, it strangely errors on `$_` ….. – RARE Kpop Manifesto Jan 05 '23 at 09:41
  • Sorry, I don't get it. If it is identical to `$0`, then why not simply use `$0`? Or variable _names_ that actually convey meaning instead of cryptic symbols? – knittl Jan 05 '23 at 09:53
  • @knittl : first of all, let's get one thing straight: there's `posix-portable` and then there's `reality-portable`. `gawk`, `mawk/2`, and `nawk` are just different implementations of `awk`, the way `Chrome`, `Safari`, and `Edge` are all just different implementations of an interpreter for `HTML`, `Javascript`, and the whole slate of related web standards. `gnu-parallel` isn't officially spec'ed by `posix`, but it's available on a huge variety of systems - practically every `Linux`. I don't let `posix` tie my hands behind my back just because they're afraid of hurting the feelings of long-dead OSes like Ultrix – RARE Kpop Manifesto Jan 05 '23 at 09:54
  • @knittl : ….. so one had to get around it via `$-_` -> `$(-_)` -> `$(-0)` -> `$0`, coercing it to numeric `0` without assigning it a value. It's not referencing a negative field at all, since `negative zero` still means `zero`. `IEEE754` floating point standards have different representations for `positive zero` and `negative zero`, but that's just a fun peculiarity of a representation approach that decouples the value from its sign; numerically, `0` simply has no sign because it is both signs simultaneously. – RARE Kpop Manifesto Jan 05 '23 at 09:54
  • @knittl : as for the syntax for `parallel`, it's something similar to how you use `xargs`, with both pros and cons: it's far more expressive than `xargs`, but it's also much more of a resource hog, since it uses `perl` to spawn jobs. `xargs -0 -n 1 -P 16` becomes `parallel -0 -N 1 -j 16` - plus other minor bits. – RARE Kpop Manifesto Jan 05 '23 at 09:58
  • What is not assigned a number? I have the feeling these explanations make things more confusing. Why `nawk '_[$-__]--'` and not something like `nawk 'lines[$0]--'`? And why decrementing instead of incrementing? This is not codegolf.stackexchange.com, you know … – knittl Jan 05 '23 at 09:59
  • @knittl : what's cryptic about a syntax so simple it fits within a couple of short pages of `posix`? The standard explicitly says variable names can be combinations of `ASCII alphanumeric, plus underscore, but cannot begin with a number`. For the love of Allah, I beg you not to use `nawk` unless it's the last resort. Why not `seen[ ]`? Because now you know why `awk` is that much cleaner than `perl` – RARE Kpop Manifesto Jan 05 '23 at 10:09
  • I copied `nawk` from your answer, since you suggested to use it. I don't understand the comparison with perl. What's clearer about `_` that becomes confusing when the variable is called `lines` or `seen`? – knittl Jan 05 '23 at 10:15
  • Could you explain the "plus minor bits" about the usage of `parallel`? It seems like the answer is only half-complete and even the comments don't fully explain how to use the commands ("here's the start of the command, go figure the rest out yourself"). I'm sure that OP will appreciate that – knittl Jan 05 '23 at 10:16
  • Plus, I still don't know why `$_` is required, when `$0` would do just fine (at least to a novice)? Besides that "`_` is numeric zero, because unset". Why not use numeric zero in the first place? – knittl Jan 05 '23 at 10:17
  • @knittl : because now you're no longer a novice =) – RARE Kpop Manifesto Jan 05 '23 at 10:23
  • @knittl : OP needs to be a bit more specific about what that 100 GB entails before I could even have a sense of how to partition it by `regex` instead of merely by rows or by blocks of bytes. The partitions don't have to be uniformly sized at all, but there should be at least some semblance of evenly distributing the workload across the jobs, instead of a heavy skew due to poor partitioning criteria – RARE Kpop Manifesto Jan 05 '23 at 10:25