10

Assume I have two lists, one is the text t, one is a list of characters c. I want to count how many times each character appears in the text.

This can be done easily with the following APL code.

+⌿t∘.=c

However it is slow. It take the outer product, then sum each column.

It is a O(nm) algorithm where n and m are the size of t and c.

Of course I can write a procedural program in APL that read t character by character and solve this problem in O(n+m) (assume perfect hashing).

Are there ways to do this faster in APL without loops(or conditional)? I also accept solutions in J.

Edit: Practically speaking, I'm doing this where the text is much shorter than the list of characters(the characters are non-ascii). I'm considering where text have length of 20 and character list have length in the thousands.

There is a simple optimization given n is smaller than m.

w  ← (∪t)∩c
f ←  +⌿t∘.=w
r ← (⍴c)⍴0
r[c⍳w] ← f
r

w contains only the characters in t, therefore the table size only depend on t and not c. This algorithm runs in O(n^2+m log m). Where m log m is the time for doing the intersection operation.

However, a sub-quadratic algorithm is still preferred just in case someone gave a huge text file.

Chao Xu
  • 2,156
  • 2
  • 22
  • 31
  • I can't see how it can be O(n+m)... Even if there was a J verb dedicated to do just that, wouldn't it be O(nm)? – MPelletier Aug 10 '11 at 13:56
  • Assume we have this operation addCount(k). This add 1 to the counter that record how many times character k appeared. If this operation can be done in constant time (it's possible if there is a perfect hash function), then we use O(n+m) time. Even if we don't have such hash function, we can still use a binary tree to get a O(n log m) algorithm. – Chao Xu Aug 10 '11 at 15:00
  • It's an interesting case, to say the least. Do you have figures on how much data you're using (size of t and c), and how long it takes? – MPelletier Aug 10 '11 at 15:26
  • Indeed, I added an update to reflect the practical problem I'm trying to solve. – Chao Xu Aug 11 '11 at 03:05
  • Thousands of characters in c? Are those Unicode characters? – MPelletier Aug 11 '11 at 03:09
  • Yes. Of course this question can be generalized to find the frequency of elements other than characters, and the only change is switch = to ≡. – Chao Xu Aug 11 '11 at 03:23
  • Is it possible that t contains characters not present in c? For example could it be that `t = 'abcd1234'` and `c = 'abcdefgh'`? – Eelvex Aug 13 '11 at 14:57

7 Answers7

8

NB. Using "key" (/.) adverb w/tally (#) verb counts

   #/.~ 'abdaaa'
4 1 1

NB. the items counted are the nub of the string.

   ~. 'abdaaa'
abd

NB. So, if we count the target along with the string

   #/.~ 'abc','abdaaa'
5 2 1 1

NB. We get an extra one for each of the target items.

   countKey2=: 4 : '<:(#x){.#/.~ x,y'

NB. This subtracts 1 (<:) from each count of the xs.

   6!:2 '''1'' countKey2 10000000$''1234567890'''
0.0451088
   6!:2 '''1'' countKey2 1e7$''1234567890'''
0.0441849
   6!:2 '''1'' countKey2 1e8$''1234567890'''
0.466857

NB. A tacit version

   countKey=. [: <: ([: # [) {. [: #/.~ ,

NB. appears to be a little faster at first

   6!:2 '''1'' countKey 1e8$''1234567890'''
0.432938

NB. But repeating the timing 10 times shows they are the same.

   (10) 6!:2 '''1'' countKey 1e8$''1234567890'''
0.43914
   (10) 6!:2 '''1'' countKey2 1e8$''1234567890'''
0.43964
DevonMcC
  • 351
  • 3
  • 3
7

Dyalog v14 introduced the key operator ():

      {⍺,⍴⍵}⌸'abcracadabra'
a 5
b 2
c 2
r 2
d 1

The operand function takes a letter as and the occurrences of that letter (vector of indices) as .

John Stewart
  • 81
  • 1
  • 2
5

I think this example, written in J, fits your request. The character list is longer than the text (but both are kept short for convenience during development.) I have not examined timing but my intuition is that it will be fast. The tallying is done only with reference to characters that actually occur in the text, and the long character set is looked across only to correlate characters that occur in the text.

   c=: 80{.43}.a.
   t=: 'some {text} to examine'

   RawIndicies=: c i. ~.t
   Mask=: RawIndicies ~: #c
   Indicies=: Mask # RawIndicies
   Tallies=: Mask # #/.~ t
   Result=: Tallies Indicies} (#c)$0

   4 20 $ Result
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 4 0
0 0 1 0 0 0 2 1 2 0 0 0 1 3 0 0 0 2 0 0
   4 20 $ c
+,-./0123456789:;<=>
?@ABCDEFGHIJKLMNOPQR
STUVWXYZ[\]^_`abcdef
ghijklmnopqrstuvwxyz
kaleidic
  • 3,070
  • 1
  • 21
  • 22
1

As noted in other answers, the key operator does this directly. However the classic APL way of solving this problem is still worth knowing.

The classic solution is "sort, shift, and compare":

      c←'missippi'
      t←'abcdefghijklmnopqrstuvwxyz'
      g←⍋c
      g
1 4 7 0 5 6 2 3
      s←c[g]
      s
iiimppss
      b←s≠¯1⌽s
      b
1 0 0 1 1 0 1 0
      n←b/⍳⍴b
      n
0 3 4 6
      k←(1↓n,⍴b)-n
      k
3 1 2 2
      u←b/s
      u
imps

And for the final answer:

       z←(⍴t)⍴0
       z
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
       z[t⍳u]←k
       z
0 0 0 0 0 0 0 0 3 0 0 0 1 0 0 2 0 0 2 0 0 0 0 0 0 0

This code is off the top of my head, not ready for production. Have to look for empty cases - the boolean shift is probably not right for all cases....

Paul Mansour
  • 1,436
  • 9
  • 11
0

My implementation in APL (NARS2000):

(∪w),[0.5]∪⍦w←t∩c

Example:

      c←'abcdefg'
      t←'abdaaaerbfqeiurbouebjkvwek'
      (∪w),[0.5]∪⍦w←t∩c
a b d e f
4 4 1 4 1

Note: showing only those characters in c that exist in t

ManAndPC
  • 41
  • 2
0

My initial thought was that this was a case for the Find operator:

T←'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
C←'MISSISSIPPI'
X←+/¨T⍷¨⊂C

The used characters are:

       (×X)/T
IMPS

Their respective frequencies are:

       X~0
4 1 2 4

I've only run toy cases so I have no idea what the performance is, but my intuition tells me it should be cheaper that the outer product. Any thoughts?

mappo
  • 444
  • 2
  • 9
  • I just saw that I flipped the variables C and T from the OP, but the logic is the same. – mappo Apr 24 '15 at 12:08
0

"Brute force" in J:

count =: (i.~~.) ({,&0) (]+/"1@:=)

Usage:

   'abc' count 'abdaaa'
4 1 0

Not sure how it's implemented internally, but here are the timings for different input sizes:

   6!:2 '''abcdefg'' count 100000$''abdaaaerbfqeiurbouebjkvwek''' NB: run time for #t = 100000 
0.00803909
   6!:2 '''abcdefg'' count 1000000$''abdaaaerbfqeiurbouebjkvwek'''
0.0845451
   6!:2 '''abcdefg'' count 10000000$''abdaaaerbfqeiurbouebjkvwek''' NB: and for #t = 10^7
0.862423

We don't filter input date prior to 'self-classify' so:

   6!:2 '''1'' count 10000000$''1'''
0.244975
   6!:2 '''1'' count 10000000$''1234567890'''
0.673034
   6!:2 '''1234567890'' count 10000000$''1234567890'''
0.673864
Ed'ka
  • 6,595
  • 29
  • 30