Combining Lists of Word Frequency Data

Question

This seems like it should be an obvious question, but the tutorials and documentation on lists are not forthcoming. My difficulties stem from the sheer size of my text files (hundreds of MB) and my attempts to boil them down to something my system can manage. As a result, I'm doing my work in segments and am now trying to combine the results.

I have multiple word frequency lists (~40 of them). The lists can either be taken through Import[ ] or as variables generated in Mathematica. Each list appears as the following and has been generated using the Tally[ ] and Sort[ ] commands:

{{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 16850}, {"in", 16164}, {"de", 14930}, {"a", 14660}, {"to", 14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 5735}, <<51293>>, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

Here is an example of the second file:

{{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 16262}, {"and", 14488}, {"to", 12726}, {"a", 12635}, {"in", 11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 7748}, <<101032>>, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

I want to combine them so that the frequency data aggregates: i.e. if the second file has 30,419 occurrences of 'the' and is joined to the first file, it should return that there are 72,635 occurrences (and so on as I move through the entire collection).

A closely related question: http://stackoverflow.com/questions/5143575/aggregating-tally-counters — Mr.Wizard, Oct 24 '11 at 19:04
Also somewhat related: http://stackoverflow.com/questions/7749633/time-efficient-partial-inverted-index-building/ — Leonid Shifrin, Oct 24 '11 at 22:28

Szabolcs · Accepted Answer · score 10 · 2011-10-24T14:24:36.583

It sounds like you need GatherBy.

Suppose your two lists are named data1 and data2. Then use

{#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join[data1, data2], First]

This easily generalizes to any number of lists, not just two.
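As a minimal sketch of that generalization: if the tallies are collected in one list of lists, `Join @@` flattens them before grouping. The variable names `lists` and `combined` and the sample counts are made up for illustration; they stand in for the ~40 real lists.

```mathematica
(* Hypothetical sample data: three small tallies standing in for the real lists *)
lists = {
   {{"the", 5}, {"of", 3}, {"aaa", 1}},
   {{"the", 2}, {"la", 4}},
   {{"of", 1}, {"la", 1}, {"the", 1}}
   };

(* Put every {word, count} pair into a single list, group the pairs by word,
   then sum the counts within each group *)
combined = {#[[1, 1]], Total[#[[All, 2]]]} & /@
   GatherBy[Join @@ lists, First];

(* GatherBy keeps groups in order of first appearance, so re-sort by
   descending count if the result should look like the input tallies *)
SortBy[combined, -Last[#] &]
```

Since the lists can also come from `Import`, the same expression should work once the imported lists are placed in `lists`.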

edited Oct 24 '11 at 14:24

answered Oct 24 '11 at 13:57

Szabolcs


score 8 · Answer 2 · answered Oct 24 '11 at 13:36

Try using a hash table, like this. First set things up:

ClearAll[freq];
freq[_] = 0;

Now, e.g., freq["safas"] returns 0. Next, if the lists are defined as

lst1 = {{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 
    16850}, {"in", 16164}, {"de", 14930}, {"a", 14660}, {"to", 
    14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 
    5735}, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 
    1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 
    1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 
    1}, {"aaa", 1}};
lst2 = {{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 
    16262}, {"and", 14488}, {"to", 12726}, {"a", 12635}, {"in", 
    11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 
    7748}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 
    1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 
    1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}};

you may run this

Scan[(freq[#[[1]]] += #[[2]]) &, lst1]

after which, e.g.,

freq["the"]
(*
42216
*)

and then the next list

Scan[(freq[#[[1]]] += #[[2]]) &, lst2]

after which, e.g.,

freq["the"]
(*
72635
*)

while still

freq["safas"]
(*
0
*)

This works really quickly! Is there a way, however, to output the list again with the final results - i.e. an aggregate list of all the terms (i.e. {{"the",72635},{"of",41165}...) [can be in any format] — canadian_scholar, Oct 24 '11 at 13:49
@ian.milligan try eg this http://stackoverflow.com/questions/7165169/picking-specific-symbol-definitions-in-mathematica-not-transformation-rules/7169185#7169185 — acl, Oct 24 '11 at 14:24
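To get the totals back out as a list, as asked in the comment above, one option is to read them from the definitions stored on `freq`. This is only a sketch under stated assumptions: `small1`, `small2` and `pairs` are made-up names with made-up counts, and the order in which `DownValues` returns the entries is not relied on.

```mathematica
ClearAll[freq];
freq[_] = 0;

(* Hypothetical small tallies standing in for lst1 and lst2 *)
small1 = {{"the", 5}, {"of", 3}};
small2 = {{"the", 2}, {"la", 4}};

Scan[(freq[#[[1]]] += #[[2]]) &, Join[small1, small2]];

(* Each stored definition looks like HoldPattern[freq["the"]] :> 7.
   Extract the word and the count, then keep only pairs with a string key,
   which drops the catch-all freq[_] = 0 definition *)
pairs = Cases[
   {#[[1, 1]], #2} & @@@ DownValues[freq],
   {_String, _Integer}];

(* Sort by descending count *)
SortBy[pairs, -Last[#] &]
```

Filtering on `{_String, _Integer}` avoids depending on where the default rule sits in the list of definitions.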

Mr.Wizard · Answer 3 · score 8 · 2011-10-24T19:51:11.010

Here is a direct Sow/Reap function:

Reap[#2~Sow~# & @@@ data1~Join~data2;, _, {#, Tr@#2} &][[2]]
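For readers less used to the infix notation, the same idea can be written out in longer form. This sketch uses `Scan` and `Total` in place of `@@@` and `Tr`, and the sample lists are hypothetical. Each count is sown with its word as the tag, and `Reap` then calls the third argument once per tag with the collected counts.

```mathematica
(* Hypothetical small tallies standing in for data1 and data2 *)
data1 = {{"the", 5}, {"of", 3}};
data2 = {{"the", 2}, {"la", 4}};

combined = Reap[
    (* Sow[count, word]: the word acts as the tag under which counts collect *)
    Scan[Sow[#[[2]], #[[1]]] &, Join[data1, data2]],
    _,
    (* Called as f[word, {counts...}] for each distinct word *)
    {#1, Total[#2]} &
    ][[2]]
```

Because `Scan` returns `Null`, no trailing semicolon is needed to suppress the intermediate output.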

Here is a concise form of acl's method:

Module[{c},
  c[_] = 0;

  c[#] += #2 & @@@ data1~Join~data2;

  {#[[1, 1]], #2} & @@@ Most@DownValues@c
]

This appears to be a bit faster than Szabolcs's code on my system:

data1 ~Join~ data2 ~GatherBy~ First /.
  {{{x_, a_}, {x_, b_}} :> {x, a + b}, {x : {_, _}} :> x}
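One caveat when joining more than two lists at once: the two rules above only match groups containing one or two pairs, so a word that occurs in three or more lists would be left as an unsummed group. A possible variant that handles groups of any size is sketched below. The names `lists`, `groups` and `combined` and the sample data are hypothetical, and no timing claim is made for it.

```mathematica
(* Hypothetical sample data: "the" occurs in all three lists *)
lists = {
   {{"the", 5}, {"of", 3}, {"aaa", 1}},
   {{"the", 2}, {"la", 4}},
   {{"of", 1}, {"la", 1}, {"the", 1}}
   };

groups = GatherBy[Join @@ lists, First];

(* Replace only at level 1, i.e. whole groups: a group is one or more
   pairs sharing the same word w, and its counts are summed *)
combined = Replace[groups,
   grp : {{w_String, _Integer} ..} :> {w, Total[grp[[All, 2]]]},
   {1}]
```

Restricting `Replace` to level 1 keeps the pattern from being tried against the outer list or the individual pairs.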

edited Oct 24 '11 at 19:51

answered Oct 24 '11 at 19:06

Mr.Wizard


Also, the compact form of acl's method is particularly clever. I do have one question, though: in the `Reap` implementation, why the semicolon after the `Sow` statement? – rcollyer Oct 25 '11 at 16:08
@rcollyer that's a good question. It's an old habit, but maybe not a good one. It visually reminds me that I am not using the direct result of that expression and I once thought it was more efficient, but that does not seem to be the case. It does have the advantage of suppressing a large output, which can be very slow to format even with the Skeleton display, if I forget the `[[2]]` after `Reap`. I should consider using `Scan` or `Do` instead. – Mr.Wizard Oct 25 '11 at 18:51

score 6 · Answer 4 · edited May 23 '17 at 11:44


There's an old saying, "if all you have is a hammer, everything becomes a nail." So, here's my hammer: SelectEquivalents.

This can be done a little quicker using SelectEquivalents:

SelectEquivalents[data1~Join~data2, #[[1]]&, #[[2]]&, {#1, Total[#2]}&]

In order, the first param is obviously just the joined lists, the second one is what they're grouped by (in this case the first element), the third param strips off the string leaving just the count, and the fourth param puts it back together with the string as #1 and the counts in a list as #2.
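`SelectEquivalents` is not a built-in function, so it has to be defined before the line above will run. The definition below is an assumption, written to match the four-argument behavior just described (group by the second argument, transform with the third, combine with the fourth); the version in the linked tool-bag answer may differ in detail. The sample lists are hypothetical.

```mathematica
(* Assumed definition, not taken from this page: tag each transformed
   element g[elem] with f[elem], then combine each tag and its values with h *)
ClearAll[SelectEquivalents];
SelectEquivalents[x_List, f_:Identity, g_:Identity, h_:(#2 &)] :=
  Reap[Sow[g[#], {f[#]}] & /@ x, _, h][[2]];

(* Hypothetical small tallies standing in for data1 and data2 *)
data1 = {{"the", 5}, {"of", 3}};
data2 = {{"the", 2}, {"la", 4}};

result = SelectEquivalents[Join[data1, data2],
   #[[1]] &, #[[2]] &, {#1, Total[#2]} &]
```

With this definition the call is a thin wrapper around the `Sow`/`Reap` approach, with the grouping key, the value and the combiner passed in as functions.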


answered Oct 24 '11 at 14:52

rcollyer


@ian.milligan, also check out [Faysal's variant](http://stackoverflow.com/questions/4198961/what-is-in-your-mathematica-tool-bag/6245166#6245166) of `SelectEquivalents`. I, personally, wouldn't make everything an `Option`, but his variant is extremely flexible. And, both versions are more than capable of running rings around `GatherBy`. – rcollyer Oct 24 '11 at 15:34

DavidC · Answer 5 · score 3 · 2011-10-24T13:44:42.230

Try ReplaceRepeated.

Join the lists. Then use

//. {{f1___, {a_, c1_}, f2___, {a_, c2_}, f3___} -> {f1, f2, f3, {a, c1 + c2}}}
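Put together, the steps might look like the sketch below. The sample lists and the name `merged` are hypothetical. The sketch also uses `:>` instead of `->`, an assumption on my part, so that `c1 + c2` is only evaluated after the pattern variables are filled in, even if symbols named `c1` or `c2` already have values in the session.

```mathematica
(* Hypothetical small tallies standing in for the real lists *)
small1 = {{"the", 5}, {"of", 3}};
small2 = {{"the", 2}, {"la", 4}};

(* Repeatedly find two pairs with the same word a anywhere in the list,
   remove both, and append one pair holding the summed count, until no
   word appears twice *)
merged = Join[small1, small2] //.
   {f1___, {a_, c1_}, f2___, {a_, c2_}, f3___} :> {f1, f2, f3, {a, c1 + c2}}
```

Each replacement rescans the list for a matching pair, so this is likely to be practical only for short lists.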

edited Oct 24 '11 at 13:44

answered Oct 24 '11 at 13:24

DavidC


I'm sure there are faster ways. In my edit I placed {a,c1+c2} at the end of the rule's output, to save a bit of time. – DavidC Oct 24 '11 at 13:53
Though conceptually interesting, this is very slow. – Mr.Wizard Oct 24 '11 at 18:59
