1

I am new to prolog and am considering using it for a small data analysis application. Here is what I am seeking to accomplish:

I have a CSV file with some data of the following from:

a,b,c
d,e,f
g,h,i
...

The data is purely numerical and I need to do the following: 1st, I need to group rows according to the following scheme:

enter image description here

So what's going on above?

I start at the 1st row, which has value 'a' in column one. Then, I keep going down the rows until I hit a row whose value in column one differs from 'a' by a certain amount, 'z'. The process is then repeated, and many "groups" are formed after the process is complete.

For each of these groups, I want to find the mean of columns two and three (as an example, for the 1st group in the picture above, the mean of column two would be: (b+e+h)/3).

I am pretty sure this can be done in prolog. However, I have 50,000+ rows of data and since prolog is declarative, I am not sure how efficient prolog would be at accomplishing the above task?

Is it feasible to work out a prolog program to accomplish the above task, so that efficiency of the program is not significantly lower than a procedural analog?

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
killajoule
  • 3,612
  • 7
  • 30
  • 36
  • See http://stackoverflow.com/questions/12939794/stack-overflow-in-prolog-dcg-grammar-rule-how-to-handle-large-lists-efficiently/12942551#12942551 – false Nov 01 '13 at 02:54
  • OP reports that it took him 5-15 secs to process 5-10MB of data. I am okay with this level of efficiency (the 50,000+ rows I have take up only about 6 MB). However, would the time scale linearly with the file size? Or rather, would it be possible to devise code in prolog so that time would scale linearly with file size (it should, because I just need to go through all rows once to accomplish what I am seeking). – killajoule Nov 01 '13 at 03:20
  • 1
    Your concern of scaling was the motivation to create `library(pio)`. Code can now run side effect free and with space overheads proportional to the number of open choice points only. Typically, for a deterministic parsing this means constant space overheads. – false Nov 01 '13 at 03:38
  • This http://stackoverflow.com/questions/8108049/how-do-i-read-in-a-text-file-and-print-them-out-to-a-file-in-prolog/8108173#8108173 runs in constant space so `grep` is just as space efficient as the procedural counterpart. – false Nov 01 '13 at 03:49

1 Answers1

1

this snippet could be a starting point for your task

:- [library(dcg/basics)].

rownum(Z, AveList) :- phrase_from_file(row_scan(Z, [], [], AveList), 'numbers.txt').

row_scan(Z, Group, AveSoFar, AveList) -->
    number(A),",",number(B),",",number(C),"\n",
    { row_match(Z, A,B,C, Group,AveSoFar, Group1,AveUpdated) },
    row_scan(Z, Group1, AveUpdated, AveList).
row_scan(_Z, _Group, AveList, AveList) --> "\n";[].

% row_match(Z, A,B,C, Group,Ave, Group1,Ave1) 
row_match(_, A,B,C, [],Ave, [(A,B,C)],Ave).
row_match(Z, A,B,C, [H|T],Ave, Group1,Ave1) :-
    H = (F,_,_),
    (  A - F =:= Z
    -> aggregate_all(agg(count,sum(C2),sum(C3)),
        member((_,C2,C3), [(A,B,C), H|T]), agg(Count,T2,T3)),
       A2 is T2/Count, A3 is T3/Count,
       Group1 = [], Ave1 = [(A2,A3)|Ave]
    ;  Group1 = [H,(A,B,C)|T], Ave1 = Ave
    ).

with this input

1,2,3
4,5,6
7,8,9
10,2,3
40,5,6
70,8,9
16,0,0

yields

?- rownum(6,L).
L = [ (3.75, 4.5), (5, 6)] 
CapelliC
  • 59,646
  • 5
  • 47
  • 90