Questions tagged [cascalog]

Cascalog is a fully-featured data processing and querying library for Clojure. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer from the Clojure REPL. Cascalog is a replacement for tools like Pig, Hive, and Cascading.

Cascalog operates at a significantly higher level of abstraction than a tool like SQL. More importantly, its tight integration with Clojure gives you the power to use abstraction and composition techniques with your data processing code just like you would with any other code. It's this latter point that sets Cascalog far above any other tool in terms of expressive power.

Cascalog is easy to install, with a five-minute setup.

Cascalog is hosted on GitHub.

Source

26 questions
7
votes
0 answers

Is datalog more efficient than SQL for column oriented databases?

Both Cascalog and Datomic have chosen to use Datalog (over SQL) as their query engine. Dave Thomas made the claim: Datalog is better than SQL for large queries in small amounts of space. My question is: Is datalog more efficient than SQL for…
hawkeye
  • 34,745
  • 30
  • 150
  • 304
5
votes
4 answers

IllegalArgumentException The bucketName parameter must be specified. com.amazonaws.services.s3.AmazonS3Client.rejectNull

Running a Clojure jar on an AWS-EMR cluster using (hfs-textline) and getting: IllegalArgumentException The bucketName parameter must be specified. com.amazonaws.services.s3.AmazonS3Client.rejectNull.
yonatan
  • 595
  • 1
  • 4
  • 18
5
votes
1 answer

Cascalog Hadoop version support

I notice that the Cascalog getting started guide specifies a version of Hadoop :profiles { :dev {:dependencies [[org.apache.hadoop/hadoop-core "1.0.3"]]}} If my group uses a different version of Hadoop then am I out of luck? More broadly with what…
MRocklin
  • 55,641
  • 23
  • 163
  • 235
4
votes
1 answer

Cascalog deffilterop vs pure clojure

Is there a difference, performance or otherwise, between using a deffilterop and using a pure Clojure function? http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html mentions that filtering can be done with…
ajorgensen
  • 4,591
  • 3
  • 17
  • 16
3
votes
2 answers

clojure: parallel processing using multiple computers

I have 500 directories, each containing 1000 files (each about 3-4k lines). I want to run the same Clojure program (already written) on each of these files. I have 4 octa-core servers. What is a good way to distribute the processes across…
Pradnyesh Sawant
  • 524
  • 5
  • 12
3
votes
1 answer

Unable to resolve symbol in a predicate in Cascalog

I have this query: (?<- (hfs-textline data-out :sinkmode :replace) [?item1 ?item2] ((hfs-textline data-in) ?line) (data-line? ?line) (filter-out-data (#(vector (s/split % #",")) ?line) :> ?item1 ?item2) …
Anna Pawlicka
  • 757
  • 7
  • 22
3
votes
1 answer

Turning co-occurrence counts into co-occurrence probabilities with cascalog

I have a table of co-occurrence counts stored on s3 (where each row is [key-a, key-b, count]) and I want to produce the co-occurrence probability matrix from it. To do that I need to calculate the sum of the counts for each key-a, and then divide…
bobpoekert
  • 934
  • 1
  • 11
  • 26
2
votes
0 answers

Keeping file name information with Cascalog Tuples

I'm looking for a way of keeping a filename that's associated with the tuples/data that originate from that particular file. I've searched around and found that hfs-wholefile works really well at getting filenames but it then returns a large chunk…
mcgeep
  • 53
  • 6
2
votes
1 answer

Supplying a default value for left outer joins

I was wondering what would be the best way of specifying a default value when doing an outer-join in cascalog for field that could be null. (def example-query (<- [?id ?fname ?lname !days-active] (users :> ?id ?fname ?lname) …
mcgeep
  • 53
  • 6
2
votes
4 answers

Clojure Hadoop - 5 Lines of Cascalog equivalent to 300 lines of PIG?

In this presentation at slides 36 and 37 - the author of Cascalog asserts that given a data set of names and ages like: [name age] that the query to return all the results that are greater than the average age is 300 lines of PIG. Is this a valid…
hawkeye
  • 34,745
  • 30
  • 150
  • 304
1
vote
1 answer

Writing from cascalog to MySQL does not work. How to debug this?

I'm trying to write the result of a Cascalog query into a MySQL database. For this, I'm using cascading-jdbc and following an example I found here. I'm using cascading-jdbc-core and cascading-jdbc-mysql in version 3.0.0. I'm executing precisely this…
Sh4pe
  • 1,800
  • 1
  • 14
  • 30
1
vote
1 answer

Jcascalog to query thrift data on HDFS

I read Nathan Marz's book on the lambda architecture, and I'm building a proof of concept of this solution. I'm having difficulty building my JCascalog query. This is the relevant piece of my Thrift schema: union…
Spierki
  • 241
  • 5
  • 16
1
vote
1 answer

Transposing / pivoting rows to columns in Cascalog?

Let's say I have a set of tuples to be processed by Cascalog, formatted like [Date, Name, Value], e.g. 2014-01-01 Pizza 3 2014-01-01 Hamburger 4 2014-01-01 Cheeseburger 2 2014-01-02 Pizza 1 2014-01-02 Hamburger 2 Given that I…
Christoffer
  • 197
  • 8
1
vote
0 answers

JCascalog/Pail shredding stage works locally, but not in Hadoop

Following the "Big Data" Lambda Architecture book, I've got an incoming directory full of typed Thrift Data objects, with a DataPailStructure defined pail.meta file. I take a snapshot of this data: Pail snapshotPail =…
TobyEvans
  • 1,431
  • 2
  • 21
  • 27
1
vote
0 answers

what does "ClassCastException java.lang.Character cannot be cast to clojure.lang.Named" mean?

In a toy cascalog based project, I'm trying to use cascalog.more-taps because it contains some facilities to read and write to/from the filesystem. When loading my namespace I get this error message user=> (use…
user1632812
  • 431
  • 3
  • 16