I am using Pig to load data from Cassandra with CqlStorage. I have 4 data nodes, each of which can run 7 mappers, and there are about 30 million records in Cassandra. When I run the load like this:

`LOAD 'cql://keyspace/columnfamily' using CqlStorage` it runs with 27 mappers.

But if I give a where clause in the load function, like this:

`LOAD 'cql://keyspace/columnfamily?where_clause=id%3D100' using CqlStorage` it always runs with a single mapper.
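For reference, the `where_clause` value in the location string is percent-encoded: `id%3D100` is just `id=100`. The snippet below is a minimal sketch of how such a string could be built and decoded in Python. The helper name `cql_location` is hypothetical and is not part of Pig or Cassandra; it only illustrates the encoding.

```python
from urllib.parse import quote, unquote


def cql_location(keyspace, table, where=None):
    """Build a cql:// location string (hypothetical helper for illustration)."""
    url = f"cql://{keyspace}/{table}"
    if where is not None:
        # safe="" escapes every reserved character, including "="
        url += "?where_clause=" + quote(where, safe="")
    return url


# Location with the example filter from the question
print(cql_location("keyspace", "columnfamily", "id=100"))

# Decoding the encoded clause gives back the plain CQL condition
print(unquote("id%3D100"))
```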

Can anyone help me increase the number of mappers?

Shri
  • possible duplicate of [Change File Split size in Hadoop](http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop) – reo katoa May 16 '14 at 13:43
  • This is specifically about CqlStorage with a where clause. I have tried changing the split size, but I still get only one mapper. **Note: this happens only when I include the where clause.** – Shri May 16 '14 at 13:52

1 Answer

Judging from your WHERE clause, your map input will be only a single key, which would explain why you get only one mapper. Hadoop allocates one mapper per input split, so if you have only one input key, additional mappers would have nothing to do.

The bottom line is that if you specify your partition key in the where clause, you will get one mapper (since that's the way it gets distributed). Based on the comments I presume you are doing analysis for more than just one student, so there's no reason you'd be specifying the partition key. You also don't seem to have any columns that make sense for a secondary index. So I'm not sure why you even have a where clause.

It looks from your data model like you'll have to map over all your data to get aggregate marks for a combination of student and time range. It's possible you could change to a time-series data model and successfully filter in the where clause, but your current model doesn't support this.
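To make "map over all your data" concrete, here is a rough sketch in plain Python (not Pig) of the filter-then-aggregate logic such a job would perform. The rows are made up and only mimic the shape of the `student` table posted in the comments; summing `m1 + m2 + m3` is an assumption about what "aggregate marks" means.

```python
from collections import defaultdict
from datetime import date

# Made-up rows shaped like (fn, ln, date, time, m1, m2, m3)
rows = [
    ("Ann", "Lee", date(2014, 4, 10), "09:00", 10, 20, 30),
    ("Ann", "Lee", date(2014, 5, 2), "10:00", 5, 5, 5),
    ("Bob", "Ray", date(2014, 4, 15), "09:30", 1, 2, 3),
    ("Ann", "Lee", date(2014, 6, 1), "11:00", 7, 7, 7),
]


def total_marks(rows, start, end):
    """Sum m1 + m2 + m3 per (fn, ln) for rows with start <= date <= end."""
    totals = defaultdict(int)
    for fn, ln, day, _time, m1, m2, m3 in rows:
        # Every row is scanned; the date filter is applied after loading
        if start <= day <= end:
            totals[(fn, ln)] += m1 + m2 + m3
    return dict(totals)


print(total_marks(rows, date(2014, 4, 1), date(2014, 5, 31)))
```

In Pig the same shape would be a full load followed by a filter and a group, so the work is spread across all mappers instead of being pinned to one partition.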

rs_atl
  • That may be the case. Can you please suggest which API I can use for this? I mean reading data from Cassandra filtered on some condition, so that I don't put much load on Pig. – Shri May 20 '14 at 09:23
  • Can you provide some details about what you're trying to do? – rs_atl May 20 '14 at 13:01
  • Thanks for the response. I want to load data from Cassandra, process it with Pig, and dump it to HDFS. To do this I am using CqlStorage. But I don't want to load all the data from Cassandra; my criterion would be something like loading one or two months of data, so that Pig doesn't have a heavy load to process. However, a filter (where clause) in CqlStorage doesn't seem to work (see https://issues.apache.org/jira/browse/CASSANDRA-6151). So I am asking for suggestions on any alternative solution. – Shri May 21 '14 at 07:08
  • It would be helpful if you post your data model. – rs_atl May 21 '14 at 13:31
  • The schema in C* would be like this: `table student( fn,ln,date,time,m1,m2,m3 PK((fn,ln,date),time))`. I want to aggregate/pull the marks for a given student for a given date/time range. Once I put the result into HDFS, I can use `sqoop` to load the data into an `rdbms`, and my reports will point to the RDBMS to display a student's report. – Shri May 22 '14 at 12:09
  • But your where clause references "id". Where is this field in your schema? – rs_atl May 22 '14 at 13:36
  • Sorry, in the question I used "id" just as an example. – Shri May 22 '14 at 13:59
  • See my revised answer above. – rs_atl May 22 '14 at 14:27