.NET and Hadoop - What should I know / learn and what is available?

Question

Information

My question is about BigData in .NET. BigData technologies are used to store and query huge amounts of data (Facebook, Google, Twitter, ...). Examples of such technologies are MapReduce, Hadoop, Dryad, etc.

Microsoft dropped their Dryad (DryadLinq) alternative in favor of Hadoop (Dryad and the article), so I'd like to prepare myself for it and everything that has to do with it.

What I already know

What is available now?

Hadoop Connector

SQL Server 2012 RC (don't use in production :))

Microsoft Information on Big Data

What should I know more about releases and development?

Register on the TechPreview

Questions

Question 1: What should I know about Hadoop that isn't unique to the .NET platform (how to query, specific patterns, architecture, ...) and that will be useful in a .NET environment?

Question 2: Is there more information on Hadoop on the .NET platform than what I already know?

"BigData is used to store and query huge ammounts of data" --> That's where Hadoop confuses everyone. Big Data is ability to (a) Run computation in parallel, and (b) Run computation against large amount of data. Hadoop in addition to farming out calculation to nodes (Job/Task Tracker), it also persists data on HDFS (Hadoop Distributed File System). HDFS is why Hadoop claims throne to scalability, as many firms has grid which scales perfect in terms of farming out calculation to nodes, but bottleneck on database tier. Get around this bottleneck? database clustering. — Swab.Jat, Feb 07 '14 at 09:41

score 10 · Accepted Answer · edited Dec 19 '11 at 20:11

It's a vague question, so here's a vague answer :)

Hadoop on its own is a tool for running map-reduce jobs on a cluster. It's highly optimized for performance, and a good deal of that optimization comes from distributing the data in a way that makes it easy to consume without incurring I/O penalties.

For this you should read about HDFS and the internals that explain how this is done. In a nutshell, the input data is clumped together on the nodes so that the processes run locally against it and read it sequentially (this is a property/limitation of HDFS).

This way you feed in your "BigData" and it gets split and processed in the most efficient way inside the cluster.
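To make the split/process idea a bit more concrete, here's a minimal single-machine sketch of the map-reduce pattern in plain Python (a word count). It is not Hadoop code and uses no Hadoop API; the function names (split_input, map_block, shuffle, reduce_counts, word_count) and the block_size parameter are made up for illustration. On a real cluster each block would live on a different node and the map step would run next to the data.

```python
from collections import defaultdict
from itertools import chain


def split_input(lines, block_size):
    """Split the input into fixed-size blocks (a loose stand-in for HDFS blocks)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Group values by key, which a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    """Run the whole pipeline: split, map each block, shuffle, reduce."""
    blocks = split_input(lines, block_size)
    # Each block is mapped independently; here that happens sequentially,
    # on a cluster these calls could run in parallel on different nodes.
    mapped = chain.from_iterable(map_block(block) for block in blocks)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    data = [
        "big data on hadoop",
        "hadoop stores data on hdfs",
        "hdfs splits data",
    ]
    print(word_count(data))
```

The point of the sketch is that map_block only ever looks at its own block, which is what lets the work be moved to wherever the data sits.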

Now, that's all there is to Hadoop itself. There are tools that work on top of it and let you use higher-level abstractions over the data (map-reduce is among the simplest procedures).

Those include:

Pig http://pig.apache.org/ which is a language to work with the map-reduce process and construct more complex operations
Hive http://hive.apache.org/ similar to the previous but more SQL-oriented
Cascading http://www.cascading.org/ yet another, more focused on data flow than queries
Cascalog https://github.com/nathanmarz/cascalog based on Cascading, written in Clojure
HBase http://hbase.apache.org/ a type of NoSQL database on top of HDFS
ElephantDB https://github.com/nathanmarz/elephantdb another NoSQL database for Hadoop

Specifics for .Net

For Hadoop on Azure (.NET), there's an introduction on MSDN here, with more info here, related to building Hadoop applications through their platform. It's only a CTP for now, but of course this will change.

Here's another good blogpost about Hadoop and MapReduce with code

Additionally, there's a company that frequently publishes information about Hadoop: Cloudera. Check the Cloudera page linked above regularly; it covers all the concepts behind Hadoop (it's pretty advanced, though).

I'm pretty sure this isn't what you were looking for, but I've no idea what you want, so at least I hope you can check out a few new projects that may help.

Also check Storm (https://github.com/nathanmarz/storm): it's not related to Hadoop, but it handles realtime scenarios, which Hadoop is not suited for.

score 1 · Answer 2 · edited Mar 28 '12 at 08:29

At the moment, there is not much .NET-specific material for Hadoop. You just follow the regular Hadoop tutorials. The SQL Server connector only lets you import input data and export results to a format that is easier to access from the rest of your application.

You can run Hadoop on Windows. However, it requires Cygwin (a Unix-like environment and command-line interface for Microsoft Windows).

Basically, to use Hadoop you will need to learn Linux anyway.
