We have recently started examining our Hadoop storage paradigm: Are DAS and JBODs the only way to do it right?
While these methods let Hadoop manage HDFS autonomously and preserve data locality, they also have some disadvantages.
For example, what about backups, snapshots, DR, and floor space consumption?
We encountered two rival NAS solutions that claim to be the most suitable for Hadoop implementation: NetApp, and EMC Isilon (both cooperate with Cloudera, which is the distribution we use).
The usual counterargument to these solutions is "what about data locality and IO bottlenecks?", but according to Hadoop's roadmap, we are about to get an effective replication factor of 1.2-1.5 anyway, based on Erasure Coding in Hadoop 3.x. How will that maintain better data locality than Isilon's 1.2 replication factor (also based on Erasure Coding)?
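To make the comparison concrete, here is a minimal sketch of the arithmetic behind these replication factors. It assumes Reed-Solomon style layouts in which k data units are protected by m parity units, so the raw storage consumed per byte of user data is (k + m) / k. The 6+3 and 10+4 layouts match erasure coding policies that ship with Hadoop 3.x, as far as I know; the 10+2 layout is only a hypothetical example of a layout that yields 1.2, not a statement about how Isilon is actually configured. The function and dictionary names are mine.

```python
def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data for a k+m erasure-coding layout."""
    if data_units <= 0 or parity_units < 0:
        raise ValueError("need data_units > 0 and parity_units >= 0")
    return (data_units + parity_units) / data_units


# Layouts to compare. Plain 3x replication is modelled as 1 data unit
# plus 2 extra copies, which gives the familiar factor of 3.0.
layouts = {
    "3x replication (classic HDFS)": (1, 2),
    "RS 6+3": (6, 3),
    "RS 10+4": (10, 4),
    "hypothetical 10+2": (10, 2),  # assumption: illustrative only
}

for name, (k, m) in layouts.items():
    print(f"{name}: {storage_overhead(k, m):.2f}x raw storage")
```

Note that this only covers capacity. With erasure coding the blocks of a file are striped across many nodes, so a low factor says nothing by itself about how much of a read can be served locally.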
So what's the point in keeping these hard-to-operate, hard-to-backup DAS and JBOD solutions when we can use central storage instead?
On the other hand, we are afraid of ending up with an immature solution that causes IO bottlenecks and makes it hard for us, the Hadoop administrators, to control our cluster.
So far, we have reached the conclusion that Isilon's solution for Hadoop is more mature than NetApp's. Yet we haven't seen any actual proof of that, and no one in Israel is using it (Isilon has happy customers here, but none of them is using it for Hadoop clusters).
I would like to hear your thoughts and experience with the different solutions.