Saturday, 9 April 2016

Apache Spark RDD

RDD - Resilient Distributed Dataset

The RDD is the basic abstraction in Spark. There are five key characteristics of an RDD:

1. Fault Tolerant - an RDD can recompute itself in case of a node failure. Each RDD carries a lineage with which it can rebuild itself: the lineage records the initial dataset and every transformation applied to it. The lineage sketch after this list shows how a simple chain of map and filter transformations builds this lineage.

2. Distributed - Spark is a distributed computing framework, and so is the RDD. An RDD consists of an array of partitions, each of which represents a slice of the data in memory on a different node. Computations are applied to these partitions in parallel (see the partition sketch after this list).

3. Cacheable - we can cache/persist an RDD if we choose to. This makes a program far more efficient: instead of recomputing the same RDD every time it is needed, we reuse it by reading from the persisted copy (see the caching sketch after this list).

4. Immutable - Immutability is a basic concept in Big Data, and it is quite logical too: hardly anyone would want to change the nth line of a 1 TB dataset in place. Immutable data is also the perfect candidate for caching. RDDs are immutable, which means that whenever we apply a transformation to an RDD, it creates a new RDD (see the immutability sketch after this list).

5. Type-Inferred - This is something that comes from functional programming. We do not have to declare data types ourselves; the Scala compiler infers them from the underlying data, and type errors surface at compile time instead of at run time (see the type-inference sketch after this list).
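
Here is a minimal lineage sketch, assuming a spark-shell session where sc is the pre-built SparkContext. Each transformation adds a step to the lineage, and toDebugString prints the chain Spark would replay to rebuild a lost partition:

val nums    = sc.parallelize(1 to 10)     // initial dataset
val doubled = nums.map(_ * 2)             // transformation recorded in the lineage
val evens   = doubled.filter(_ % 4 == 0)  // another recorded transformation

// Prints the chain of parent RDDs; after a node failure, Spark replays
// this chain for just the lost partitions.
println(evens.toDebugString)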
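
A minimal partition sketch, again assuming sc from spark-shell. The second argument to parallelize asks Spark to spread the data over four partitions, and mapPartitions runs once per partition:

val data = sc.parallelize(1 to 100, 4)    // distribute the data over 4 partitions
println(data.getNumPartitions)            // 4

// Computations run per partition; here we count the elements in each one.
val sizes = data.mapPartitions(iter => Iterator(iter.size)).collect()
println(sizes.mkString(", "))             // e.g. 25, 25, 25, 25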
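
A caching sketch, assuming sc from spark-shell; the HDFS path is hypothetical. Without persist, each action below would re-read and re-filter the whole file:

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs:///logs/app.log")   // hypothetical input path
  .filter(line => line.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_ONLY)           // equivalent to errors.cache()

println(errors.count())   // first action: computes the RDD and caches it
println(errors.first())   // second action: served from the cached copy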
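
An immutability sketch, assuming sc from spark-shell. The map transformation returns a new RDD and leaves the original untouched:

val original = sc.parallelize(Seq(1, 2, 3))
val squared  = original.map(n => n * n)      // creates a NEW RDD

println(original.collect().mkString(", "))   // 1, 2, 3  -- unchanged
println(squared.collect().mkString(", "))    // 1, 4, 9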
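
Finally, a type-inference sketch, assuming sc from spark-shell. We never write the element types; the Scala compiler infers them and rejects type errors before the job ever runs:

val words   = sc.parallelize(Seq("spark", "rdd"))  // inferred as RDD[String]
val lengths = words.map(_.length)                  // inferred as RDD[Int]

// The next line would not compile: Int has no toUpperCase method.
// lengths.map(_.toUpperCase)

println(lengths.collect().mkString(", "))          // 5, 3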