Spark

Resilient Distributed Datasets (RDD)
http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html

DAG (Directed Acyclic Graph) – Used to track dependencies
– Spark builds a DAG to track the dependencies of each partition of each RDD, so lost partitions can be recomputed from their lineage

Actions: collect, take, reduce, saveAsTextFile
-A transformation produces a new RDD from an existing one and stays on the worker nodes. An action sends a result from the worker nodes back to the Driver.
-A transformation is lazy; an action executes immediately and triggers the pending transformations.

Memory Cache
-By default, each job re-reads its input from HDFS
-Mark an RDD for caching with .cache()
-Lazy: the RDD is materialized in memory only when an action first computes it

Broadcast Variables
-Copy a large lookup table to all worker nodes
-Copy a large configuration dictionary to all worker nodes.
-Copy a small/medium sized RDD for a join

Accumulators: Common pattern of accumulating into a shared variable across the cluster. Tasks on the workers can only add to it; the Driver reads the final value.
example:
invalid = sc.accumulator(0)
invalid.add(1)  # called inside tasks on the worker nodes
invalid.value   # read back on the Driver
