Resilient Distributed Datasets (RDD)

DAG (Direct Acyclic Graphs) – Used to track dependencies
– DAG is used to track dependencies of each partition of each RDD

Actions: collect, take, reduce, saveAsTextFile
-A transformation is from worker nodes to worker nodes. An action is between worker nodes and the Driver
-A transformation is lazy, an action instead executes immediately.

Memory Cache
-By Default each job processes from HDFS
-Mark RDD with .cache()

Broadcast Variables
-Copy a large lookup table to all worker nodes
-Copy a large configuration dictionary to all worker nodes.
-Copy a small/medium sized RDD for a join

Accumulators: Common pattern of accumulating to a variable across the cluster.
invalid = sc.accumulator(0)