Resilient Distributed Datasets (RDD)

DAG (Directed Acyclic Graph)
– Tracks the dependencies of each partition of each RDD, so a lost partition can be recomputed from its parents

Actions: collect, take, reduce, saveAsTextFile
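-A transformation produces a new RDD from an existing one, so data moves from worker nodes to worker nodes. An action returns a result from the worker nodes to the Driver or, like saveAsTextFile, writes it to storage.
-A transformation is lazy: it only records what to compute. An action triggers execution of every pending transformation it depends on.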
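
The split between lazy transformations and eager actions can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark's API: `ToyRDD`, `double`, and `calls` are hypothetical names that only mimic how transformations record work and an action triggers it.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # the input data (a plain list here)
        self._steps = tuple(steps)   # recorded transformations, not yet run

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    # Actions: run the recorded steps and hand the result back to the caller.
    def collect(self):
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def take(self, n):
        return self.collect()[:n]


calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


rdd = ToyRDD([1, 2, 3, 4]).map(double).filter(lambda x: x > 2)
ran_before_action = len(calls)   # still 0: nothing has executed yet
result = rdd.collect()           # the action runs the whole chain
```

Here `ran_before_action` stays at 0 because building the chain only records steps; `double` runs for the first time inside `collect()`.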

Memory Cache
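-By default, each job recomputes an RDD from its source, for example by reading from HDFS again
-Mark an RDD with .cache() to keep it in memory after it is first computed, so later jobs can reuse it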
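
The effect of caching can be sketched by counting how often the source is read. This is a plain-Python toy under stated assumptions, not Spark's implementation: `ToyDataset`, `load_from_storage`, and the `reads` counter are hypothetical, and each `count()` call plays the role of one job.

```python
reads = {"count": 0}  # how many times the "storage" has been read


def load_from_storage():
    """Hypothetical loader standing in for a read from HDFS."""
    reads["count"] += 1
    return ["a", "b", "c"]


class ToyDataset:
    """Hypothetical stand-in for an RDD with an optional in-memory cache."""

    def __init__(self, loader):
        self._loader = loader
        self._cache_enabled = False
        self._cached = None

    def cache(self):
        # Only marks the dataset; nothing is loaded until the first job runs.
        self._cache_enabled = True
        return self

    def count(self):
        # Reuse the in-memory copy if an earlier job already stored it.
        if self._cached is not None:
            return len(self._cached)
        data = self._loader()
        if self._cache_enabled:
            self._cached = data
        return len(data)


uncached = ToyDataset(load_from_storage)
uncached.count()
uncached.count()
reads_without_cache = reads["count"]   # two jobs, two reads

reads["count"] = 0
cached = ToyDataset(load_from_storage).cache()
cached.count()
cached.count()
reads_with_cache = reads["count"]      # two jobs, one read
```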

Broadcast Variables
-Copy a large lookup table to all worker nodes
-Copy a large configuration dictionary to all worker nodes.
-Copy a small or medium-sized dataset (for example, the collected contents of a small RDD) to all worker nodes for a join
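
The join use case can be sketched in plain Python. This is an illustration of the idea only, assuming made-up data: `country_names`, `partitions`, and `join_partition` are hypothetical. In PySpark the table would be shipped with `sc.broadcast(...)` and read on the workers through the broadcast object's `.value`.

```python
# Hypothetical data: a small lookup table and a larger dataset split into partitions.
country_names = {"US": "United States", "DE": "Germany"}
partitions = [
    [("alice", "US"), ("bob", "DE")],
    [("carol", "US"), ("dave", "FR")],
]


def join_partition(rows, lookup):
    """Join one partition against the local copy of the lookup table."""
    return [(name, lookup.get(code, "unknown")) for name, code in rows]


# Each "worker" receives its own copy of the table once, then joins locally,
# so the larger dataset never has to be shuffled between workers.
joined = []
for rows in partitions:
    local_copy = dict(country_names)   # stands in for the broadcast copy
    joined.extend(join_partition(rows, local_copy))
```

The design choice is to move the small side of the join to the data rather than moving the large side across the network.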

Accumulators: a common pattern for adding to a shared variable from tasks across the cluster. Worker nodes can only add to it; the Driver reads the total.
invalid = sc.accumulator(0)
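
A typical use is counting bad records while parsing. The sketch below is plain Python and runs without a cluster: `ToyAccumulator`, `parse`, and the sample `lines` are hypothetical stand-ins. With a real accumulator such as the one above, tasks would call `invalid.add(1)` and the driver would read `invalid.value`.

```python
class ToyAccumulator:
    """Hypothetical stand-in for a Spark accumulator: tasks add, the driver reads."""

    def __init__(self, initial):
        self.value = initial

    def add(self, amount):
        self.value += amount


invalid = ToyAccumulator(0)


def parse(line):
    """Parse an integer, counting lines that fail instead of raising."""
    try:
        return [int(line)]
    except ValueError:
        invalid.add(1)
        return []


lines = ["1", "2", "oops", "4", ""]
# Flatten the per-line results, which drops the lines that failed to parse.
parsed = [n for line in lines for n in parse(line)]
```

After the loop, `parsed` holds the valid integers and `invalid.value` holds the number of rejected lines.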

