Friday, February 26, 2016

Spark Join and number of partitions

As we all know, the number of partitions plays an important role in Apache Spark RDDs.
We may need to pre-calculate the number of partitions we expect after RDD operations.
We can change the partition count using coalesce (to reduce it without a shuffle) or repartition (to increase it, with a full shuffle).
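A quick sketch of this in the Spark shell (assuming a running `SparkContext` named `sc`, as in `spark-shell`):

```scala
// Create an RDD with an explicit partition count
val rdd = sc.parallelize(1 to 100, 8)
println(rdd.getNumPartitions)        // 8

// coalesce can only reduce the number of partitions; it avoids a shuffle
val fewer = rdd.coalesce(4)
println(fewer.getNumPartitions)      // 4

// repartition can increase the count, at the cost of a full shuffle
val more = rdd.repartition(16)
println(more.getNumPartitions)       // 16
```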

But what happens when we perform an RDD join operation? Since we are joining two different RDDs, what will be the number of partitions of the result?

The answer depends on which API you use.

For plain RDD joins, the result takes the partitioner of a parent RDD if one exists; otherwise it uses `spark.default.parallelism` if that is set, and failing that, the largest partition count among the parents. You can also pass the desired partition count directly to `join`.

For DataFrame and Spark SQL joins, the number of shuffle partitions is controlled by `spark.sql.shuffle.partitions`. You can set it to customize the result. The default value is 200.
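For plain RDD joins, a quick way to observe this in the shell (again assuming a `SparkContext` named `sc`):

```scala
val a = sc.parallelize(Seq((1, "a"), (2, "b")), 4)
val b = sc.parallelize(Seq((1, "x"), (2, "y")), 6)

// Neither parent has a partitioner here, so the result uses
// spark.default.parallelism if set, otherwise the largest parent count
println(a.join(b).getNumPartitions)

// You can also fix the result's partition count explicitly
println(a.join(b, 12).getNumPartitions) // 12
```

The first `println` is environment-dependent (it varies with your cluster's default parallelism), which is exactly why pinning the count explicitly, as in the second call, is often the safer choice.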

Property Name: spark.sql.shuffle.partitions
Default: 200
Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.
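To see this setting take effect on a DataFrame join, here is a sketch using the Spark 2.x `SparkSession` API (named `spark` here; on Spark 1.x you would set the same property on `sqlContext`). Broadcast joins skip the shuffle entirely, so the example disables them to force a shuffled join:

```scala
// Disable auto-broadcast so the join actually shuffles
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// Ask for 50 shuffle partitions instead of the default 200
spark.conf.set("spark.sql.shuffle.partitions", "50")

val left  = spark.range(100).withColumnRenamed("id", "k")
val right = spark.range(100).withColumnRenamed("id", "k")

val joined = left.join(right, "k")
println(joined.rdd.getNumPartitions) // 50: one per shuffle partition
```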
