Friday, February 26, 2016

Spark Join and number of partitions

As we all know, the number of partitions plays an important role in an Apache Spark RDD.
We may need to pre-calculate the number of partitions we expect after RDD operations.
We can change the partition count using coalesce or repartition.
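As a quick sketch (assuming a running SparkContext named `sc`), coalesce and repartition change the partition count like this:

```scala
// Minimal sketch; assumes a SparkContext `sc` is already available.
val rdd = sc.parallelize(1 to 1000, 8)   // start with 8 partitions
println(rdd.getNumPartitions)            // 8

val fewer = rdd.coalesce(2)              // shrink, avoiding a full shuffle
println(fewer.getNumPartitions)          // 2

val more = rdd.repartition(16)           // grow (always shuffles)
println(more.getNumPartitions)           // 16
```

Note that coalesce can only reduce the partition count without a shuffle; to increase it you need repartition, which triggers a full shuffle.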

But what happens when we do a join operation? Since we are joining two different datasets, what will be the number of partitions of the result?

The answer is:

For Spark SQL / DataFrame joins, the number depends on `spark.sql.shuffle.partitions`. You can set this property to customize it; the default value is 200. (For plain RDD joins, the result instead takes the largest number of partitions among the parent RDDs, or `spark.default.parallelism` if it is set.)


Property Name: spark.sql.shuffle.partitions
Default: 200
Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.



http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
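As a sketch of how this setting shows up in practice (assuming the Spark 2.x+ SparkSession API with a session named `spark`, and two small hypothetical DataFrames), the join result inherits the configured shuffle partition count:

```scala
// Minimal sketch; assumes a SparkSession `spark` is already available.
import spark.implicits._

// Lower the shuffle partition count from the default of 200.
spark.conf.set("spark.sql.shuffle.partitions", "50")

val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

// The join shuffles both sides into spark.sql.shuffle.partitions partitions.
val joined = left.join(right, "id")
println(joined.rdd.getNumPartitions)
```

On recent Spark versions, adaptive query execution (enabled by default since Spark 3.2) may coalesce these shuffle partitions further at runtime, so the observed count can be lower than the configured value.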
