Friday, February 26, 2016

Spark Join and number of partitions

As we all know, the number of partitions plays an important role in Apache Spark RDDs.
We often need to know in advance how many partitions an RDD will have after an operation.
We can reduce the number of partitions using `coalesce` (or change it freely with `repartition`).

But what happens when we do a join operation? Since we are joining two different datasets, how many partitions will the result have?

The answer is:

The number depends on `spark.sql.shuffle.partitions`. You can set this property to customize it; the default value is 200. (This property governs shuffles for joins and aggregations in Spark SQL / DataFrames.)


| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |



http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
