Sunday, February 28, 2016

Apache Spark with large data sets - Possible problems and solutions

Each issue below is listed as problem → solution.

Driver OOM while using reduceByKey → decrease spark.default.parallelism
Java killed by the OOM killer on a slave node → change the formula in /root/spark-ec2/ on the master node: spark_mb = system_ram_mb * 4 // 5
SPARK-4808 Spark fails to spill with small number of large objects → update Spark to 1.4.1
SPARK-5077 Map output statuses can still exceed spark.akka.frameSize → set("spark.akka.frameSize", "128")
SPARK-6246 spark-ec2 can't handle clusters with > 100 nodes → apply the patch to the deploy script
SPARK-6497 Class is not registered: scala.reflect.ManifestFactory$$anon$9 → don't force Kryo class registration
HADOOP-6254 s3n fails with SocketTimeoutException → use the S3A file system in Hadoop 2.7.1
S3 HEAD request failed for ... - ResponseCode=403, ResponseMessage=Forbidden → same fix as above
gzip input files are not splittable (we need to save storage space by archiving datasets, and the default choice, gzip, doesn't allow random read access; this is not a bug, but it greatly increases the probability of hitting the other issues and degrades performance) → use bzip2 compression for input files
HADOOP-7823 port HADOOP-4012 to branch-1 (splitting support for bzip2) → update to Hadoop 2.7.1
HADOOP-10614 CBZip2InputStream is not threadsafe → same fix as above
HADOOP-10400 Incorporate new S3A FileSystem implementation → same fix as above
HADOOP-11571 Über-jira: S3a stabilisation phase I → same fix as above
SPARK-6668 repeated asking to remove non-existent executor → install Hadoop 2.7.1 native libraries
SPARK-5348 s3a:// protocol and hadoop-aws dependency → build Spark with the patch
Stack Overflow - How to access s3a:// files from Apache Spark? → --conf spark.hadoop.fs.s3a.access.key=... --conf spark.hadoop.fs.s3a.secret.key=...
HADOOP-9565 Add a Blobstore interface to add to blobstore FileSystems → use DirectOutputCommitter
Timeout waiting for connection from pool → conf.setInt("fs.s3a.connection.maximum", 100)
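Since bzip2 is splittable and gzip is not, switching the archive format lets Spark fan a single compressed file out across many tasks. A minimal sketch of the read/write side, assuming Spark 1.4+ on Hadoop 2.7.1; the bucket and paths are placeholders:

```scala
import org.apache.hadoop.io.compress.BZip2Codec
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("bzip2-io"))

// Files ending in .bz2 are decompressed transparently, and each file
// can be split across multiple tasks (a .gz file forces one task per file).
val lines = sc.textFile("s3a://my-bucket/input/*.bz2")

// Write results back compressed with bzip2 so downstream jobs can
// also read them in parallel.
lines.saveAsTextFile("s3a://my-bucket/output", classOf[BZip2Codec])
```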
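The configuration-level fixes above can be gathered into a single SparkConf. A sketch, assuming Spark 1.4.1 built with the s3a patch; the credential values are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("large-dataset-job")
  // Larger Akka frames so map output statuses fit (SPARK-5077).
  .set("spark.akka.frameSize", "128")
  // Don't force Kryo class registration (SPARK-6497).
  .set("spark.kryo.registrationRequired", "false")
  // S3A credentials (placeholders), passed through to the Hadoop config.
  .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
  .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
  // Larger pool to avoid "Timeout waiting for connection from pool".
  .set("spark.hadoop.fs.s3a.connection.maximum", "100")
  // If the driver OOMs in reduceByKey, also tune down
  // spark.default.parallelism here.

val sc = new SparkContext(conf)
```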
