Sunday, February 28, 2016

Apache Spark with large data sets - Possible problems and solutions

Each issue below is listed as problem → solution.

Driver OOM while using reduceByKey → decrease spark.default.parallelism
Java killed by the OOM killer on a slave node → change the formula in /root/spark-ec2/ on the master node: spark_mb = system_ram_mb * 4 // 5
SPARK-4808 Spark fails to spill with small number of large objects → update Spark to 1.4.1
SPARK-5077 Map output statuses can still exceed spark.akka.frameSize → set("spark.akka.frameSize", "128")
SPARK-6246 spark-ec2 can't handle clusters with > 100 nodes → apply the patch to the deploy script
SPARK-6497 Class is not registered: scala.reflect.ManifestFactory$$anon$9 → don't force Kryo class registration
HADOOP-6254 s3n fails with SocketTimeoutException → use the S3A file system in Hadoop 2.7.1
S3 HEAD request failed for ... - ResponseCode=403, ResponseMessage=Forbidden → same fix as above
gzip input files are not splittable (we need to save storage space by archiving datasets, and the default choice, gzip, doesn't allow random read access; this is not a bug, but it greatly increases the probability of hitting the other issues and degrades performance) → use bzip2 compression for input files
HADOOP-7823 port HADOOP-4012 to branch-1 (splitting support for bzip2) → update to Hadoop 2.7.1
HADOOP-10614 CBZip2InputStream is not threadsafe → same fix as above
HADOOP-10400 Incorporate new S3A FileSystem implementation → same fix as above
HADOOP-11571 Über-jira: S3a stabilisation phase I → same fix as above
SPARK-6668 repeated asking to remove non-existent executor → install Hadoop 2.7.1 native libraries
SPARK-5348 s3a:// protocol and hadoop-aws dependency → build Spark with the patch
Stack Overflow - How to access s3a:// files from Apache Spark? → --conf spark.hadoop.fs.s3a.access.key=... --conf spark.hadoop.fs.s3a.secret.key=...
HADOOP-9565 Add a Blobstore interface to add to blobstore FileSystems → use DirectOutputCommitter
Timeout waiting for connection from pool → conf.setInt("fs.s3a.connection.maximum", 100)
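Since bzip2 is splittable and gzip is not, switching the archive format lets Spark fan a single compressed file out across many tasks. A minimal sketch of the read/write side, assuming Spark 1.4+ on Hadoop 2.7.1; the bucket and paths are placeholders:

```scala
import org.apache.hadoop.io.compress.BZip2Codec
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("bzip2-io"))

// Files ending in .bz2 are decompressed transparently, and each file
// can be split across multiple tasks (a .gz file forces one task per file).
val lines = sc.textFile("s3a://my-bucket/input/*.bz2")

// Write results back compressed with bzip2 so downstream jobs can
// also read them in parallel.
lines.saveAsTextFile("s3a://my-bucket/output", classOf[BZip2Codec])
```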
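The configuration-level fixes above can be gathered into a single SparkConf. A sketch, assuming Spark 1.4.1 built with the s3a patch; the credential values are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("large-dataset-job")
  // Larger Akka frames so map output statuses fit (SPARK-5077).
  .set("spark.akka.frameSize", "128")
  // Don't force Kryo class registration (SPARK-6497).
  .set("spark.kryo.registrationRequired", "false")
  // S3A credentials (placeholders), passed through to the Hadoop config.
  .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
  .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
  // Larger pool to avoid "Timeout waiting for connection from pool".
  .set("spark.hadoop.fs.s3a.connection.maximum", "100")
  // If the driver OOMs in reduceByKey, also tune down
  // spark.default.parallelism here.

val sc = new SparkContext(conf)
```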
