Sunday, February 28, 2016

Apache Spark with large data sets - Possible problems and solutions

Issue | Workaround
Driver OOM while using reduceByKey | decrease spark.default.parallelism
Java killed by OOM killer on a slave node | change the formula in /root/spark-ec2/deploy_templates.py on the master node: spark_mb = system_ram_mb * 4 // 5
SPARK-4808 Spark fails to spill with small number of large objects | update Spark to 1.4.1
SPARK-5077 Map output statuses can still exceed spark.akka.frameSize | set("spark.akka.frameSize", "128")
SPARK-6246 spark-ec2 can't handle clusters with > 100 nodes | apply the patch for the deploy script
SPARK-6497 Class is not registered: scala.reflect.ManifestFactory$$anon$9 | don't force Kryo class registration
HADOOP-6254 s3n fails with SocketTimeoutException | use the S3A file system in Hadoop 2.7.1
S3 HEAD request failed for ... - ResponseCode=403, ResponseMessage=Forbidden | the same
gzip input files are not splittable (we need to save storage space by archiving datasets, and the default choice, gzip, doesn't allow random read access; this is not a bug, but it greatly increases the probability of hitting the other issues and degrades performance) | use bzip2 compression for input files
HADOOP-7823 port HADOOP-4012 to branch-1 (splitting support for bzip2) | update to Hadoop 2.7.1
HADOOP-10614 CBZip2InputStream is not threadsafe | the same
HADOOP-10400 Incorporate new S3A FileSystem implementation | the same
HADOOP-11571 Über-jira: S3a stabilisation phase I | the same
SPARK-6668 repeated asking to remove non-existent executor | install Hadoop 2.7.1 native libraries
SPARK-5348 s3a:// protocol and hadoop-aws dependency | build Spark with the patch
Stack Overflow - How to access s3a:// files from Apache Spark? | --conf spark.hadoop.fs.s3a.access.key=... --conf spark.hadoop.fs.s3a.secret.key=...
HADOOP-9565 Add a Blobstore interface to add to blobstore FileSystems | use DirectOutputCommitter
Timeout waiting for connection from pool | conf.setInt("fs.s3a.connection.maximum", 100)
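Several of these workarounds are pure configuration. As a rough illustration (not from the original post), here is a minimal Scala sketch that applies the configuration-level fixes from the table when building a SparkContext; the app name, bucket, and credential values are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values throughout; substitute your own credentials and paths.
val conf = new SparkConf()
  .setAppName("large-dataset-job")
  // SPARK-5077: raise the Akka frame size so large map output statuses fit
  .set("spark.akka.frameSize", "128")
  // SPARK-6497: use Kryo, but don't require every class to be registered
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "false")
  // S3A credentials (the Stack Overflow workaround); spark.hadoop.* keys are
  // copied into the Hadoop configuration by Spark
  .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
  .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
  // "Timeout waiting for connection from pool": widen the S3A connection pool
  .set("spark.hadoop.fs.s3a.connection.maximum", "100")

val sc = new SparkContext(conf)

// bzip2 is splittable, so one large .bz2 file can still produce many
// partitions; a .gz file of the same size would come back as a single
// partition per file.
val lines = sc.textFile("s3a://my-bucket/input/*.bz2")
println(lines.partitions.length)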

Courtesy: http://tech.grammarly.com/blog/posts/Petabyte-Scale-Text-Processing-with-Spark.html

Friday, February 26, 2016

Spark Join and number of partitions

As we all know, the number of partitions plays an important role in an Apache Spark RDD.
We may need to pre-calculate the number of partitions we expect after RDD operations.
We can change the partition count using coalesce.

But what happens when we do an RDD join operation? Since we are joining two different RDDs, what will be the number of partitions of the result?

The answer is:

The number depends on `spark.sql.shuffle.partitions`. You can set this property to customize the behavior; the default value is 200. (Strictly speaking, this property governs joins and aggregations that go through Spark SQL / DataFrames; for a plain RDD join, Spark reuses a parent RDD's partitioner if one exists, and otherwise falls back to `spark.default.parallelism` or the largest parent's partition count.)


Property Name | Default | Meaning
spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations.



http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
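As a quick way to see this yourself, here is a minimal sketch (the local-mode setup and column names are mine, not from the Spark docs) that joins two DataFrames, checks the partition count of the result, and then trims it with coalesce:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("join-partitions"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Lower the shuffle partition count from the default 200 to 50
sqlContext.setConf("spark.sql.shuffle.partitions", "50")

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).toDF("id", "name")
val scores = sc.parallelize(Seq((1, 90), (2, 75))).toDF("id", "score")

// The shuffle behind this join produces spark.sql.shuffle.partitions partitions
val joined = users.join(scores, "id")
println(joined.rdd.partitions.length) // 50

// coalesce can shrink the result without another full shuffle
println(joined.coalesce(8).rdd.partitions.length) // 8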

Wednesday, February 24, 2016

World's first open source memory-centric distributed storage system Alluxio (formerly known as Tachyon) released v1.0 on Feb 23, 2016



Alluxio is the world's first open source memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster jobs, possibly written in different computation frameworks, such as Apache Spark, Apache MapReduce, and Apache Flink. In the big data ecosystem, Alluxio lies between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS, Ceph, or OSS.



"Today, we are very excited to announce the 1.0 release of Alluxio, the world’s first memory-centric virtual distributed storage system, which unifies data access and bridges computation frameworks and underlying storage systems. Applications only need to connect with Alluxio to access data stored in any underlying storage systems. Additionally, Alluxio’s memory-centric architecture enables data access orders of magnitude faster than existing solutions.
Now, organizations can run any computation framework (e.g. Apache Spark, Apache MapReduce, Apache Flink, etc.) with any storage system (e.g. Alibaba OSS, Amazon S3, OpenStack Swift, GlusterFS, Ceph, etc.), leveraging any storage media (e.g. DRAM, SSD, HDD, etc.)," wrote Haoyuan Li, CEO of Alluxio, on his blog.
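To make the "connect once, reach any storage" idea concrete, here is a minimal hedged sketch of reading and writing through Alluxio from Spark. It assumes an Alluxio 1.0 master running on localhost at the default port 19998 and the Alluxio client jar on Spark's classpath; the paths and file names are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("alluxio-demo"))

// Depending on your setup, Hadoop may need to be told how to resolve the
// alluxio:// scheme (assumption: Alluxio 1.x client class name)
sc.hadoopConfiguration.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")

// Spark reads through Alluxio's Hadoop-compatible FileSystem API exactly as
// it would read from HDFS or S3; Alluxio resolves the path against whatever
// under-storage (S3, HDFS, GlusterFS, ...) is mounted beneath it.
val lines = sc.textFile("alluxio://localhost:19998/data/input.txt")
println(lines.count())

// Writing back through Alluxio lands the data in memory first, then
// (depending on the configured write type) in the under-storage.
lines.saveAsTextFile("alluxio://localhost:19998/data/output")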


This graph shows how many people are involved in the project compared with other open source projects.


Get ready to tackle Big Data with Alluxio at "tachyon" speed. :)

http://www.alluxio.org/


I have already written a few blog posts about Tachyon:

http://devslogics.blogspot.com/2015/12/tachyon-cluster-configuration-setup.html
http://devslogics.blogspot.in/2015/12/tachyon-was-not-formatted.html