DevsLogics: December 2015

Thursday, December 17, 2015

Tachyon Cluster Configuration Setup Manual

Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, thereby avoiding going to disk to load datasets that are frequently read. This enables different jobs/queries and frameworks to access cached files at memory speed.

On the Master Node

wget http://tachyon-project.org/downloads/files/0.7.1/tachyon-0.7.1-bin.tar.gz
tar -xvzf tachyon-0.7.1-bin.tar.gz
cd tachyon-0.7.1
cp conf/tachyon-env.sh.template conf/tachyon-env.sh
vi conf/workers    # add the IP address of every worker node here, one per line (the default entry is localhost)
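
As a rough sketch, the conf/workers file can also be written non-interactively instead of through vi. The IP addresses below are hypothetical placeholders, and a scratch directory stands in for the tachyon-0.7.1 directory so the example is safe to run anywhere.

```shell
# A scratch directory stands in for the tachyon-0.7.1 directory (assumption for this demo).
TACHYON_DIR="$(mktemp -d)"
mkdir -p "$TACHYON_DIR/conf"

# Write one worker address per line, replacing the default "localhost" entry.
# The three addresses are hypothetical; use the IPs of your own worker nodes.
printf '%s\n' 192.168.1.11 192.168.1.12 192.168.1.13 > "$TACHYON_DIR/conf/workers"

# Show the resulting file.
cat "$TACHYON_DIR/conf/workers"
```

In a real installation, point TACHYON_DIR at the extracted tachyon-0.7.1 directory on the master node.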

On Each Slave (Worker) Node

wget http://tachyon-project.org/downloads/files/0.7.1/tachyon-0.7.1-bin.tar.gz
tar -xvzf tachyon-0.7.1-bin.tar.gz
cd tachyon-0.7.1
./bin/tachyon bootstrap-conf <tachyon_master_hostname>

(This script needs to be run on each node you wish to configure. It will configure your workers to use 2/3 of the total memory on each worker.)

Back on the Master Node

sudo ./bin/tachyon format
sudo ./bin/tachyon-start.sh all Mount
Now open the Tachyon web UI at http://masterIP:19999/home to check that the cluster is up.

HDFS as underFS (Tachyon can run on top of different underlying storage systems)

By default, Tachyon is built against HDFS version 1.0.4. To use another Hadoop version, change the hadoop.version tag in Tachyon's pom.xml and recompile. You can also set the Hadoop version on the command line when compiling with Maven:

$ mvn -Dhadoop.version=2.2.0 clean package

After the build completes, edit the conf/tachyon-env.sh file and set TACHYON_UNDERFS_ADDRESS:

TACHYON_UNDERFS_ADDRESS=hdfs://HDFS_HOSTNAME:HDFS_PORT
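
To make the placeholder concrete, here is a minimal sketch of how the value could be assembled. The hostname and port are hypothetical, and the `export` form is an assumption about how the variable is declared in tachyon-env.sh, so check it against your own copy of the file.

```shell
# Hypothetical NameNode host and RPC port; replace with the values of your HDFS cluster.
HDFS_HOSTNAME="namenode.example.com"
HDFS_PORT=9000

# Build the under-filesystem address in the hdfs://host:port form shown above.
export TACHYON_UNDERFS_ADDRESS="hdfs://${HDFS_HOSTNAME}:${HDFS_PORT}"

echo "$TACHYON_UNDERFS_ADDRESS"
```

Note that there is no trailing period or slash after the port.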

That's all.

=======================================================================

Possible Errors :

http://devslogics.blogspot.in/2015/12/tachyon-was-not-formatted.html

For more : http://tachyon-project.org/documentation/v0.7.1/Running-Tachyon-on-a-Cluster.html

Tachyon was not formatted!

When you use HDFS as your underFS address, you may run into this error.

Inspecting master.log shows an error like the following.

devan@Dev-ThinkPad-X230:~/tachyon-0.7.1$ tailf logs/master.log

2015-12-18 10:14:34,571 ERROR MASTER_LOGGER (TachyonMaster.java:main) - Uncaught exception terminating Master
java.lang.IllegalStateException: Tachyon was not formatted! The journal folder is $TACHYON_HOME/journal/
at com.google.common.base.Preconditions.checkState(Preconditions.java:149)
at tachyon.master.TachyonMaster.<init>(TachyonMaster.java:151)
at tachyon.master.TachyonMaster.main(TachyonMaster.java:63)

Solution

Delete the temporary files that Tachyon created on your underFS HDFS server.

hadoop fs -rm -r /tmp/tachyon/

Note: Tachyon creates a directory structure like the following on the underFS HDFS server:

/tmp/tachyon/data/1
/tmp/tachyon/data/2
....
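
The cleanup above can be wrapped in a small script that first checks whether the directory exists. This is only a sketch: the function name clean_tachyon_underfs is made up for illustration, and the script assumes the hadoop CLI is on the PATH of the machine where it runs.

```shell
# Directory from the post where Tachyon stores its data on the underFS HDFS server.
TACHYON_UNDERFS_DIR="/tmp/tachyon"

# Hypothetical helper: remove the directory only if it exists,
# so the script can be re-run without failing.
clean_tachyon_underfs() {
  if hadoop fs -test -d "$TACHYON_UNDERFS_DIR"; then
    hadoop fs -rm -r "$TACHYON_UNDERFS_DIR"
  else
    echo "$TACHYON_UNDERFS_DIR not found on HDFS; nothing to remove"
  fi
}

# Run the cleanup only where the hadoop CLI is installed.
if command -v hadoop >/dev/null 2>&1; then
  clean_tachyon_underfs
else
  echo "hadoop CLI not found; skipping cleanup"
fi
```

Because `hadoop fs -rm -r` deletes recursively, double-check the path before running this against a shared cluster.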

Wednesday, December 16, 2015

permission denied for root@localhost for ssh connection

Reason :

SSH server denies password-based login for root by default.

Solution:

In /etc/ssh/sshd_config, change:

PermitRootLogin without-password

to:

PermitRootLogin yes

And restart SSH:

sudo service ssh restart
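
The edit itself can be sketched with sed. To keep the example safe to run, it works on a scratch file with sample contents rather than the real /etc/ssh/sshd_config; the sample lines other than PermitRootLogin are placeholders.

```shell
# Scratch copy with sample contents; the real file is /etc/ssh/sshd_config.
CONFIG="$(mktemp)"
printf '%s\n' 'Port 22' 'PermitRootLogin without-password' 'PasswordAuthentication yes' > "$CONFIG"

# Rewrite the active PermitRootLogin line so root may log in with a password.
# Writing to a new file and moving it avoids the GNU/BSD differences of "sed -i".
sed 's/^PermitRootLogin[[:space:]].*/PermitRootLogin yes/' "$CONFIG" > "$CONFIG.new"
mv "$CONFIG.new" "$CONFIG"

# Show the resulting setting.
grep '^PermitRootLogin' "$CONFIG"
```

On a real system, back up sshd_config first, run the same sed expression against it with sudo, and then restart SSH as shown above.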

Monday, December 7, 2015

Column renaming after DataFrame.groupBy and agg

In the following code, the column name is "SUM(_1#179)". Is there a way to rename it to something friendlier?

scala> val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10)))
scala> d.groupBy("_1").sum().printSchema
root
|-- _1: integer (nullable = false)
|-- SUM(_1#179): long (nullable = true)
|-- SUM(_2#180): long (nullable = true)

http://apache-spark-user-list.1001560.n3.nabble.com/Column-renaming-after-DataFrame-groupBy-td22586.html

A simple way to achieve this is the toDF() function, which returns a new DataFrame with the columns renamed in order.

scala> val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10)))
scala> d.groupBy("_1").sum().toDF("a","b","c").printSchema

root
|-- a: integer (nullable = false)
|-- b: long (nullable = true)
|-- c: long (nullable = true)

Thursday, December 3, 2015

IntelliJ IDEA : Error:scalac: Output path is shared between:Module .... , Output path is shared between: Module .... Please configure separate output paths to proceed with the compilation.

You may run into this error while working with IntelliJ IDEA:

Output path is shared between:Module .... , Output path is shared between: Module .... Please configure separate output paths to proceed with the compilation.

TIP: you can use Project Artifacts to combine compiled classes if needed.

Cause

This happens when multiple modules in the same IntelliJ IDEA project share the same output path.

Solution

Remove the other modules so that only a single module file remains in the IntelliJ IDEA project.

Delete every .iml file that is not named after your current project. Stale module files can be left behind when a project is refactored into another one; in this case there were two modules, SparkCDH and sparkcdh. Do not delete the .iml file whose name ends in "-build".

After that, you can run the project without the shared output path error.
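
To find the stale module files before deleting anything, a find command along these lines may help. It is a sketch under assumptions: the directory layout and the choice of SparkCDH as the current project name are hypothetical, and a scratch directory with dummy files stands in for a real project.

```shell
# Hypothetical project layout in a scratch directory; adjust both values for a real project.
PROJECT_DIR="$(mktemp -d)"
PROJECT_NAME="SparkCDH"
mkdir -p "$PROJECT_DIR/.idea/modules"
touch "$PROJECT_DIR/${PROJECT_NAME}.iml" \
      "$PROJECT_DIR/.idea/modules/sparkcdh.iml" \
      "$PROJECT_DIR/.idea/modules/${PROJECT_NAME}-build.iml"

# List .iml files that are neither the project's own module file nor a "-build" module.
STALE_IML="$(find "$PROJECT_DIR" -name '*.iml' \
  ! -name "${PROJECT_NAME}.iml" ! -name '*-build.iml')"

# Review this list by hand before deleting any of the files.
echo "$STALE_IML"
```

The script only lists candidates; removing them is left as a manual step so that nothing is deleted by accident.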