Thursday, December 17, 2015

Tachyon Cluster Configuration Setup Manual


Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, thereby avoiding going to disk to load datasets that are frequently read. This enables different jobs/queries and frameworks to access cached files at memory speed.


In Master Node

In Slaves


(This script needs to be run on each node you wish to configure. It will configure each worker to use 2/3 of its total memory.)

In Master Node


HDFS as underFS (Tachyon can run on top of different underlying storage systems)

By default, Tachyon is set to use HDFS version 1.0.4. You can use another Hadoop version by changing the hadoop.version tag in Tachyon's pom.xml and recompiling it. You can also set the Hadoop version when compiling with Maven:

  • $ mvn -Dhadoop.version=2.2.0 clean package

After completing this,

  • Edit the tachyon-env.sh file and set TACHYON_UNDERFS_ADDRESS:

TACHYON_UNDERFS_ADDRESS=hdfs://HDFS_HOSTNAME:HDFS_PORT

That's all.
=======================================================================

Possible Errors:



For more: http://tachyon-project.org/documentation/v0.7.1/Running-Tachyon-on-a-Cluster.html

Tachyon was not formatted!

When you are using HDFS as your underFS address, you may face this error.

Inspecting master.log shows an error log like the following.

devan@Dev-ThinkPad-X230:~/tachyon-0.7.1$ tailf logs/master.log 

2015-12-18 10:14:34,571 ERROR MASTER_LOGGER (TachyonMaster.java:main) - Uncaught exception terminating Master
java.lang.IllegalStateException: Tachyon was not formatted! The journal folder is $TACHYON_HOME/journal/
at com.google.common.base.Preconditions.checkState(Preconditions.java:149)
at tachyon.master.TachyonMaster.<init>(TachyonMaster.java:151)
at tachyon.master.TachyonMaster.main(TachyonMaster.java:63)

Solution

You need to delete the HDFS temporary files created by Tachyon on your underFS HDFS server.

hadoop fs -rm -r /tmp/tachyon/

Note: Tachyon creates a directory structure like the following on the underFS HDFS server:

/tmp/tachyon/data/1
/tmp/tachyon/data/2
....

Wednesday, December 16, 2015

Permission denied for root@localhost for SSH connection

Reason:

The SSH server denies password-based login for root by default.

Solution:

In /etc/ssh/sshd_config, change:
PermitRootLogin without-password
to
PermitRootLogin yes
And restart SSH:
sudo service ssh restart

Monday, December 7, 2015

Column renaming after DataFrame.groupBy and agg


In the following code the column name is "SUM(_1#179)". Is there a way to rename it to a friendlier name?

scala> val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10)))

scala> d.groupBy("_1").sum().printSchema
root
 |-- _1: integer (nullable = false)
 |-- SUM(_1#179): long (nullable = true)
 |-- SUM(_2#180): long (nullable = true)


http://apache-spark-user-list.1001560.n3.nabble.com/Column-renaming-after-DataFrame-groupBy-td22586.html




The simple way to achieve this is to use the toDF() function.

scala> val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10)))
scala> d.groupBy("_1").sum().toDF("a","b","c").printSchema


root
 |-- a: integer (nullable = false)
 |-- b: long (nullable = true)
 |-- c: long (nullable = true)
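
You can also name the aggregate columns explicitly with agg and alias. A minimal sketch, assuming the same DataFrame d as above and Spark 1.3+ (the names sum_1 and sum_2 are just illustrative):

scala> import org.apache.spark.sql.functions.sum
scala> d.groupBy("_1").agg(sum("_1").as("sum_1"), sum("_2").as("sum_2")).printSchema

This should print a schema like:

root
 |-- _1: integer (nullable = false)
 |-- sum_1: long (nullable = true)
 |-- sum_2: long (nullable = true)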

Thursday, December 3, 2015

IntelliJ IDEA : Error: scalac: Output path is shared between: Module ...., Output path is shared between: Module .... Please configure separate output paths to proceed with the compilation.

You may be challenged by this error while working with IntelliJ IDEA:

Output path is shared between: Module ...., Output path is shared between: Module .... Please configure separate output paths to proceed with the compilation.
TIP: you can use Project Artifacts to combine compiled classes if needed.



Cause 

This happens when multiple modules exist in the same IntelliJ IDEA project.

Solution

Remove the other modules and keep a single module (.iml) file in the IntelliJ IDEA project.




Delete all the .iml files that do not match your project name (the project may have been refactored from another name; here two modules are available, SparkCDH and sparkcdh). Do not delete the .iml file with "-build" in its name. The result will look like the following.




Now you can run the project without the shared output path error.





Wednesday, November 25, 2015

Hive Auto increment UDF in Apache Spark

Not all the functions we need are supported or provided by Hive by default, for example an auto-increment (row number) column in a select query. Hive supports User Defined Functions (UDFs), so users can create custom UDFs for their own use.

In this blog I am going to describe how to write an auto-increment UDF for Hive and for Spark's Hive support. (The query should run with a single mapper; otherwise it will not work properly.)

Code Snippet for Hive : 

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

@UDFType(deterministic = false, stateful = true)
public class AutoIncrUdf extends UDF {
    // Last value handed out; state is kept per mapper, which is why a single mapper is required.
    int lastValue;

    public int evaluate() {
        lastValue++;
        return lastValue;
    }
}
After registering this class as a UDF function*, you can use the function name as shown below:
hive> SELECT incr(),* FROM t1;
*For more about how to create the jar and use it in Hive, please refer to: http://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/


Code Snippet for Spark Hive :

var i = 0 // define one variable for the increment
sqlContext.udf.register("incr", () => {
  i = i + 1
  i
})
After this code (registering the UDF function) you can use the function in a Hive query:

sqlContext.sql("select incr(),* from tableName")

Spark 1.4+ PermGenSize Error - IntelliJ IDEA

Error : 


java.lang.OutOfMemoryError: PermGen space
Stopping spark context.
Exception in thread "main" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"


Solution : 

Add the following to the VM options.

Go to Run => Edit Configurations => VM options and add the following line (I am giving a 1 GB max heap and 512 MB MaxPermSize):

 -Xmx1024m -XX:MaxPermSize=512m -Xms512m





That's it!

Wednesday, October 21, 2015

Apache Mesos single node cluster install on Ubuntu server using apt-get

Apache Mesos is a distributed scheduling framework that allows us to build fault-tolerant distributed systems. It pools your infrastructure, automatically allocating resources and scheduling tasks based on demand and policy.

This blog post describes configuring and installing Mesos using a simple apt-get install.

1. Install requirements using the following commands





sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
CODENAME=$(lsb_release -cs)
echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | 
sudo tee /etc/apt/sources.list.d/mesosphere.list
sudo apt-get update
sudo apt-get install mesos

2. Run Mesos Master using the following command




sudo mesos-master --ip=GIVE_MASTER_IP --work_dir=/tmp/mesos

3. Run Mesos Slave using the following commands

sudo mesos-slave --master=GIVE_MASTER_IP:5050 --resources='cpus:4;mem:8192;disk:409600;'

You can check whether the cluster started by pointing your web browser at MASTER_IP:5050. If you see 1 under “Slaves -> Activated”, then you have a single-node cluster running.




Wednesday, October 7, 2015

ActiveMQ MQTT broker Websockets support and Paho JavaScript Client integration

Hi all, this blog post is about configuring the ActiveMQ MQTT broker to enable Websockets and subscribing to MQTT over Websockets from the Paho JavaScript client. Follow these steps:

1. Configure ActiveMQ MQTT for Websocket support

a) Add the following line to the activemq.xml file in your Apache ActiveMQ conf directory:

<transportConnector name="ws" uri="ws://0.0.0.0:1884?maximumConnections=1000&amp;wireFormat.maxFrameSize=104857600"/>
b) Restart ActiveMQ. This will open port 1884 on your MQTT broker.

2. Download the mqttws31.js file.


3. Create a new file "config.js":

host = '127.0.1.1'; // hostname or IP address
port = 1884;
topic = 'mqtt/devTest';  // topic to subscribe to
useTLS = false;
username = null;
password = null;
// username = "devan";
// password = "devan123";
cleansession = true;


4. Download jquery.min.js 

5. Create an HTML file "mqttcheck.html":


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>MQTT Websockets</title>
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="mqttws31.js" type="text/javascript"></script>
    <script src="jquery.min.js" type="text/javascript"></script>
    <script src="config.js" type="text/javascript"></script>

    <script type="text/javascript">
    var mqtt;
    var reconnectTimeout = 2000;

    function MQTTconnect() {
        mqtt = new Paho.MQTT.Client(
                        host,
                        port,
                        "web_" + parseInt(Math.random() * 100,
                        10));
        var options = {
            timeout: 3,
            useSSL: useTLS,
            cleanSession: cleansession,
            onSuccess: onConnect,
            onFailure: function (message) {
                $('#status').val("Connection failed: " + message.errorMessage + "Retrying");
                setTimeout(MQTTconnect, reconnectTimeout);
            }
        };

        mqtt.onConnectionLost = onConnectionLost;
        mqtt.onMessageArrived = onMessageArrived;

        if (username != null) {
            options.userName = username;
            options.password = password;
        }
        console.log("Host="+ host + ", port=" + port + " TLS = " + useTLS + " username=" + username + " password=" + password);
        mqtt.connect(options);
    }

    function onConnect() {
        $('#status').val('Connected to ' + host + ':' + port);
        // Connection succeeded; subscribe to our topic
        mqtt.subscribe(topic, {qos: 0});
        $('#topic').val(topic);
    }

    function onConnectionLost(responseObject) {
        setTimeout(MQTTconnect, reconnectTimeout);
        $('#status').val("connection lost: " + responseObject.errorMessage + ". Reconnecting");
    };

    function onMessageArrived(message) {
        var topic = message.destinationName;
        var payload = message.payloadString;
        document.getElementById("ws").innerHTML = 'Topic = ' + topic + '<br> Payload = ' + payload;
    };
    $(document).ready(function() {
        MQTTconnect();
    });

    </script>
  </head>
  <body>
<center>
    <h1>MQTT Websockets</h1>
    <div>
        <div>Subscribed to <input type='text' id='topic' disabled />
        Status: <input type='text' id='status'  disabled /></div>
<h2>
        <div id='ws'> </div>
 </h2>
   </div>
</center>
  </body>
</html>

6. Open the mqttcheck.html file in your web browser and you will see a result like the following.



Download  source code : https://goo.gl/jK3TsV

Fancy Interface source code  : https://goo.gl/IsVs0A

Friday, July 31, 2015

Would you explain, in simple terms, exactly what object-oriented software is? Here is the answer by Steve Jobs


I wish I had a teacher like him to explain how OOP concepts work. :)
Here is Steve Jobs's answer to the question, from a 1994 Rolling Stone interview:

Objects are like people. They're living, breathing things that have knowledge inside them about how to do things and have memory inside them so they can remember things. And rather than interacting with them at a very low level, you interact with them at a very high level of abstraction, like we're doing right here.

Here's an example: If I'm your laundry object, you can give me your dirty clothes and send me a message that says, "Can you get my clothes laundered, please." I happen to know where the best laundry place in San Francisco is. And I speak English, and I have dollars in my pockets. So I go out and hail a taxicab and tell the driver to take me to this place in San Francisco. I go get your clothes laundered, I jump back in the cab, I get back here. I give you your clean clothes and say, "Here are your clean clothes."

You have no idea how I did that. You have no knowledge of the laundry place. Maybe you speak French, and you can't even hail a taxi. You can't pay for one, you don't have dollars in your pocket. Yet I knew how to do all of that. And you didn't have to know any of it. All that complexity was hidden inside of me, and we were able to interact at a very high level of abstraction. That's what objects are. They encapsulate complexity, and the interfaces to that complexity are high level.

Thursday, April 30, 2015

Building chromium from source code with depot_tools

     Install depot_tools

  1. Confirm git is installed. git 2.2.1+ recommended.
  2. Fetch depot_tools: (from home directory)
    $ git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git
  3. Add the following to your ~/.bashrc file:
export GYP_GENERATORS=ninja
export PATH=$PATH:$HOME/depot_tools
export CHROME_DEVEL_SANDBOX=$HOME/chromium/src/out/Debug/chrome_sandbox
    • Yes, you want to put depot_tools ahead of everything else, otherwise gcl will refer to the GNU Common Lisp compiler.


    Get the Chromium Source Code

Create a new directory chromium in your home directory and clone the source:
mkdir ~/chromium
cd ~/chromium
git clone --depth 1 https://chromium.googlesource.com/chromium/src.git

The depth argument results in a shallow clone so that you don't pull down the massive history. You can remove it if you want a full copy. The clone will take a while (~30 minutes) to complete.

cd ~/chromium/src

fetch --nohooks --no-history chromium --nosvn=True


This also takes a lot of time to fetch.

Install any necessary dependencies


$ ./build/install-build-deps.sh

Run post-sync hooks

Finally, run gclient runhooks to execute any post-sync scripts:

$ gclient runhooks --force

Build Chromium

ninja -C out/Debug chrome

Set Up the Sandbox

cd ~/chromium/src
ninja -C out/Debug chrome_sandbox
sudo chown root:root out/Debug/chrome_sandbox
sudo chmod 4755 out/Debug/chrome_sandbox

Run Chromium

cd ~/chromium/src
out/Debug/chrome


Or run the shell script out/Debug/chrome-wrapper.
If a sandbox error shows up, run the following command to disable the sandbox. (Please note: without the sandbox, security issues may arise.)

./out/Debug/chrome-wrapper --no-sandbox %U



Here we go, browse using your own browser ... :) 





Friday, April 17, 2015

Turn off hibernate logging to console

Using Hibernate with Java is simply brilliant.
But logs like this may make you think differently :)

Hibernate: select securityus0_.ID ....
Hibernate: select securityus0_.ID ....
Hibernate: select securityus0_.ID ....
Hibernate: select securityus0_.ID ....
So you may need to turn off these logs; that is simple.


 Change the property show_sql from true to false in your Hibernate config file.
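
For example, if your session factory is configured through hibernate.cfg.xml, the relevant property line would look like this (adjust to wherever your Hibernate properties live):

<property name="show_sql">false</property>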



Monday, February 23, 2015

Log4j with Maven



1. Add the Log4j dependency:

<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>

2. Add a log4j.properties file inside the "src/main/java" folder (when the project is built, this file will end up in /target/classes; you can select any folder that is a Java build path source):



# Root logger option
log4j.rootLogger=DEBUG, stdout, file
# Redirect log messages to console
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
# Redirect log messages to a log file, support file rolling.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=GIVE_PATH_FOR_LOGFILE
log4j.appender.file.MaxFileSize=5MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n


3. Main function


import org.apache.log4j.Logger;

public class LogMain {

    final static Logger logger = Logger.getLogger(LogMain.class);

    public static void main(String[] args) {
        logger.debug("test debug");
        logger.error("test error");
    }
}

4. Run !!

You will get this in “GIVE_PATH_FOR_LOGFILE”

2015-02-23 15:42:39 DEBUG LogMain:12 - test debug
2015-02-23 15:42:39 ERROR LogMain:13 - test error



Monday, January 5, 2015

Sqoop2 client 1.99.4

There are several changes in the 1.99.4 version, so this may be helpful to you.


import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MFromConfig;
import org.apache.sqoop.model.MJob;
import org.apache.sqoop.model.MLink;
import org.apache.sqoop.model.MLinkConfig;
import org.apache.sqoop.model.MSubmission;
import org.apache.sqoop.model.MToConfig;
import org.apache.sqoop.submission.counter.Counter;
import org.apache.sqoop.submission.counter.CounterGroup;
import org.apache.sqoop.submission.counter.Counters;
import org.apache.sqoop.validation.Status;

public class MysqlToHDFS {
    public static void main(String[] args) {

        String connectionString = "jdbc:mysql://YourMysqlIP:3306/test";
        String username = "YourMysqUserName";
        String password = "YourMysqlPassword";
        String schemaName = "YourMysqlDB";
        String tableName = "Persons";
        String partitionColumn = "PersonID";
        String outputDirectory = "/output/Persons";
        String url = "http://YourSqoopIP:12000/sqoop/";
        String hdfsURI = "hdfs://namenodeIP:8020/";

        SqoopClient client = new SqoopClient(url);

        // create JDBC (MySQL) link
        long fromConnectorId = 2;
        MLink fromLink = client.createLink(fromConnectorId);
        fromLink.setName("JDBC connector1");
        fromLink.setCreationUser("devan");
        MLinkConfig fromLinkConfig = fromLink.getConnectorLinkConfig();
        fromLinkConfig.getStringInput("linkConfig.connectionString").setValue(connectionString);
        fromLinkConfig.getStringInput("linkConfig.jdbcDriver").setValue("com.mysql.jdbc.Driver");
        fromLinkConfig.getStringInput("linkConfig.username").setValue(username);
        fromLinkConfig.getStringInput("linkConfig.password").setValue(password);
        Status fromStatus = client.saveLink(fromLink);
        if (fromStatus.canProceed()) {
            System.out.println("Created JDBC link, ID: " + fromLink.getPersistenceId());
        } else {
            System.out.println("JDBC link creation failed");
        }

        // create HDFS link
        long toConnectorId = 1;
        MLink toLink = client.createLink(toConnectorId);
        toLink.setName("HDFS connector");
        toLink.setCreationUser("devan");
        MLinkConfig toLinkConfig = toLink.getConnectorLinkConfig();
        toLinkConfig.getStringInput("linkConfig.uri").setValue(hdfsURI);
        Status toStatus = client.saveLink(toLink);
        if (toStatus.canProceed()) {
            System.out.println("Created HDFS link, ID: " + toLink.getPersistenceId());
        } else {
            System.out.println("HDFS link creation failed");
        }

        // create a job with the JDBC and HDFS links
        long fromLinkId = fromLink.getPersistenceId();
        long toLinkId = toLink.getPersistenceId();
        MJob job = client.createJob(fromLinkId, toLinkId);
        job.setName("MySQL to HDFS job");
        job.setCreationUser("devan");
        MFromConfig fromJobConfig = job.getFromJobConfig();
        fromJobConfig.getStringInput("fromJobConfig.schemaName").setValue(schemaName);
        fromJobConfig.getStringInput("fromJobConfig.tableName").setValue(tableName);
        fromJobConfig.getStringInput("fromJobConfig.partitionColumn").setValue(partitionColumn);
        MToConfig toJobConfig = job.getToJobConfig();
        toJobConfig.getStringInput("toJobConfig.outputDirectory").setValue(outputDirectory);
        Status status = client.saveJob(job);
        if (status.canProceed()) {
            System.out.println("Created job, ID: " + job.getPersistenceId());
        } else {
            System.out.println("Job can't be created");
        }

        // start the job and poll its progress
        long jobId = job.getPersistenceId();
        MSubmission submission = client.startJob(jobId);
        System.out.println("Job status: " + submission.getStatus());
        while (submission.getStatus().isRunning()
                && submission.getProgress() != -1) {
            System.out.println("Job progress: "
                    + String.format("%.2f %%", submission.getProgress() * 100));
            try {
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        System.out.println("Job finished");
        System.out.println("Hadoop external job ID: " + submission.getExternalId());

        // print job counters, if any
        Counters counters = submission.getCounters();
        if (counters != null) {
            System.out.println("Counters:");
            for (CounterGroup group : counters) {
                System.out.print("\t");
                System.out.println(group.getName());
                for (Counter counter : group) {
                    System.out.print("\t\t");
                    System.out.print(counter.getName());
                    System.out.print(": ");
                    System.out.println(counter.getValue());
                }
            }
        }
        if (submission.getExceptionInfo() != null) {
            System.out.println("Job exception info: " + submission.getExceptionInfo());
        }
        System.out.println("Sqoop job successfully submitted");
    }
}

==========================================================
If you are creating a Maven project, add the following dependency.

 <dependency>
<groupId>org.apache.sqoop</groupId>
<artifactId>sqoop-client</artifactId>
<version>1.99.4</version>
</dependency>