Wednesday, November 25, 2015

Hive Auto increment UDF in Apache Spark

Hive does not provide every function we might need out of the box.
One example is an auto-increment (row number) column in a SELECT query.
Hive does support User Defined Functions (UDFs), though, so users can write custom functions for their own needs.

In this post I describe how to write an auto-increment UDF for Hive and for Hive queries run through Apache Spark. (The query must run with a single mapper; otherwise it will not work properly.)

Code Snippet for Hive : 

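import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

// Non-deterministic and stateful: the result depends on how many rows came before.
@UDFType(deterministic = false, stateful = true)
public class AutoIncrUdf extends UDF {
    private int lastValue; // starts at 0, so the first row gets 1

    public int evaluate() {
        lastValue++;
        return lastValue;
    }
}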
After registering this class as a UDF*, you can call it by its function name, as shown below:
hive> SELECT incr(), * FROM t1;
*For more on how to build the jar and use it in Hive, please refer to: http://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/
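
To see why the single-mapper note matters, here is a small plain-Java sketch with no Hive dependency. It assumes each mapper creates its own instance of the UDF class, so each one counts from 1 on its own. CounterSketch, SingleMapperDemo and number() are made-up names for illustration only, not part of Hive.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the UDF above: one private counter per instance.
class CounterSketch {
    private int lastValue;

    int evaluate() {
        lastValue++;
        return lastValue;
    }
}

class SingleMapperDemo {
    // Hypothetical helper: splits `rows` rows evenly across `mappers` tasks,
    // gives each task its own CounterSketch, and collects every value returned.
    static List<Integer> number(int rows, int mappers) {
        List<Integer> ids = new ArrayList<>();
        int perMapper = rows / mappers;
        for (int m = 0; m < mappers; m++) {
            CounterSketch udf = new CounterSketch(); // each task starts again from 0
            for (int r = 0; r < perMapper; r++) {
                ids.add(udf.evaluate());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(number(6, 1)); // [1, 2, 3, 4, 5, 6]
        System.out.println(number(6, 2)); // [1, 2, 3, 1, 2, 3]
    }
}
```

With one mapper every row gets a distinct number; with two, the same numbers show up twice.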


Code Snippet for Spark Hive :

var i = 0 // counter captured by the UDF closure
sqlContext.udf.register("incr", () => {
  i = i + 1
  i
})
After registering the UDF with this code, you can use the function in a Hive query:

sqlContext.sql("select incr(),* from tableName")
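
As far as I understand, the same caveat applies here: the variable i is copied into each task, so a table with more than one partition can produce repeated numbers. One common way around this (a different technique from the UDF above, shown only as a sketch) is to count the rows in each partition first and start every partition at the total of the ones before it. The plain-Java example below shows just that arithmetic; PartitionOffsetSketch and assignIds() are hypothetical names, and no Spark API is used.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionOffsetSketch {
    // Hypothetical helper: partitionSizes[p] is the row count of partition p.
    // Returns, per partition, the ids its rows would receive.
    static List<List<Integer>> assignIds(int[] partitionSizes) {
        List<List<Integer>> result = new ArrayList<>();
        int offset = 0; // number of rows in all earlier partitions
        for (int size : partitionSizes) {
            List<Integer> ids = new ArrayList<>();
            for (int local = 1; local <= size; local++) {
                ids.add(offset + local); // local counter shifted by the offset
            }
            result.add(ids);
            offset += size;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(assignIds(new int[] {3, 2, 4}));
        // [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    }
}
```

Each partition still counts locally, but the offsets keep the numbers unique across partitions.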

Spark 1.4+ PermGenSize Error - IntelliJ IDEA

Error : 


java.lang.OutOfMemoryError: PermGen space
Stopping spark context.
Exception in thread "main" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"


Solution : 

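This error comes from the JVM's permanent generation (PermGen), which exists on Java 7 and earlier, so raise its limit in the VM options of the run configuration.

Go to Run => Edit Configurations and add the following line under VM options (I am giving a 1 GB maximum heap, a 512 MB initial heap and a 512 MB MaxPermSize):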

 -Xmx1024m -XX:MaxPermSize=512m -Xms512m
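
If you want to check that IntelliJ really passed these options to the program, a small sketch like the one below can help. It uses only the standard Runtime and ManagementFactory classes; VmOptionsCheck is a made-up class name. The reported heap may be a little lower than the -Xmx value, depending on the JVM.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class VmOptionsCheck {
    // Maximum heap the JVM will try to use, in MB (driven by -Xmx).
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    // Flags the JVM was started with, for example -Xmx1024m.
    static List<String> jvmFlags() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMb());
        System.out.println("JVM flags: " + jvmFlags());
    }
}
```

Run it with the same run configuration; the three flags should appear in the printed list.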





That's it!