
BeanFactoryPostProcessor in Spring Framework


The BeanFactoryPostProcessor interface in Spring resides in the org.springframework.beans.factory.config package. A BeanFactoryPostProcessor implementation is used to read the configuration metadata and potentially change it before the beans are instantiated by the IoC container.

You can configure multiple BeanFactoryPostProcessors. You can also control the order in which they execute by setting the order property, but only if the BeanFactoryPostProcessor implements the Ordered interface.

BeanFactoryPostProcessor interface in Spring

BeanFactoryPostProcessor is a functional interface, meaning it has a single abstract method, postProcessBeanFactory(), which you need to implement in order to modify the bean definitions.


@FunctionalInterface
public interface BeanFactoryPostProcessor {

  void postProcessBeanFactory(ConfigurableListableBeanFactory beanFactory)
      throws BeansException;
}
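
Since it is marked @FunctionalInterface, a BeanFactoryPostProcessor can also be supplied as a lambda expression when you register it programmatically on the application context. Below is a minimal, illustrative sketch; the class name LambdaBFPPDemo is made up, while appcontext.xml is the configuration file used later in this post.

import org.springframework.context.support.ClassPathXmlApplicationContext;

public class LambdaBFPPDemo {

  public static void main(String[] args) {
    // Create the context without refreshing it yet
    ClassPathXmlApplicationContext context = new ClassPathXmlApplicationContext();
    context.setConfigLocation("appcontext.xml");
    // BeanFactoryPostProcessor provided as a lambda; it runs before any bean is instantiated
    context.addBeanFactoryPostProcessor(beanFactory -> {
      for (String name : beanFactory.getBeanDefinitionNames()) {
        System.out.println("bean definition - " + name);
      }
    });
    context.refresh();
    context.close();
  }
}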

Usage of BeanFactoryPostProcessor in Spring

Implementations of the BeanFactoryPostProcessor interface are used by the Spring framework itself. For example, when you read from property files in Spring, configuring the <context:property-placeholder> element registers a PropertySourcesPlaceholderConfigurer, which implements the BeanFactoryPostProcessor interface and resolves the property placeholders in the bean definitions.
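
As a point of reference, the registration performed by <context:property-placeholder> roughly corresponds to declaring the post processor yourself. A minimal Java configuration sketch, assuming a configuration class named PropertyConfig and the db.properties file shown later in this post:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.support.PropertySourcesPlaceholderConfigurer;
import org.springframework.core.io.ClassPathResource;

@Configuration
public class PropertyConfig {

  // Declared static so the post processor is created early,
  // before the configuration class itself is fully processed
  @Bean
  public static PropertySourcesPlaceholderConfigurer propertyConfigurer() {
    PropertySourcesPlaceholderConfigurer configurer = new PropertySourcesPlaceholderConfigurer();
    configurer.setLocation(new ClassPathResource("db.properties"));
    return configurer;
  }
}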

Spring BeanFactoryPostProcessor example

Here let’s have a simple example of BeanFactoryPostProcessor in Spring.

The scenario is that you have set the DB configuration properties in a property file, but for a particular run you want to use a separate schema which is set up in such a way that all the other properties remain the same except the url. That means you want to override the url property of the DataSource and modify it so that you can connect to the new schema.

A better option would be to create separate profiles and switch among them (sketched below), but you can also access the bean definition and modify the value of the property using a BeanFactoryPostProcessor.
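
For comparison, the profile based alternative mentioned above could look roughly like the sketch below; the profile name, the class name and the second DataSource definition are assumptions made purely for illustration, reusing the connection values from the db.properties file shown next.

import org.apache.commons.dbcp2.BasicDataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;

@Configuration
public class DataSourceProfiles {

  // Used only when the "test-schema" profile is active,
  // e.g. by passing -Dspring.profiles.active=test-schema
  @Bean
  @Profile("test-schema")
  public BasicDataSource testSchemaDataSource() {
    BasicDataSource ds = new BasicDataSource();
    ds.setDriverClassName("com.mysql.jdbc.Driver");
    ds.setUrl("jdbc:mysql://localhost:3306/TestSchema");
    ds.setUsername("root");
    ds.setPassword("admin");
    return ds;
  }
}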

db.properties file


db.driverClassName=com.mysql.jdbc.Driver
db.url=jdbc:mysql://localhost:3306/netjs
db.username=root
db.password=admin
pool.initialSize=5

XML configuration for the datasource


<bean id="dataSource" class="org.apache.commons.dbcp2.BasicDataSource">
<property name="driverClassName" value = "${db.driverClassName}" />
<property name="url" value = "${db.url}" />
<property name="username" value = "${db.username}" />
<property name="password" value = "${db.password}" />
<property name="initialSize" value = "${pool.initialSize}" /

</bean>

BeanFactoryPostProcessor implementation


import org.springframework.beans.BeansException;
import org.springframework.beans.MutablePropertyValues;
import org.springframework.beans.PropertyValue;
import org.springframework.beans.factory.config.BeanDefinition;
import org.springframework.beans.factory.config.BeanFactoryPostProcessor;
import org.springframework.beans.factory.config.ConfigurableListableBeanFactory;
import org.springframework.core.Ordered;

public class TestDBPostProcessor implements BeanFactoryPostProcessor, Ordered {

  @Override
  public void postProcessBeanFactory(ConfigurableListableBeanFactory beanFactory)
      throws BeansException {
    System.out.println("In postProcessBeanFactory");
    // Getting the bean definition of the dataSource bean
    BeanDefinition bd = beanFactory.getBeanDefinition("dataSource");
    if (bd.hasPropertyValues()) {
      MutablePropertyValues pvs = bd.getPropertyValues();
      PropertyValue[] pvArray = pvs.getPropertyValues();
      for (PropertyValue pv : pvArray) {
        System.out.println("pv -- " + pv.getName());
        // Changing the value of the url property
        if (pv.getName().equals("url")) {
          pvs.add(pv.getName(), "jdbc:mysql://localhost:3306/TestSchema");
        }
      }
    }
  }

  @Override
  public int getOrder() {
    // Run this post processor first among the configured ones
    return 0;
  }
}

As you can see, in the postProcessBeanFactory() method you get the bean definition for the dataSource bean and modify it.

To register the BeanFactoryPostProcessor add the following line in your configuration.


<bean class="org.netjs.config.TestDBPostProcessor" />

Here is the method where I want to use the new schema.


public List<Employee> findAllEmployees() {
  System.out.println("URL " + ((BasicDataSource) jdbcTemplate.getDataSource()).getUrl());
  return this.jdbcTemplate.query(SELECT_ALL_QUERY, (ResultSet rs) -> {
    List<Employee> list = new ArrayList<Employee>();
    while (rs.next()) {
      Employee emp = new Employee();
      emp.setEmpId(rs.getInt("id"));
      emp.setEmpName(rs.getString("name"));
      emp.setAge(rs.getInt("age"));
      list.add(emp);
    }
    return list;
  });
}
To run this example the following code can be used.

public class App {

  public static void main(String[] args) {
    ClassPathXmlApplicationContext context =
        new ClassPathXmlApplicationContext("appcontext.xml");
    EmployeeDAO dao = (EmployeeDAO) context.getBean("employeeDAOImpl");
    List<Employee> empList = dao.findAllEmployees();
    for (Employee emp : empList) {
      System.out.println("Name - " + emp.getEmpName() + " Age - " + emp.getAge());
    }
    context.close();
  }
}

Output


Relevant lines from the console.

In postProcessBeanFactory
pv -- driverClassName
pv -- url
pv -- username
pv -- password
pv -- initialSize

URL jdbc:mysql://localhost:3306/TestSchema

That's all for this topic BeanFactoryPostProcessor in Spring Framework. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. BeanPostProcessor in Spring Framework
  2. ApplicationContextAware And BeanNameAware Interfaces in Spring Framework
  3. BeanFactoryAware Interface in Spring Framework
  4. Injecting Inner Bean in Spring
  5. Circular Dependency in Spring Framework

You may also like -

>>>Go to Spring Tutorial Page


JDBCTemplate With ResultSetExtractor Example in Spring


In the post Select Query Using JDBCTemplate in Spring Framework we have already seen an example of extracting data from a ResultSet using RowMapper. A RowMapper is usually the simpler choice for ResultSet processing, mapping one result object per row, but there is another option in the Spring framework, known as ResultSetExtractor, which gives one result object for the entire ResultSet.

In this post we'll see an example of JdbcTemplate along with ResultSetExtractor. ResultSetExtractor is a callback interface used by JdbcTemplate's query methods, so you pass an instance of ResultSetExtractor to JdbcTemplate's query method.

query method signature with ResultSetExtractor


public <T> T query(java.lang.String sql, ResultSetExtractor<T> rse) throws DataAccessException

After the execution of the query, the ResultSet can be read with a ResultSetExtractor.

ResultSetExtractor interface in Spring

ResultSetExtractor interface is a functional interface used by JDBCTemplate’s query method. It has a single abstract method extractData().


T extractData(java.sql.ResultSet rs) throws java.sql.SQLException, DataAccessException

The implementing class must implement this method to process the entire ResultSet.
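
As a quick, hedged illustration of what "one result object for the entire ResultSet" means, the extractor below simply counts the rows; it is a made-up example, separate from the employee example that follows.

import java.sql.ResultSet;
import java.sql.SQLException;

import org.springframework.dao.DataAccessException;
import org.springframework.jdbc.core.ResultSetExtractor;

// Returns a single Integer for the whole ResultSet rather than one object per row
public class RowCountExtractor implements ResultSetExtractor<Integer> {

  @Override
  public Integer extractData(ResultSet rs) throws SQLException, DataAccessException {
    int count = 0;
    while (rs.next()) {
      count++;
    }
    return count;
  }
}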

Example using JDBCTemplate and ResultSetExtractor interface

Technologies used


Spring 5.0.4
Apache DBCP2
MYSQL 5.1.39
Java 8
Apache Maven 3.3.3

Maven dependencies

If you are using Apache Maven then you can provide dependencies in your pom.xml.

Refer Creating a Maven project in Eclipse to see how to set up Maven project.

With all the dependencies your pom.xml should look something like this -


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>org.netjs.prog</groupId>
<artifactId>maven-spring</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>

<name>maven-spring</name>
<url>http://maven.apache.org</url>

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spring.version>5.0.4.RELEASE</spring.version>
</properties>

<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>

<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-core</artifactId>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context</artifactId>
<version>${spring.version}</version>
</dependency>

<dependency>
<groupId>javax.inject</groupId>
<artifactId>javax.inject</artifactId>
<version>1</version>
</dependency>

<!-- Spring JDBC Support -->
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-jdbc</artifactId>
<version>${spring.version}</version>
</dependency>

<!-- MySQL Driver -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.39</version>
</dependency>

<!-- Apache DBCP connection pool -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-dbcp2</artifactId>
<version>2.1</version>
</dependency>
</dependencies>
</project>

Alternatively you can download the jars and add them to the class path.

Database table

For this example I have created a table called employee with the columns id, name and age in the MySQL DB. Column id is configured as auto increment, so there is no need to pass id from your query as the DB will provide the value for it.


CREATE TABLE `employee` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(35) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
)

Setting up dependencies

In the bean for JDBCTemplate you will need to set the DataSource bean as dependency. In the DataSource bean you will need to provide DB properties. It is better if you read DB configuration parameters from a properties file.

Property file db.properties which is under the config folder has all the properties.

db.properties


db.driverClassName=com.mysql.jdbc.Driver
db.url=jdbc:mysql://localhost:3306/netjs
db.username=
db.password=
pool.initialSize=5

The properties used here are described as follows -

driverClassName - the JDBC driver for the DB used. Since MySQL is used here, the JDBC driver for it (com.mysql.jdbc.Driver) is provided.

url - the URL to access your DB server. I have created a schema called netjs and the DB is running on the same system, so the url is jdbc:mysql://localhost:3306/netjs.

username and password for the DB.

initialSize - the initial size of the connection pool. It is given as 5, so initially 5 connections will be created and stored in the pool.

XML Configuration


<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:aop="http://www.springframework.org/schema/aop"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:c="http://www.springframework.org/schema/c"
xmlns:p="http://www.springframework.org/schema/p"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.0.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context.xsd">

<context:property-placeholder location="classpath:config/db.properties" />

<context:component-scan base-package="org.netjs.daoimpl" />

<bean id="dataSource" class="org.apache.commons.dbcp2.BasicDataSource">
<property name="driverClassName" value = "${db.driverClassName}" />
<property name="url" value = "${db.url}" />
<property name="username" value = "${db.username}" />
<property name="password" value = "${db.password}" />
<property name="initialSize" value = "${pool.initialSize}" />
</bean>

<bean id="jdbcTemplate" class="org.springframework.jdbc.core.JdbcTemplate">
<property name="dataSource" ref="dataSource"></property>
</bean>

</beans>

Java Classes

Spring always promotes coding to interfaces, and there is also a JEE design pattern for the database layer, called DAO, which says the same thing - separate the low level data access code from the business layers.

So we have an EmployeeDAO interface with the find methods and its implementing class EmployeeDAOImpl. There is also a model class Employee with all the getters/setters.

Employee.java class


public class Employee {
private int empId;
private String empName;
private int age;

public int getEmpId() {
return empId;
}
public void setEmpId(int empId) {
this.empId = empId;
}
public String getEmpName() {
return empName;
}
public void setEmpName(String empName) {
this.empName = empName;
}
public int getAge() {
return age;
}
public void setAge(int age) {
this.age = age;
}
}
EmployeeDAO interface

import java.util.List;

import org.netjs.model.Employee;

public interface EmployeeDAO {
  public List<Employee> findAllEmployees();
}

EmployeeDAOImpl class


@Repository
public class EmployeeDAOImpl implements EmployeeDAO {
@Autowired
private JdbcTemplate jdbcTemplate;

final String SELECT_ALL_QUERY = "SELECT id, name, age from EMPLOYEE";
public void setJdbcTemplate(JdbcTemplate jdbcTemplate) {
this.jdbcTemplate = jdbcTemplate;
}
public List<Employee> findAllEmployees() {
return this.jdbcTemplate.query(SELECT_ALL_QUERY,
new ResultSetExtractor<List<Employee>>() {

@Override
public List<Employee> extractData(ResultSet rs)
throws SQLException, DataAccessException {
List<Employee> list = new ArrayList<Employee>();
while(rs.next()){
Employee emp = new Employee();
emp.setEmpId(rs.getInt("id"));
emp.setEmpName(rs.getString("name"));
emp.setAge(rs.getInt("age"));
list.add(emp);
}
return list;
}
});
}
}

Notice how you are not writing any code for getting or closing the connection, or for exception handling. All that boilerplate is managed by the JdbcTemplate class. It's the JdbcTemplate which gets the connection using the DataSource provided to it, creates and executes the statement and closes the connection.

If any SQLException is thrown, that is also caught by JdbcTemplate, translated to one of the DataAccessException subclasses and rethrown.
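
Since DataAccessException is an unchecked exception you are not forced to catch it, but you can do so when you want to react to DB failures. A small hedged sketch of a fragment that could sit inside the test class's main method shown later; the log messages are illustrative only.

import org.springframework.dao.DataAccessException;

// ... inside main(), after obtaining the EmployeeDAO bean
try {
    List<Employee> empList = dao.findAllEmployees();
    System.out.println("Fetched " + empList.size() + " employees");
} catch (DataAccessException e) {
    // The original SQLException has already been translated by JdbcTemplate
    System.out.println("Data access failed: " + e.getMessage());
}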

ResultSetExtractor interface is implemented as an anonymous inner class and you can see the implementation of extractData method where the returned ResultSet is processed.

Test class

You can use the following code in order to test the example -

public class App {
public static void main(String[] args) {

ClassPathXmlApplicationContext context = new ClassPathXmlApplicationContext("appcontext.xml");
EmployeeDAO dao = (EmployeeDAO)context.getBean("employeeDAOImpl");
List<Employee> empList = dao.findAllEmployees();
for(Employee emp : empList){
System.out.println("Name - "+ emp.getEmpName() + " Age - "
+ emp.getAge());
}
context.close();
}
}

ResultSetExtractor implemented as a Lambda expression

Since ResultSetExtractor is a functional interface, from Java 8 onwards it can also be implemented as a lambda expression.


public List<Employee> findAllEmployees() {
return this.jdbcTemplate.query(SELECT_ALL_QUERY, (ResultSet rs) -> {
List<Employee> list = new ArrayList<Employee>();
while(rs.next()){
Employee emp = new Employee();
emp.setEmpId(rs.getInt("id"));
emp.setEmpName(rs.getString("name"));
emp.setAge(rs.getInt("age"));
list.add(emp);
}
return list;
});
}

That's all for this topic JDBCTemplate With ResultSetExtractor Example in Spring. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Insert\Update Using NamedParameterJDBCTemplate in Spring Framework
  2. Select Query Using NamedParameterJDBCTemplate in Spring Framework
  3. registerShutdownHook() Method in Spring Framework
  4. Excluding Bean From Autowiring in Spring
  5. @Resource Annotation in Spring Autowiring

You may also like -

>>>Go to Spring Tutorial Page

@Import Annotation in Spring JavaConfig


If you are using JavaConfig in Spring to configure the bean definitions then, in order to modularize your configurations, you can use the @Import annotation.

@Import annotation in Spring JavaConfig allows for loading @Bean definitions from another configuration class so you can group your configuration by modules or functionality which makes your code easy to maintain.

@Import annotation is similar to <import/> element which is used to divide the large Spring XML configuration file into smaller XMLs and then import those resources.

How @Import annotation works

Let’s say you have two configuration classes ConfigA and ConfigB then you can import ConfigA into ConfigB as shown below.


@Configuration
public class ConfigA {
@Bean
public A a() {
return new A();
}
}

@Configuration
@Import(ConfigA.class)
public class ConfigB {
@Bean
public B b() {
return new B();
}
}
Then you don't need to specify both ConfigA.class and ConfigB.class when instantiating the context, so the following is not required.

ApplicationContext ctx = new AnnotationConfigApplicationContext(ConfigA.class, ConfigB.class);
As the bean definitions of ConfigA are already loaded by using the @Import annotation on ConfigB, only ConfigB needs to be specified explicitly.

ApplicationContext ctx = new AnnotationConfigApplicationContext( ConfigB.class);

@Import annotation example

Let's see a proper example using the @Import annotation and Spring JavaConfig. The objective is to insert a record into the DB using NamedParameterJdbcTemplate. For that you need a DataSource, a NamedParameterJdbcTemplate configured with that DataSource, and a class where the NamedParameterJdbcTemplate is used to insert a record in the DB. We'll use separate config classes in order to have modular code.

DataSource Configuration


import org.apache.commons.dbcp2.BasicDataSource;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Import;
import org.springframework.context.annotation.PropertySource;
import org.springframework.core.env.Environment;

@Configuration
@PropertySource(value="classpath:config/db.properties", ignoreResourceNotFound=true)
@Import({EmpConfig.class, JDBCConfig.class})
public class DBConfig {
@Autowired
private Environment env;

@Bean
public BasicDataSource dataSource() {
BasicDataSource ds = new BasicDataSource();
System.out.println("User " + env.getProperty("db.username"));
ds.setDriverClassName(env.getProperty("db.driverClassName"));
ds.setUrl(env.getProperty("db.url"));
ds.setUsername(env.getProperty("db.username"));
ds.setPassword(env.getProperty("db.password"));
return ds;
}
}

Note that the DB properties are read from a properties file. The DataSource used is Apache DBCP BasicDataSource.

NamedParameterJDBCTemplate configuration


import org.apache.commons.dbcp2.BasicDataSource;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class JDBCConfig {
private final BasicDataSource dataSource;

@Autowired
public JDBCConfig(BasicDataSource dataSource) {
this.dataSource = dataSource;
}

@Bean
public NamedParameterJdbcTemplate namedJdbcTemplate() {
return new NamedParameterJdbcTemplate(dataSource);
}
}

EmpConfig Class


import org.netjs.dao.EmployeeDAO;
import org.netjs.daoimpl.EmployeeDAOImpl1;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class EmpConfig {

  @Autowired
  private NamedParameterJdbcTemplate namedJdbcTemplate;

  @Bean
  public EmployeeDAO empService() {
    // Implementing class is EmployeeDAOImpl1 as defined below
    return new EmployeeDAOImpl1(namedJdbcTemplate);
  }
}

EmployeeDAO interface


public interface EmployeeDAO {
public int save(Employee employee);

}

EmployeeDAOImpl1 class


public class EmployeeDAOImpl1 implements EmployeeDAO {

private NamedParameterJdbcTemplate namedJdbcTemplate;
final String INSERT_QUERY = "insert into employee (name, age) values (:name, :age)";

public EmployeeDAOImpl1(NamedParameterJdbcTemplate namedJdbcTemplate){
this.namedJdbcTemplate = namedJdbcTemplate;
}

@Override
public int save(Employee employee) {
// Creating map with all required params
Map<String, Object> paramMap = new HashMap<String, Object>();
paramMap.put("name", employee.getEmpName());
paramMap.put("age", employee.getAge());
// Passing map containing named params
return namedJdbcTemplate.update(INSERT_QUERY, paramMap);
}
}

Employee Bean


public class Employee {
private int empId;
private String empName;
private int age;

public void setEmpId(int empId) {
this.empId = empId;
}

public void setEmpName(String empName) {
this.empName = empName;
}

public void setAge(int age) {
this.age = age;
}

public int getEmpId() {
return empId;
}

public String getEmpName() {
return empName;
}

public int getAge() {
return age;
}
}
You can run this example using the following code.

public class App {

public static void main(String[] args) {

AbstractApplicationContext context = new AnnotationConfigApplicationContext
(DBConfig.class);
EmployeeDAO empBean = (EmployeeDAO)context.getBean("empService");
Employee emp = new Employee();
emp.setEmpName("Jacko");
emp.setAge(27);
int status = empBean.save(emp);
context.close();
}

}
As you can see, only DBConfig.class is specified here, not all three config classes; still you can get the empService bean which is defined in EmpConfig.
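
As a small aside, instead of the bean name you could also look the bean up by type, which avoids the cast; this is just a variation, not part of the original example.

EmployeeDAO empBean = context.getBean(EmployeeDAO.class);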

That's all for this topic @Import Annotation in Spring JavaConfig. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Using Conditional Annotation in Spring Framework
  2. @Resource Annotation in Spring Autowiring
  3. @Required Annotation in Spring Framework
  4. Benefits, Disadvantages And Limitations of Autowiring in Spring
  5. Using depends-on Attribute in Spring

You may also like -

>>>Go to Spring Tutorial Page

ServiceLocatorFactoryBean in Spring Framework


ServiceLocatorFactoryBean in the Spring framework, as the name suggests, is an implementation of the service locator design pattern and helps with locating a service at run time.

ServiceLocatorFactoryBean helps if you have more than one implementation of the same type and want to use the appropriate implementation at the run time i.e. you have an interface and more than one class implementing that interface and you want to have a factory that will return an appropriate object at run time.

How ServiceLocatorFactoryBean works

The ServiceLocatorFactoryBean class in Spring has a field serviceLocatorInterface that takes an interface which must have one or more methods with the signature MyService getService() or MyService getService(String id). The Spring framework creates a dynamic proxy which implements that interface, delegating to an underlying BeanFactory.

When you call the getService(String id) method of your interface, passing the bean id as an argument, the proxy internally calls BeanFactory.getBean(String), returning the bean whose name was passed as the argument to the getBean() method of the BeanFactory.
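
In other words, the dynamic proxy that Spring generates behaves roughly like the hand-written locator sketched below; MyService and the class name are placeholders used only to illustrate the delegation.

import org.springframework.beans.BeansException;
import org.springframework.beans.factory.BeanFactory;
import org.springframework.beans.factory.BeanFactoryAware;

// Roughly what the generated proxy does for a MyService getService(String id) method
public class ManualServiceLocator implements BeanFactoryAware {

  private BeanFactory beanFactory;

  @Override
  public void setBeanFactory(BeanFactory beanFactory) throws BeansException {
    this.beanFactory = beanFactory;
  }

  public MyService getService(String id) {
    // Delegates to the underlying BeanFactory using the passed bean id
    return beanFactory.getBean(id, MyService.class);
  }
}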

Advantage of having a ServiceLocatorFactoryBean is that such service locators permit the decoupling of calling code from the BeanFactory API, by using an appropriate custom locator interface. They will typically be used for prototype beans, i.e. for factory methods that are supposed to return a new instance for each call.

ServiceLocatorFactoryBean Example

Let’s see an example to make things clearer as the theory looks quite complex!

We want to determine at the run time based on the input (argument passed) whether to make a cash payment or card payment. So we have an interface IPayment and implementing classes CashPayment and CardPayment.


public interface IPayment{
void executePayment();
}

public class CashPayment implements IPayment{

public void executePayment() {
System.out.println("Perform Cash Payment ");
}
}

public class CardPayment implements IPayment{
public void executePayment() {
System.out.println("Perform Card Payment ");
}
}

Service locator interface

This service locator interface will be injected in ServiceLocatorFactoryBean.

public interface PaymentFactory {
public IPayment getPayment(String paymentType);
}

Here is the bean into which the ServiceLocatorFactoryBean will be injected. It also has a method where the injected factory instance is used to get the required bean by bean name.


import org.springframework.beans.factory.annotation.Autowired;

public class PaymentService {
@Autowired
private PaymentFactory paymentFactory;

public void setPaymentFactory(PaymentFactory paymentFactory) {
this.paymentFactory = paymentFactory;
}

public void makePayment(String paymentType){
IPayment payment = paymentFactory.getPayment(paymentType);
payment.executePayment();
}
}

XML Configuration


<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.0.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context.xsd">

<context:annotation-config/>

<!-- Prototype bean since we have state -->
<bean id="cashPayment" class="org.netjs.exp.Spring_Example.CashPayment"
scope="prototype" />
<bean id="cardPayment" class="org.netjs.exp.Spring_Example.CardPayment"
scope="prototype" />

<!-- ServiceLocatorFactoryBean -->
<bean id="paymentFactory"
class="org.springframework.beans.factory.config.ServiceLocatorFactoryBean">
<property name="serviceLocatorInterface" value="org.netjs.exp.Spring_Example.PaymentFactory"/>
</bean>

<bean id="payServiceBean" class="org.netjs.exp.Spring_Example.PaymentService">
</bean>

</beans>
You can run the example using the following code.

import org.springframework.context.support.AbstractApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class App {
public static void main( String[] args ){
AbstractApplicationContext context = new ClassPathXmlApplicationContext
("appcontext.xml");

PaymentService payService = (PaymentService)context.getBean("payServiceBean");

payService.makePayment("cardPayment");

context.registerShutdownHook();
}
}

Note that "cardPayment" is passed here as the payment type. From the service factory dynamic proxy implementation Spring framework will internally call BeanFactory.getBean(“cardPayment”) to return an instance of the specified bean.

Reference: https://docs.spring.io/spring/docs/current/javadoc-api/org/springframework/beans/factory/config/ServiceLocatorFactoryBean.html

That's all for this topic ServiceLocatorFactoryBean in Spring. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Different Bean Scopes in Spring
  2. ApplicationContextAware And BeanNameAware Interfaces in Spring Framework
  3. Lazy Initializing Spring Beans
  4. Bean Definition Inheritance in Spring
  5. Autodiscovery of Bean Using component-scan in Spring

You may also like -

>>>Go to Spring Tutorial Page

Internationalization (i18n) Using MessageSource in Spring


In an interview, if the difference between BeanFactory and ApplicationContext is asked, one of the reasons people give for using ApplicationContext is the support for internationalization it provides. This post shows exactly that: how internationalization (i18n) can be provided in Spring using ApplicationContext and MessageSource.

ApplicationContext and MessageSource

The ApplicationContext interface extends an interface called MessageSource, and therefore provides internationalization (i18n) functionality. Spring also provides the interface HierarchicalMessageSource, which can resolve messages hierarchically. Using these two interfaces Spring effects message resolution. The methods defined on these interfaces include:

  • String getMessage(String code, Object[] args, String default, Locale loc)- The basic method used to retrieve a message from the MessageSource. When no message is found for the specified locale, the default message is used. In the properties file it looks for the key with the same value as the code parameter.
  • String getMessage(String code, Object[] args, Locale loc)- Essentially the same as the previous method, but with one difference: no default message can be specified; if the message cannot be found, a NoSuchMessageException is thrown.
  • String getMessage(MessageSourceResolvable resolvable, Locale locale)- All properties used in the preceding methods are also wrapped in a class named MessageSourceResolvable, which you can use with this method.

When an ApplicationContext is loaded, it automatically searches for a MessageSource bean defined in the context. The bean must have the name messageSource. If such a bean is found, all calls to the preceding methods are delegated to the message source.

There are two MessageSource implementations provided by Spring framework ResourceBundleMessageSource and StaticMessageSource. The StaticMessageSource is rarely used but provides programmatic ways to add messages to the source. ResourceBundleMessageSource can be configured as shown below.


<bean id="messageSource"
class="org.springframework.context.support.ResourceBundleMessageSource">

Internationalization example in Spring

Here we'll have two properties files, format and error; for locale specific files you add the locale to the file name. If you have the format file for the locales UK (en_GB) and US (en_US) then you will create two files, format_en_GB.properties and format_en_US.properties.

While defining the message source bean you just need to provide the base names (i.e. format and error); based on the passed locale the correct properties file will be picked.

Configuration file


<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:aop="http://www.springframework.org/schema/aop"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:c="http://www.springframework.org/schema/c"
xmlns:p="http://www.springframework.org/schema/p"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-4.0.xsd
http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context.xsd">

<bean id="messageSource"
class="org.springframework.context.support.ResourceBundleMessageSource">
<property name="basenames">
<list>
<value>config/format</value>
<value>config/error</value>
</list>
</property>
</bean>
</beans>

Since the properties files are inside the config folder, the base names are config/format and config/error.
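
If you prefer Java configuration, an equivalent messageSource bean can be declared as below; the configuration class name is an assumption, while the base names are the ones used in this example.

import org.springframework.context.MessageSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.support.ResourceBundleMessageSource;

@Configuration
public class MessageConfig {

  // The bean name must be "messageSource" for the ApplicationContext to pick it up
  @Bean
  public MessageSource messageSource() {
    ResourceBundleMessageSource messageSource = new ResourceBundleMessageSource();
    messageSource.setBasenames("config/format", "config/error");
    return messageSource;
  }
}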


format_en_GB.properties


dateformat=use date format dd/mm/yyyy

format_en_US.properties


dateformat=use date format mm/dd/yyyy

error_en_US.properties


argument.required=The {0} is required.

error_de.properties


argument.required=Der {0} ist erforderlich.

You can run it using the following code.


public class App {

  public static void main(String[] args) {
    AbstractApplicationContext context =
        new ClassPathXmlApplicationContext("appcontext.xml");

    System.out.println("date format msg " + context.getMessage(
        "dateformat", null, Locale.UK));
    System.out.println("date format msg " + context.getMessage(
        "dateformat", null, Locale.US));

    System.out.println("Name error msg " + context.getMessage("argument.required",
        new Object[]{"Name"}, Locale.US));
    System.out.println("Name error msg " + context.getMessage("argument.required",
        new Object[]{"Name"}, Locale.GERMANY));

    context.close();
  }
}

Output


date format msg use date format dd/mm/yyyy
date format msg use date format mm/dd/yyyy
Name error msg The Name is required.
Name error msg Der Name ist erforderlich.

That's all for this topic Internationalization Using MessageSource in Spring. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Dependency Injection Using factory-method in Spring
  2. Wiring Collections in Spring
  3. How to Inject Null And Empty String Values in Spring
  4. @Import Annotation in Spring JavaConfig
  5. How to Inject Prototype Scoped Bean in Singleton Bean

You may also like -

>>>Go to Spring Tutorial Page

How to Handle Missing And Under Replicated Blocks in HDFS


In this post we’ll see how to handle missing or corrupt blocks in HDFS and how to handle under replicated blocks in HDFS.

How to get information about corrupt or missing blocks

For getting information about corrupt or missing blocks in HDFS you can use the hdfs fsck -list-corruptfileblocks command, which prints out the list of missing blocks and the files they belong to.

Using that information you can decide how important the files with missing blocks are, since the easiest way out is to delete such a file and copy it to HDFS again. If you are ok with deleting the files that have corrupt blocks you can use the following command.

hdfs fsck / -delete

This command deletes corrupted files.

If you still want to have a shot at fixing the corrupted blocks, then using the file names you got from running the hdfs fsck -list-corruptfileblocks command you can use the following command.

hdfs fsck <path to file> -locations -blocks -files

This command prints out the locations for every block. Using that information you can go to the data nodes where a block is stored and verify whether there is any network or hardware related error or any file system problem; fixing that may make the block healthy again.

Fixing under replicated blocks problem

If you have under replicated blocks in HDFS for files then you can use hdfs fsck / command to get that information.

Then you can use the following script where hdfs dfs -setrep <replication number> command is used to set required replication factor for the files.


$ hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/files

$ for problemfile in `cat /tmp/files`; do echo "Setting replication for $problemfile"; hdfs dfs -setrep 3 $problemfile; done

Actually, when you run the hdfs fsck / command, the output for under replicated blocks is in the following form -

File name: Under replicated <block>.
Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).

From this output the awk command takes the file name of each line where the words "Under replicated" are found and writes those names to a temp file. Then you set the replication factor to 3 (in this case) for those files.

That's all for this topic How to Handle Missing And Under Replicated Blocks in HDFS. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Replica Placement Policy in Hadoop Framework
  2. NameNode, DataNode And Secondary NameNode in HDFS
  3. HDFS Commands Reference List
  4. HDFS High Availability
  5. File Read in HDFS - Hadoop Framework Internal Steps

You may also like -

>>>Go to Hadoop Framework Page

How to Write a Map Only Job in Hadoop MapReduce


In a MapReduce job in Hadoop you generally write both a map function and a reduce function: the map function to generate (key, value) pairs and the reduce function to aggregate those (key, value) pairs. But you may opt to have only the map function in your MapReduce job and skip the reducer part. That is known as a Mapper only job in Hadoop MapReduce.

Mapper only job

You may have a scenario where you just want to generate (key, value) pairs; in that case you can write a job with only a map function, for example when you want to convert a file to a binary file format like SequenceFile or to a columnar file format like Parquet.

Note that generally in a MapReduce job the output of the mappers is written to local disk rather than to HDFS. In the case of a Mapper only job the map output is written to HDFS, which is one of the differences between a MapReduce job and a Mapper only job in Hadoop.

Writing Mapper only job

In order to write a mapper only job you need to set the number of reducers to zero. You can do that by adding job.setNumReduceTasks(0); in your driver class.

As example

@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = Job.getInstance(conf, "TestClass");
job.setJarByClass(getClass());
job.setMapperClass(TestMapper.class);
// Setting reducer to zero
job.setNumReduceTasks(0);
.....
.....

}

Another way to have a Mapper only job is to pass the configuration parameter on the command line. The parameter used is mapreduce.job.reduces; note that before Hadoop 2 the parameter was mapred.reduce.tasks, which is deprecated now.

As example-


hadoop jar /path/to/jar ClasstoRun -D mapreduce.job.reduces=0 /input/path /output/path

Mapper only job runs faster

In a regular MapReduce job the output of the map phase is partitioned and sorted on keys and then sent across the network to the nodes where the reducers run. This whole shuffle phase is avoided by having a Mapper only job in Hadoop, making it faster.

That's all for this topic How to Write a Map Only Job in Hadoop MapReduce. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Word Count MapReduce Program in Hadoop
  2. How to Compress Intermediate Map Output in Hadoop
  3. Input Splits in Hadoop
  4. Uber Mode in Hadoop
  5. NameNode, DataNode And Secondary NameNode in HDFS

You may also like -

>>>Go to Hadoop Framework Page

Parquet File Format in Hadoop


Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem (Hive, HBase, MapReduce, Pig, Spark).

What is columnar storage format

In order to understand the Parquet file format in Hadoop better, first let's see what a columnar format is. In a column oriented format the values of each column in the records are stored together.

For example, if there is a record which comprises of ID, Name and Department then all the values for the ID column will be stored together, values for the Name column together and so on. If we take the same record schema having three fields ID (int), NAME (varchar) and Department (varchar)-

ID    Name    Department
1     emp1    d1
2     emp2    d2
3     emp3    d3

For this table, in a row wise storage format the data will be stored as follows (values of one row kept together)-

1 emp1 d1 | 2 emp2 d2 | 3 emp3 d3

Whereas the same data will be stored as follows in a column oriented storage format (values of one column kept together)-

1 2 3 | emp1 emp2 emp3 | d1 d2 d3

How it helps

As you can see from the storage formats, the difference matters when you need to query only a few columns, say only the NAME column. In a row storage format each record in the dataset has to be loaded, parsed into fields and then the data for Name extracted. With a column oriented format the query can go directly to the Name column, as all the values for that column are stored together, and read those values. There is no need to go through the whole record.

So a column oriented format increases query performance, as less seek time is required to get to the required columns and less reading is required since only the columns whose data is needed are read.

Seen from a Big Data context, where data is generally loaded into Hadoop after denormalizing it so the number of columns tends to be large, using a columnar file format like Parquet brings a lot of improvement in performance.

Another benefit is less storage. Compression works better if data is of the same type; with a column oriented format the values of a column, and hence of the same type, are stored together, resulting in better compression.

Parquet format

Coming back to the Parquet file format in Hadoop, since it is a column oriented format it brings the same benefits of improved performance and better compression.

One of the unique features of Parquet is that it can store data with nested structures in columnar fashion too. Other columnar file formats flatten the nested structures and store only the top level in columnar format. This means that in the Parquet file format even the nested fields can be read individually, without the need to read all the fields in the nested structure.

Primitive data types in Parquet format

Data types supported by the Parquet file format are as follows

  • BOOLEAN: 1 bit boolean
  • INT32: 32 bit signed ints
  • INT64: 64 bit signed ints
  • INT96: 96 bit signed ints
  • FLOAT: IEEE 32-bit floating point values
  • DOUBLE: IEEE 64-bit floating point values
  • BYTE_ARRAY: arbitrarily long byte arrays.

Logical types

Parquet format also defines logical types that can be used to store data, by specifying how the primitive types should be interpreted. This keeps the set of primitive types to a minimum and reuses parquet’s efficient encoding. For example, strings are stored as byte arrays (binary) with a UTF8 annotation, DATE must annotate an int32. These annotations define how to further decode and interpret the data.

For example- Defining a String in Parquet

message p {
  required binary s (UTF8);
}

Defining a date field in Parquet-

message p {
  required int32 d (DATE);
}

You can get the full list of Parquet logical types here - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

Parquet file format

To understand the Parquet file format in Hadoop you should be aware of the following three terms-

  • Row group: A logical horizontal partitioning of the data into rows. A row group consists of a column chunk for each column in the dataset.
  • Column chunk: A chunk of the data for a particular column. These column chunks live in a particular row group and are guaranteed to be contiguous in the file.
  • Page: Column chunks are divided up into pages written back to back. The pages share a common header and readers can skip over pages they are not interested in.

Parquet file format also has a header and footer. So the Parquet file format can be illustrated as follows.

Parquet File Format (figure)

Here the header just contains a 4-byte magic number "PAR1" that identifies the file as a Parquet format file.

Footer contains the following-

  • File metadata- The file metadata contains the locations of all the column metadata start locations. Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially. It also includes the format version, the schema, any extra key-value pairs.
  • length of file metadata (4-byte)
  • magic number "PAR1" (4-byte)

That's all for this topic Parquet File Format in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Sequence File in Hadoop
  2. How to Configure And Use LZO Compression in Hadoop
  3. How to Compress Intermediate Map Output in Hadoop

You may also like -

>>>Go to Hadoop Framework Page


How to Read And Write Parquet File in Hadoop


This post shows how to use Hadoop Java API to read and write Parquet file.

You will need to put the following jars in the class path in order to read and write Parquet files in Hadoop.

  • parquet-hadoop-bundle-1.10.0.jar
  • parquet-avro-1.10.0.jar
  • jackson-mapper-asl-1.9.13.jar
  • jackson-core-asl-1.9.13.jar
  • avro-1.8.2.jar

Using Avro to define schema

Rather than creating the Parquet schema and using ParquetWriter and ParquetReader directly to write and read the file, it is more convenient to use a framework like Avro to create the schema. Then you can use AvroParquetWriter and AvroParquetReader to write and read Parquet files. The mapping between the Avro and Parquet schema and the mapping of an Avro record to a Parquet record are taken care of by these classes themselves.

Writing Parquet file – Java program

The first thing you'll need is the schema; since Avro is used, you will have to define an Avro schema.

EmpSchema.avsc

{
"type": "record",
"name": "empRecords",
"doc": "Employee Records",
"fields":
[{
"name": "id",
"type": "int"

},
{
"name": "Name",
"type": "string"
},
{
"name": "Dept",
"type": "string"
}
]
}

Java program

The tasks needed in the program are as follows -

  1. First thing is to parse the schema.
  2. Then create a generic record using the Avro generic API.
  3. Once you have the record, write it to file using AvroParquetWriter.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetFileWrite {

public static void main(String[] args) {
// First thing - parse the schema as it will be used
Schema schema = parseSchema();
List<GenericData.Record> recordList = getRecords(schema);
writeToParquet(recordList, schema);
}

private static Schema parseSchema() {
Schema.Parser parser = new Schema.Parser();
Schema schema = null;
try {
// pass path to schema
schema = parser.parse(ClassLoader.getSystemResourceAsStream(
"resources/EmpSchema.avsc"));

} catch (IOException e) {
e.printStackTrace();
}
return schema;

}

private static List<GenericData.Record> getRecords(Schema schema){
List<GenericData.Record> recordList = new ArrayList<GenericData.Record>();
GenericData.Record record = new GenericData.Record(schema);
// Adding 2 records
record.put("id", 1);
record.put("Name", "emp1");
record.put("Dept", "D1");
recordList.add(record);

record = new GenericData.Record(schema);
record.put("id", 2);
record.put("Name", "emp2");
record.put("Dept", "D2");
recordList.add(record);

return recordList;
}


private static void writeToParquet(List<GenericData.Record> recordList, Schema schema) {
// Path to Parquet file in HDFS
Path path = new Path("/test/EmpRecord.parquet");
ParquetWriter<GenericData.Record> writer = null;
// Creating ParquetWriter using builder
try {
writer = AvroParquetWriter.
<GenericData.Record>builder(path)
.withRowGroupSize(ParquetWriter.DEFAULT_BLOCK_SIZE)
.withPageSize(ParquetWriter.DEFAULT_PAGE_SIZE)
.withSchema(schema)
.withConf(new Configuration())
.withCompressionCodec(CompressionCodecName.SNAPPY)
.withValidation(false)
.withDictionaryEncoding(false)
.build();

for (GenericData.Record record : recordList) {
writer.write(record);
}

}catch(IOException e) {
e.printStackTrace();
}finally {
if(writer != null) {
try {
writer.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
}

To run this Java program in Hadoop environment export the class path where your .class file for the Java program resides.

$ export HADOOP_CLASSPATH=/home/netjs/eclipse-workspace/bin 

Then you can run the Java program using the following command.

$ hadoop org.netjs.ParquetFileWrite 

18/07/05 19:56:41 INFO compress.CodecPool: Got brand-new compressor [.snappy]
18/07/05 19:56:41 INFO hadoop.InternalParquetRecordWriter:Flushing mem columnStore to file. allocated memory: 3072

Reading Parquet file – Java program

To read the parquet file created above you can use the following program.

import java.io.IOException;

import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetFileRead {

public static void main(String[] args) {
readParquetFile();
}

private static void readParquetFile() {
ParquetReader<GenericData.Record> reader = null;
Path path = new Path("/test/EmpRecord.parquet");
try {
reader = AvroParquetReader
.<GenericData.Record>builder(path)
.withConf(new Configuration())
.build();
GenericData.Record record;
while ((record = reader.read()) != null) {
System.out.println(record);
}
}catch(IOException e) {
e.printStackTrace();
}finally {
if(reader != null) {
try {
reader.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
}

Using parquet-tools jar

You can also download parquet-tools jar and use it to see the content of a Parquet file, file metadata of the Parquet file, Parquet schema etc. As example to see the content of a Parquet file-

$ hadoop jar /parquet-tools-1.10.0.jar cat /test/EmpRecord.parquet 

That's all for this topic How to Read And Write Parquet File in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. How to Read And Write SequenceFile in Hadoop
  2. Java Program to Read File in HDFS
  3. File Read in HDFS - Hadoop Framework Internal Steps

You may also like -

>>>Go to Hadoop Framework Page

Converting Text File to Parquet File Using Hadoop MapReduce


This post shows how to convert existing data to Parquet file format using MapReduce in Hadoop. In the example given here Text file is converted to Parquet file.

You will need to put the following jars in the class path in order to read and write Parquet files in Hadoop.

  • parquet-hadoop-bundle-1.10.0.jar
  • parquet-avro-1.10.0.jar
  • jackson-mapper-asl-1.9.13.jar
  • jackson-core-asl-1.9.13.jar
  • avro-1.8.2.jar

Using Avro to define schema

Rather than creating the Parquet schema directly, the Avro framework is used to create the schema as it is more convenient. Then you can use the Parquet-Avro API classes to write and read files. The mapping between the Avro and Parquet schema and the mapping of an Avro record to a Parquet record are taken care of by these classes themselves.

MapReduce code to convert file to Parquet

In the code the Avro schema is defined inline. The program uses the Avro generic API to create a generic record. Also, it is a Mapper only job as just conversion is required; records are not aggregated.


import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.parquet.avro.AvroParquetOutputFormat;
import org.apache.parquet.example.data.Group;

public class ParquetConvert extends Configured implements Tool{

/// Schema
private static final Schema MAPPING_SCHEMA = new Schema.Parser().parse(
"{\n" +
" \"type\": \"record\",\n" +
" \"name\": \"TextFile\",\n" +
" \"doc\": \"Text File\",\n" +
" \"fields\":\n" +
" [\n" +
" {\"name\": \"line\", \"type\": \"string\"}\n"+
" ]\n"+
"}\n");

// Map function
public static class ParquetConvertMapper extends Mapper<LongWritable, Text, Void, GenericRecord> {

private GenericRecord record = new GenericData.Record(MAPPING_SCHEMA);
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
record.put("line", value.toString());
context.write(null, record);
}
}

@Override
public int run(String[] args) throws Exception {
Job job = Job.getInstance(getConf(), "ParquetConvert");
job.setJarByClass(getClass());
job.setMapperClass(ParquetConvertMapper.class);
job.setNumReduceTasks(0);
job.setOutputKeyClass(Void.class);
job.setOutputValueClass(Group.class);
job.setOutputFormatClass(AvroParquetOutputFormat.class);
// setting schema
AvroParquetOutputFormat.setSchema(job, MAPPING_SCHEMA);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}


public static void main(String[] args) throws Exception{
int exitFlag = ToolRunner.run(new ParquetConvert(), args);
System.exit(exitFlag);
}

}
On running the MapReduce code using the following command

hadoop jar /PATH_TO_JAR org.netjs.ParquetConvert /test/input /test/output

you can see that the Parquet file is written at the output location.

hdfs dfs -ls /test/output

Found 4 items
-rw-r--r-- 1 netjs supergroup 0 2018-07-06 09:54 /test/output/_SUCCESS
-rw-r--r-- 1 netjs supergroup 276 2018-07-06 09:54 /test/output/_common_metadata
-rw-r--r-- 1 netjs supergroup 429 2018-07-06 09:54 /test/output/_metadata
-rw-r--r-- 1 netjs supergroup 646 2018-07-06 09:54 /test/output/part-m-00000.parquet

Reading Parquet file using MapReduce 

The following MapReduce program takes a Parquet file as input and outputs a text file. In the Parquet file the records are in the following format, so you need to write appropriate logic to extract the relevant part.

{"line": "Hello wordcount MapReduce Hadoop program."}
{"line": "This is my first MapReduce program."}
{"line": "This file will be converted to Parquet using MR."}

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.example.ExampleInputFormat;

public class ParquetRead extends Configured implements Tool{
// Map function
public static class ParquetMapper extends Mapper<NullWritable, Group, NullWritable, Text> {
public void map(NullWritable key, Group value, Context context)
throws IOException, InterruptedException {
NullWritable outKey = NullWritable.get();
String line = value.toString();

String[] fields = line.split(": ");
context.write(outKey, new Text(fields[1]));

}
}

@Override
public int run(String[] args) throws Exception {
Job job = Job.getInstance(getConf(), "ParquetRead");
job.setJarByClass(getClass());
job.setMapperClass(ParquetMapper.class);
job.setNumReduceTasks(0);
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(Text.class);

job.setInputFormatClass(ExampleInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception{
int exitFlag = ToolRunner.run(new ParquetRead(), args);
System.exit(exitFlag);
}

}
If you want to read back the data written by the Parquet conversion MapReduce program you can use the following command.

hadoop jar /PATH_TO_JAR org.netjs.ParquetRead /test/output/part-m-00000.parquet /test/out

That's all for this topic Converting Text File to Parquet File Using Hadoop MapReduce. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. How to Read And Write Parquet File in Hadoop
  2. How to Configure And Use LZO Compression in Hadoop
  3. MapReduce Flow in YARN
  4. Input Splits in Hadoop

You may also like -

>>>Go to Hadoop Framework Page

Chaining MapReduce Job in Hadoop


While processing data using MapReduce you may want to break the requirement into a series of tasks and do them as a chain of MapReduce jobs rather than doing everything within one MapReduce job and making it more complex. Hadoop provides two predefined classes, ChainMapper and ChainReducer, for the purpose of chaining MapReduce jobs in Hadoop.

ChainMapper class in Hadoop

Using ChainMapper class you can use multiple Mapper classes within a single Map task. The Mapper classes are invoked in a chained fashion, the output of the first becomes the input of the second, and so on until the last Mapper, the output of the last Mapper will be written to the task's output.

For adding map tasks to the ChainMapper the addMapper() method is used.

ChainReducer class in Hadoop

Using the predefined ChainReducer class in Hadoop you can chain multiple Mapper classes after a Reducer within the Reducer task. For each record output by the Reducer, the Mapper classes are invoked in a chained fashion. The output of the reducer becomes the input of the first mapper and output of first becomes the input of the second, and so on until the last Mapper, the output of the last Mapper will be written to the task's output.

For setting the Reducer class to the chain job setReducer() method is used.

For adding a Mapper class to the chain reducer addMapper() method is used.

How to chain MapReduce jobs

Using the ChainMapper and the ChainReducer classes it is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*].

In the chain of MapReduce job you can have-

  • A chain of map tasks executed using ChainMapper
  • A reducer set using ChainReducer.
  • A chain of map tasks added using ChainReducer (This step is optional).

Special care has to be taken when creating chains that the key/values output by a Mapper are valid for the following Mapper in the chain.

Benefits of using a chained MapReduce job

  • When MapReduce jobs are chained, the output of intermediate mappers is kept in memory rather than written to disk, so the next mapper in the chain doesn't have to read its input from disk. The immediate benefit of this pattern is a dramatic reduction in disk IO.
  • Gives you a chance to break the problem into simpler tasks and execute them as a chain.

Chained MapReduce job example

Let’s take a simple example to show chained MapReduce job in action. Here input file has item, sales and zone columns in the below format (tab separated) and you have to get the total sales per item for zone-1.

Item1 345 zone-1
Item1 234 zone-2
Item3 654 zone-2
Item2 231 zone-3

For the sake of the example let’s say the first mapper reads all the records and the second mapper filters them to keep only the records for zone-1. The reducer gets the total for each item, and then the records are flipped so that the key becomes the value and the value becomes the key. For that InverseMapper is used, which is another predefined mapper in Hadoop.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Sales extends Configured implements Tool{
// First Mapper
public static class CollectionMapper extends Mapper<LongWritable, Text, Text, Text>{
private Text item = new Text();

public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
//splitting record
String[] salesArr = value.toString().split("\t");
item.set(salesArr[0]);
// Writing (sales,zone) as value
context.write(item, new Text(salesArr[1] + "," + salesArr[2]));
}
}

// Mapper 2
public static class FilterMapper extends Mapper<Text, Text, Text, IntWritable>{
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {

String[] recordArr = value.toString().split(",");
// Filtering on zone
if(recordArr[1].equals("zone-1")) {
Integer sales = Integer.parseInt(recordArr[0]);
context.write(key, new IntWritable(sales));
}
}
}

// Reduce function
public static class TotalSalesReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}

public static void main(String[] args) throws Exception {
int exitFlag = ToolRunner.run(new Sales(), args);
System.exit(exitFlag);
}

@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Sales");
job.setJarByClass(getClass());

// MapReduce chaining
Configuration mapConf1 = new Configuration(false);
ChainMapper.addMapper(job, CollectionMapper.class, LongWritable.class, Text.class,
Text.class, Text.class, mapConf1);

Configuration mapConf2 = new Configuration(false);
ChainMapper.addMapper(job, FilterMapper.class, Text.class, Text.class,
Text.class, IntWritable.class, mapConf2);

Configuration reduceConf = new Configuration(false);
ChainReducer.setReducer(job, TotalSalesReducer.class, Text.class, IntWritable.class,
Text.class, IntWritable.class, reduceConf);

ChainReducer.addMapper(job, InverseMapper.class, Text.class, IntWritable.class,
IntWritable.class, Text.class, null);

job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}

}

That's all for this topic Chaining MapReduce Job in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Converting Text File to Parquet File Using Hadoop MapReduce
  2. How to Write a Map Only Job in Hadoop MapReduce
  3. Data Locality in Hadoop
  4. How to Compress Intermediate Map Output in Hadoop
  5. Java Program to Write File in HDFS

You may also like -

>>>Go to Hadoop Framework Page

Predefined Mapper And Reducer Classes in Hadoop


Hadoop framework comes prepackaged with many Mapper and Reducer classes. This post explains some of these predefined Mappers and Reducers in Hadoop and shows examples using these predefined Mapper and Reducer classes.

Predefined Mapper classes in Hadoop

  1. ChainMapper- The ChainMapper class allows you to use multiple Mapper classes within a single Map task. Using this predefined class you can chain mapper classes where the output of one mapper becomes the input of the second mapper. That helps in breaking a complex task with lots of data processing into a chain of smaller tasks.
  2. FieldSelectionMapper- This class implements a mapper that can be used to perform field selections in a manner similar to Unix cut. The input data is treated as fields separated by a user specified separator (the default value is "\t"). The user can specify a list of fields that form the map output keys, and a list of fields that form the map output values. The field separator is set through the attribute "mapreduce.fieldsel.data.field.separator". The map output field list spec is set through the attribute "mapreduce.fieldsel.map.output.key.value.fields.spec"; the value is expected to be of the form "keyFieldsSpec:valueFieldsSpec". keyFieldsSpec/valueFieldsSpec are comma (,) separated field specs: fieldSpec,fieldSpec,fieldSpec ... Each field spec can be a simple number (e.g. 5) specifying a specific field, a range (like 2-5) specifying a range of fields, or an open range (like 3-) specifying all the fields starting from field 3. By using this predefined class you don't need to write your own mapper with the split logic; you can configure FieldSelectionMapper with the required data to split the record.
  3. InverseMapper- This predefined Mapper swaps keys and values.
  4. TokenCounterMapper- Tokenize the input values and emit each word with a count of 1. This predefined class can be used where you want to do the sum of values like in a word count MapReduce program.
  5. MultithreadedMapper- This Mapper is a multithreaded implementation of org.apache.hadoop.mapreduce.Mapper. This predefined mapper is useful if your job is more I/O bound than CPU bound.
  6. ValueAggregatorMapper- This class implements the generic mapper of Aggregate.
  7. WrappedMapper- This predefined mapper wraps a given one to allow custom Mapper.Context implementations.
  8. RegexMapper- A Mapper that extracts text matching a regular expression.

Predefined Reducer classes in Hadoop

  1. ChainReducer- The ChainReducer class allows you to chain multiple Mapper classes after a Reducer within the Reducer task. For each record output by the Reducer, the Mapper classes are invoked in a chained fashion. The output of the reducer becomes the input of the first mapper, the output of the first becomes the input of the second, and so on until the last Mapper; the output of the last Mapper is written to the task's output.
  2. IntSumReducer- This predefined Reducer is used to sum the int values grouped with a key. You can use this predefined reducer where you want to get the sum of values grouped by keys.
  3. LongSumReducer- This predefined Reducer is used to sum the long values grouped with a key.
  4. FieldSelectionReducer- This class implements a reducer that can be used to perform field selections in a manner similar to Unix cut. The input data is treated as fields separated by a user specified separator (the default value is "\t"). The user can specify a list of fields that form the reduce output keys, and a list of fields that form the reduce output values. The fields are the union of those from the key and those from the value. The field separator is set through the attribute "mapreduce.fieldsel.data.field.separator". The reduce output field list spec is set through the attribute "mapreduce.fieldsel.reduce.output.key.value.fields.spec"; the value is expected to be of the form "keyFieldsSpec:valueFieldsSpec". keyFieldsSpec/valueFieldsSpec are comma (,) separated field specs: fieldSpec,fieldSpec,fieldSpec ... As an example, "4,3,0,1:6,5,1-3,7-" specifies to use fields 4,3,0 and 1 for keys, and fields 6,5,1,2,3,7 and above for values.
  5. ValueAggregatorReducer- This class implements the generic reducer of Aggregate.
  6. WrappedReducer- A Reducer which wraps a given one to allow for custom Reducer.Context implementations.

Predefined Mapper and Reducer class examples

Example 1- If you have to get a few fields of the input file you can use FieldSelectionMapper for the same. Let’s say you have data in the following format for item, zone and total sales.

Item1 zone-1 234
Item1 zone-2 456
Item3 zone-2 123

And you need to find total sales for each item which means you’ll have to extract field 0 and field 2 in your Mapper.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.fieldsel.FieldSelectionHelper;
import org.apache.hadoop.mapreduce.lib.fieldsel.FieldSelectionMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class SalesCalc extends Configured implements Tool {

// Reduce function
public static class TotalSalesReducer extends Reducer<Text, Text, Text, IntWritable>{

public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (Text val : values) {
sum += Integer.parseInt(val.toString());
}
context.write(key, new IntWritable(sum));
}
}

public static void main(String[] args) throws Exception {
int exitFlag = ToolRunner.run(new SalesCalc(), args);
System.exit(exitFlag);
}

@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
// setting the separator
conf.set(FieldSelectionHelper.DATA_FIELD_SEPERATOR, "\t");
// Configure the fields that are to be extracted
conf.set(FieldSelectionHelper.MAP_OUTPUT_KEY_VALUE_SPEC, "0:2");
Job job = Job.getInstance(conf, "Sales");
job.setJarByClass(getClass());
// setting predefined FieldSelectionMapper
job.setMapperClass(FieldSelectionMapper.class);

job.setReducerClass(TotalSalesReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}
Example 2- You can write a word count MapReduce program using the predefined TokenCounterMapper and IntSumReducer. In that case you don’t need to write any logic; just configure these classes and run your MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool{

public static void main(String[] args) throws Exception{
int exitFlag = ToolRunner.run(new WordCount(), args);
System.exit(exitFlag);

}

@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "WordCount");
job.setJarByClass(getClass());
// Setting the predefined mapper and reducer
job.setMapperClass(TokenCounterMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}

That's all for this topic Predefined Mapper And Reducer Classes in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Chaining MapReduce Job in Hadoop
  2. MapReduce Flow in YARN
  3. Speculative Execution in Hadoop
  4. How to Compress MapReduce Job Output in Hadoop
  5. Replica Placement Policy in Hadoop Framework

You may also like -

>>>Go to Hadoop Framework Page

How to Check Hadoop MapReduce Logs


If you are wondering how to add logging to your Hadoop MapReduce job, or where to check the MapReduce logs and even the System.out statements, then this post shows the same. Note that here accessing logs is shown for MapReduce 2.

Location of logs in Hadoop MapReduce

An application ID is created for every MapReduce job. You can get that application ID from the console itself after starting your MapReduce job. It will be similar to what is shown below.

18/07/11 14:39:23 INFO impl.YarnClientImpl: Submitted application application_1531299441901_0001 

A folder with the same application ID will be created in logs/userlogs of your Hadoop installation directory. For example I can see the following directory for the application ID mentioned above: HADOOP_INSTALLATION_DIR/logs/userlogs/application_1531299441901_0001

Within this directory you will find separate folders created for the mappers and reducers, and there you will have the following files for logs and sysouts.

syslog- Contains the log messages.

stdout- Contains the System.out messages.
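
If log aggregation is enabled on your cluster (yarn.log-aggregation-enable set to true), you should also be able to fetch the aggregated logs for a finished application from the command line using the application ID, for example-

yarn logs -applicationId application_1531299441901_0001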

MapReduce example with logs

Here is a simple word count MapReduce program with logs and sysouts added.


import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {
public static final Log log = LogFactory.getLog(WordCount.class);
// Map function
public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
private Text word = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
log.info("In map method");
// Splitting the line on spaces
String[] stringArr = value.toString().split("\\s+");
System.out.println("Array length- " + stringArr.length);
for (String str : stringArr) {
word.set(str);
context.write(word, new IntWritable(1));
}

}
}

// Reduce function
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
log.info("In reduce method with key " + key);
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
System.out.println("Key - " + key + " sum - " + sum);
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception {
int exitFlag = ToolRunner.run(new WordCount(), args);
System.exit(exitFlag);
}

@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "WordCount");
job.setJarByClass(getClass());
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}
Once you run this MapReduce job, you can use the application ID to go to the location explained above and check the log and stdout messages.

That's all for this topic How to Check Hadoop MapReduce Logs. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. How to Handle Missing And Under Replicated Blocks in HDFS
  2. How to Compress Intermediate Map Output in Hadoop
  3. How to Write a Map Only Job in Hadoop MapReduce
  4. Predefined Mapper And Reducer Classes in Hadoop
  5. How to Configure And Use LZO Compression in Hadoop

You may also like -

>>>Go to Hadoop Framework Page

Shuffle And Sort Phases in Hadoop MapReduce


When you run a MapReduce job and the mappers start producing output, a lot of processing is done internally by the Hadoop framework before the reducers get their input. This whole internal processing is known as the shuffle phase in the Hadoop framework.

The tasks done internally by the Hadoop framework within the shuffle phase are as follows-

  1. Data from mappers is partitioned as per the number of reducers.
  2. Data is also sorted by keys with in a partition.
  3. Output from the Maps is written to disk as many temporary files.
  4. Once the map task is finished all the files written to the disk are merged to create a single file.
  5. Data from a particular partition (from all mappers) is transferred to the reducer that is supposed to process that particular partition.
  6. If the data transferred to a reducer exceeds the memory limit, it is copied to disk.
  7. Once the reducer has got its portion of data from all the mappers, the data is again merged, while still maintaining the sort order of keys, to create the reduce task input.

As you can see some of the shuffle phase tasks happen at the nodes where mappers are running and some of them at the nodes where reducers are running.

Shuffle phase process at mappers side

When the map task starts producing output it is not directly written to disk; instead there is a memory buffer (100 MB by default) where the map output is kept. This size is configurable and the parameter used is mapreduce.task.io.sort.mb.

When the data in memory is spilled to disk is controlled by the configuration parameter mapreduce.map.sort.spill.percent (the default is 80% of the memory buffer). Once this threshold of 80% is reached, a thread begins to spill the contents to disk in the background.

Before writing to the disk the Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. For example let's say there are 4 mappers and 2 reducers for a MapReduce job. Then output of all of these mappers will be divided into 2 partitions one for each reducer.

(Figure: shuffle phase in Hadoop)

If a Combiner is configured, it is also executed in order to reduce the size of the data written to the disk.

This process of keeping data in memory, partitioning and sorting it, creating a new spill file every time the threshold is reached and writing data to the disk is repeated until all the records for the particular map task are processed. Before the Map task finishes, all these spill files are merged, keeping the data partitioned and sorted by keys within each partition, to create a single merged file.

Following image illustrates the shuffle phase process at the Map end.

(Figure: shuffle phase at the map side)

Shuffle phase process at Reducer side

By this time you have the Map output ready and stored on a local disk of the node where Map task was executed. Now the relevant partition of the output of all the mappers has to be fetched by the framework to the nodes where reducers are running.

Reducers don’t wait for all the map tasks to finish before starting to copy the data; as soon as a Map task is finished, data transfer from that node starts. For example, if there are 10 mappers running, the framework won’t wait for all 10 mappers to finish before starting the map output transfer; as soon as a map task finishes, transfer of its data starts.

Data copied from the mappers is kept in a memory buffer at the reducer side too. The size of the buffer is configured using the following parameter.

mapreduce.reduce.shuffle.input.buffer.percent- The percentage of memory, relative to the maximum heap size as typically specified in mapreduce.reduce.java.opts, that can be allocated to storing map outputs during the shuffle. The default is 70%.

When the buffer reaches a certain threshold map output data is merged and written to disk.

This merging of Map outputs is known as sort phase. During this phase the framework groups Reducer inputs by keys since different mappers may have output the same key.

The threshold for triggering the merge to disk is configured using the following parameter.

mapreduce.reduce.merge.inmem.threshold- The number of sorted map outputs fetched into memory before being merged to disk. In practice, this is usually set very high (1000) or disabled (0), since merging in-memory segments is often less expensive than merging from disk.

The merged file, which is the combination of the data written to the disk as well as the data still kept in memory, constitutes the input for the Reduce task.
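
As a rough sketch of how these shuffle related properties could be set from a driver (the values used here are only illustrative assumptions, not recommendations), consider the following-

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Map side: sort buffer size in MB and the spill threshold
        conf.set("mapreduce.task.io.sort.mb", "256");
        conf.set("mapreduce.map.sort.spill.percent", "0.85");
        // Reduce side: shuffle buffer as a fraction of the reducer heap
        conf.set("mapreduce.reduce.shuffle.input.buffer.percent", "0.70");
        // Number of in-memory map outputs merged before spilling to disk
        conf.set("mapreduce.reduce.merge.inmem.threshold", "1000");

        Job job = Job.getInstance(conf, "ShuffleTuning");
        // set mapper, reducer, input and output paths as usual before submitting
        System.out.println("io.sort.mb = " + job.getConfiguration().get("mapreduce.task.io.sort.mb"));
    }
}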

(Figure: shuffle and sort phases in MapReduce)

Points to note-

  1. The Mapper outputs are sorted and then partitioned per Reducer.
  2. The total number of partitions is the same as the number of reduce tasks for the job.
  3. Reducer has 3 primary phases: shuffle, sort and reduce.
  4. Input to the Reducer is the sorted output of the mappers.
  5. In shuffle phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
  6. In sort phase the framework groups Reducer inputs by keys from different map outputs.
  7. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

That's all for this topic Shuffle And Sort Phases in Hadoop MapReduce. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. MapReduce Flow in YARN
  2. Predefined Mapper And Reducer Classes in Hadoop
  3. Speculative Execution in Hadoop
  4. Uber Mode in Hadoop
  5. Data Compression in Hadoop

You may also like -

>>>Go to Hadoop Framework Page

Apache Avro Format in Hadoop


The Apache Avro file format, created by Doug Cutting, is a data serialization system for Hadoop. Avro provides simple integration with dynamic languages. Avro implementations for C, C++, C#, Java, PHP, Python, and Ruby are available.

Avro file

Avro file has two things-

  • Data definition (Schema)
  • Data

Both the data definition and the data are stored together in one file. Within the Avro file there is a header, and in the header there is a metadata section where the schema is stored. All objects stored in the file must be written according to that schema.

Avro Schema

Avro relies on schemas for reading and writing data. Avro schemas are defined with JSON that helps in data interoperability. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).

While defining schema you can write it in a separate file having .avsc extension.

Avro Data

Avro data is serialized and stored in binary format which makes for a compact and efficient storage. Avro data itself is not tagged with type information because the schema used to write data is always available when the data is read. The schema is required to parse data. This permits each datum to be written with no per-value overheads, making serialization both fast and small.

Avro file format

Avro specifies an object container file format. A file has a schema, and all objects stored in the file must be written according to that schema, using binary encoding.

Objects are stored in blocks that may be compressed. Synchronization markers are used between blocks to permit efficient splitting of files for MapReduce processing.

A file consists of:

  • A file header, followed by
  • one or more file data blocks

Following image shows the Avro file format.

Header | Data block | Data block | …

Avro file header consists of:

  1. Four bytes, ASCII 'O', 'b', 'j', followed by 1.
  2. File metadata, including the schema.
  3. The 16-byte, randomly-generated sync marker for this file.

A file header is thus described by the following schema:


{"type": "record", "name": "org.apache.avro.file.Header",
"fields" : [
{"name": "magic", "type": {"type": "fixed", "name": "Magic", "size": 4}},
{"name": "meta", "type": {"type": "map", "values": "bytes"}},
{"name": "sync", "type": {"type": "fixed", "name": "Sync", "size": 16}},
]
}
A file data block consists of:
  1. A long indicating the count of objects in this block.
  2. A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied
  3. The serialized objects. If a codec is specified, this is compressed by that codec.
  4. The file's 16-byte sync marker.

How schema is defined in Avro

Avro Schema is defined using JSON and consists of-
  1. A JSON string, naming a defined type.
  2. A JSON object, of the form: {"type": "typeName" ...attributes...}
    where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
  3. A JSON array, representing a union of embedded types.

Primitive Types in Avro

Primitive types used in Avro are as follows-
  • null: no value
  • boolean: a binary value
  • int: 32-bit signed integer
  • long: 64-bit signed integer
  • float: single precision (32-bit) IEEE 754 floating-point number
  • double: double precision (64-bit) IEEE 754 floating-point number
  • bytes: sequence of 8-bit unsigned bytes
  • string: unicode character sequence
As an example, if you are defining a field of type string-

{"name": "personName", "type": "string"}

Complex Types in Avro

Avro supports six kinds of complex types: record, enum, array, map, union and fixed.

record- Records are defined using the type name "record" and support following attributes:

  • name- A JSON string providing the name of the record, this is a required attribute.
  • doc- A JSON string providing documentation to the user of this schema, this is an optional attribute.
  • aliases- A JSON array of strings, providing alternate names for this record, this is an optional attribute.
  • fields- A JSON array, listing fields, this is a required attribute. Each field in Record is a JSON object with the following attributes:
    • name- A JSON string providing the name of the field, this is a required attribute.
    • doc- A JSON string describing this field for users, this is an optional attribute.
    • type- A JSON object defining a schema, or a JSON string naming a record definition, this is a required attribute.
    • default- A default value for this field, used when reading instances that lack this field, this is an optional attribute.
    • order- Specifies how this field impacts sort ordering of this record, this is an optional attribute. Valid values are "ascending" (the default), "descending", or "ignore".
    • aliases- A JSON array of strings, providing alternate names for this field, this is an optional attribute.
As an example, a schema for Person having Id, Name and Address fields-

{
"type": "record",
"name": "PersonRecord",
"doc": "Person Record",
"fields": [
{"name":"Id", "type":"long"},
{"name":"Name", "type":"string"},
{"name":"Address", "type":"string"}
]
}

enum- Enums use the type name "enum" and support the following attributes:

  • name- A JSON string providing the name of the enum; this is a required attribute. namespace- A JSON string that qualifies the name; this is an optional attribute.
  • aliases- A JSON array of strings, providing alternate names for this enum, this is an optional attribute.
  • doc- a JSON string providing documentation to the user of this schema, this is an optional attribute.
  • symbols- A JSON array, listing symbols, as JSON strings, this is a required attribute. All symbols in an enum must be unique; duplicates are prohibited.
For example, four seasons can be defined as:

{ "type": "enum",
"name": "Seasons",
"symbols" : ["WINTER", "SPRING", "SUMMER", "AUTUMN"]
}

array- Arrays use the type name "array" and support a single attribute:

  • items- The schema of the array's items.
For example, an array of strings is declared with:

{"type": "array", "items": "string"}

map- Maps use the type name "map" and support one attribute:

  • values- The schema of the map's values.
Map keys are assumed to be strings. For example, a map from string to int is declared with:

{"type": "map", "values": "int"}

union- Unions are represented using JSON arrays. For example, ["null", "string"] declares a schema which may be either a null or a string. Avro data conforming to this union should match one of the schemas represented by the union.
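
A common use of a union is to make a field optional by combining "null" with another type; for instance, a hypothetical optional field could be declared like this-

{"name": "middleName", "type": ["null", "string"], "default": null}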

fixed- Fixed uses the type name "fixed" and supports following attributes:

  • name- A string naming this fixed type; this is a required attribute. namespace- A string that qualifies the name; this is an optional attribute.
  • aliases- A JSON array of strings, providing alternate names for this fixed type, this is an optional attribute.
  • size- An integer, specifying the number of bytes per value, this is a required attribute.
For example, a 16-byte quantity may be declared with:

{"type": "fixed", "size": 16, "name": "md5"}

Data encoding in Avro

Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster.
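
As a rough illustration of the difference, the sketch below uses the Avro Java API to serialize one record of the PersonRecord schema shown earlier with both encoders (the file name person.avsc used here is just an assumption for the example)-

import java.io.ByteArrayOutputStream;
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class AvroEncodingDemo {
    public static void main(String[] args) throws Exception {
        // person.avsc is assumed to contain the PersonRecord schema shown above
        Schema schema = new Schema.Parser().parse(new File("person.avsc"));
        GenericRecord record = new GenericData.Record(schema);
        record.put("Id", 1L);
        record.put("Name", "Jack");
        record.put("Address", "1, Richmond Drive");

        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);

        // Binary encoding - compact, field names are not stored with the data
        ByteArrayOutputStream binaryOut = new ByteArrayOutputStream();
        Encoder binaryEncoder = EncoderFactory.get().binaryEncoder(binaryOut, null);
        writer.write(record, binaryEncoder);
        binaryEncoder.flush();

        // JSON encoding - human readable but larger
        ByteArrayOutputStream jsonOut = new ByteArrayOutputStream();
        Encoder jsonEncoder = EncoderFactory.get().jsonEncoder(schema, jsonOut);
        writer.write(record, jsonEncoder);
        jsonEncoder.flush();

        System.out.println("Binary size in bytes: " + binaryOut.size());
        System.out.println("JSON output: " + jsonOut.toString("UTF-8"));
    }
}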

Reference: https://avro.apache.org/docs/1.8.2/index.html

That's all for this topic Apache Avro Format in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Parquet File Format in Hadoop
  2. Sequence File in Hadoop
  3. How to Configure And Use LZO Compression in Hadoop
  4. File Write in HDFS - Hadoop Framework Internal Steps
  5. Java Program to Read File in HDFS

You may also like -

>>>Go to Hadoop Framework Page


How to Read And Write Avro File in Hadoop


In this post we’ll see a Java program to read and write Avro files in Hadoop environment.

Jars download

For reading and writing an Avro file using the Java API you will need to download the following jars and add them to your project's classpath.

  • avro-1.8.2.jar
  • avro-tools-1.8.2.jar
The Avro Java implementation also depends on the Jackson JSON library, so you'll also need
  • jackson-mapper-asl-1.9.13.jar
  • jackson-core-asl-1.9.13.jar

Writing Avro file – Java program

To write an Avro file using the Java API the steps are as follows.
  1. You need an Avro schema.
  2. In your program you will have to parse that schema.
  3. Then you need to create records referring that parsed schema.
  4. Write those records to file.

Avro Schema

The Avro schema used for the program is called Person.avsc and it resides in the folder resources within the project structure.


{
"type": "record",
"name": "personRecords",
"doc": "Personnel Records",
"fields":
[{
"name": "id",
"type": "int"

},
{
"name": "Name",
"type": "string"
},
{
"name": "Address",
"type": "string"
}
]
}

Java Code


import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AvroFileWrite {

public static void main(String[] args) {
Schema schema = parseSchema();
writetoAvro(schema);
}


// parsing the schema
private static Schema parseSchema() {
Schema.Parser parser = new Schema.Parser();
Schema schema = null;
try {
// Path to schema file
schema = parser.parse(ClassLoader.getSystemResourceAsStream(
"resources/Person.avsc"));

} catch (IOException e) {
e.printStackTrace();
}
return schema;
}

private static void writetoAvro(Schema schema) {
GenericRecord person1 = new GenericData.Record(schema);
person1.put("id", 1);
person1.put("Name", "Jack");
person1.put("Address", "1, Richmond Drive");

GenericRecord person2 = new GenericData.Record(schema);
person2.put("id", 2);
person2.put("Name", "Jill");
person2.put("Address", "2, Richmond Drive");

DatumWriter<GenericRecord> datumWriter = new
GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = null;
try {
//out file path in HDFS
Configuration conf = new Configuration();
// change the IP
FileSystem fs = FileSystem.get(URI.create(
"hdfs://124.32.45.0:9000/test/out/person.avro"), conf);
OutputStream out = fs.create(new Path(
"hdfs://124.32.45.0:9000/test/out/person.avro"));

dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
// for compression
//dataFileWriter.setCodec(CodecFactory.snappyCodec());
dataFileWriter.create(schema, out);

dataFileWriter.append(person1);
dataFileWriter.append(person2);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}finally {
if(dataFileWriter != null) {
try {
dataFileWriter.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}

}
}
}
To run this Java program in the Hadoop environment, export the classpath where the .class file for the Java program resides.

$ export HADOOP_CLASSPATH=/home/netjs/eclipse-workspace/bin
Then you can run the Java program using the following command.

$ hadoop org.netjs.AvroFileWrite

Reading Avro file – Java program

If you want to read back the Avro file written by the above program, you can use the following Java program.

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class AvroFileRead {

public static void main(String[] args) {
Schema schema = parseSchema();
readFromAvroFile(schema);
}

// parsing the schema
private static Schema parseSchema() {
Schema.Parser parser = new Schema.Parser();
Schema schema = null;
try {
// Path to schema file
schema = parser.parse(ClassLoader.getSystemResourceAsStream(
"resources/Person.avsc"));

} catch (IOException e) {
e.printStackTrace();
}
return schema;
}

private static void readFromAvroFile(Schema schema) {

Configuration conf = new Configuration();
DataFileReader<GenericRecord> dataFileReader = null;
try {
// change the IP
FsInput in = new FsInput(new Path(
"hdfs://124.32.45.0:9000/user/out/person.avro"), conf);
DatumReader<GenericRecord> datumReader = new
GenericDatumReader<GenericRecord>(schema);
dataFileReader = new DataFileReader<GenericRecord>(in, datumReader);
GenericRecord person = null;
while (dataFileReader.hasNext()) {
person = dataFileReader.next(person);
System.out.println(person);
}
}catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}finally {
if(dataFileReader != null) {
try {
dataFileReader.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}

}
}
}
You can run the Java program using the following command.

$ hadoop org.netjs.AvroFileRead

{"id": 1, "Name": "Jack", "Address": "1, Richmond Drive"}
{"id": 2, "Name": "Jill", "Address": "2, Richmond Drive"}

That's all for this topic How to Read And Write Avro File in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Converting Text File to Parquet File Using Hadoop MapReduce
  2. How to Read And Write SequenceFile in Hadoop
  3. How to Configure And Use LZO Compression in Hadoop
  4. Java Program to Write File in HDFS
  5. How to Compress MapReduce Job Output in Hadoop

You may also like -

>>>Go to Hadoop Framework Page

Using Avro File With Hadoop MapReduce


In this post we’ll see how to use Avro file with Hadoop MapReduce.

Avro MapReduce jar

You will need to download the following jar and put it on your project's classpath.

avro-mapred-1.8.2.jar

Avro MapReduce

In this MapReduce program we have to get the total sales per item, and the output of the MapReduce job is an Avro file. Records are in the following tab separated format.


Item1 345 zone-1
Item1 234 zone-2
Item3 654 zone-2
Item2 231 zone-3
MapReduce code

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvroMR extends Configured implements Tool{
/// Schema
private static final Schema SALES_SCHEMA = new Schema.Parser().parse(
"{\n" +
" \"type\": \"record\",\n" +
" \"name\": \"SalesRecord\",\n" +
" \"doc\": \"Sales Records\",\n" +
" \"fields\":\n" +
" [\n" +
" {\"name\": \"item\", \"type\": \"string\"},\n"+
" {\"name\": \"totalsales\", \"type\": \"int\"}\n"+
" ]\n"+
"}\n");

//Mapper
public static class ItemMapper extends Mapper<LongWritable, Text, AvroKey<Text>,
AvroValue<GenericRecord>>{
private Text item = new Text();
private GenericRecord record = new GenericData.Record(SALES_SCHEMA);
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
//splitting record
String[] salesArr = value.toString().split("\t");
item.set(salesArr[0]);
record.put("item", salesArr[0]);
record.put("totalsales", Integer.parseInt(salesArr[1]));
context.write(new AvroKey<Text>(item), new AvroValue<GenericRecord>(record));
}
}

// Reducer
public static class SalesReducer extends Reducer<AvroKey<Text>, AvroValue<GenericRecord>,
AvroKey<GenericRecord>, NullWritable>{
public void reduce(AvroKey<Text> key, Iterable<AvroValue<GenericRecord>> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (AvroValue<GenericRecord> value : values) {
GenericRecord record = value.datum();
sum += (Integer)record.get("totalsales");
}
GenericRecord record = new GenericData.Record(SALES_SCHEMA);
record.put("item", key.datum());
record.put("totalsales", sum);
context.write(new AvroKey<GenericRecord>(record), NullWritable.get());
}
}

public static void main(String[] args) throws Exception{
int exitFlag = ToolRunner.run(new AvroMR(), args);
System.exit(exitFlag);
}

@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "AvroMR");
job.setJarByClass(getClass());
job.setMapperClass(ItemMapper.class);
job.setReducerClass(SalesReducer.class);
AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setMapOutputValueSchema(job, SALES_SCHEMA);
AvroJob.setOutputKeySchema(job, SALES_SCHEMA);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(AvroKeyOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}

Run the MapReduce code using the following command.

hadoop jar /home/netjs/netjshadoop.jar org.netjs.AvroMR /test/input/sales.txt /test/out/sales 

That creates an Avro file as output. To see the content of the output file you can use the following command.

hadoop jar /PATH_TO_JAR/avro-tools-1.8.2.jar tojson /test/out/sales/part-r-00000.avro 

That's all for this topic Using Avro File With Hadoop MapReduce. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. How to Read And Write Avro File in Hadoop
  2. Converting Text File to Parquet File Using Hadoop MapReduce
  3. Compressing File in bzip2 Format in Hadoop - Java Program
  4. How to Compress Intermediate Map Output in Hadoop
  5. Shuffle And Sort Phases in Hadoop MapReduce

You may also like -

>>>Go to Hadoop Framework Page

ToolRunner and GenericOptionsParser in Hadoop


GenericOptionsParser is a utility class in Hadoop which resides in the org.apache.hadoop.util package. The GenericOptionsParser class helps in setting options through the command line. It parses the command line arguments and sets them on a configuration object that can then be used in the application.

How GenericOptionsParser class is used

Rather than using the GenericOptionsParser class directly, generally you will implement the Tool interface in your MapReduce class and use the ToolRunner.run method to run your application, which will use GenericOptionsParser internally to parse the command line arguments.

How GenericOptionsParser class helps

If you set configuration arguments within your code then you are hard coding those arguments. Any change in any argument will require a code change and recreation of the jar.

Passing arguments on the command line gives you the flexibility to add, remove or change arguments without requiring any change in the code.
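
For example, a minimal Tool implementation like the hypothetical sketch below can pick up a property passed with -D at run time through getConf(), without any code change or jar rebuild (the property name my.demo.greeting is made up for illustration)-

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ShowProperty extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner uses GenericOptionsParser internally, so any -D property
        // is already available in the Configuration returned by getConf()
        String greeting = getConf().get("my.demo.greeting", "no value passed");
        System.out.println("my.demo.greeting = " + greeting);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ShowProperty(), args));
    }
}

It could then be run with something like hadoop ShowProperty -D my.demo.greeting=Hello (assuming the class is on the Hadoop classpath), and the passed value would be printed.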

Generic Options

You can specify command line arguments using the following generic options.

  1. -archives <comma separated list of archives>- Specify comma separated archives to be unarchived on the compute machines. Applies only to job.
  2. -conf <configuration file>- Specify an application configuration file.
  3. -D <property>=<value>- Use value for given property.
  4. -files <comma separated list of files>- Specify comma separated files to be copied to the map reduce cluster. Applies only to job.
  5. -fs <file:///> or <hdfs://namenode:port>- Specify default filesystem URL to use. Overrides ‘fs.defaultFS’ property from configurations.
  6. -jt <local> or <resourcemanager:port>- Specify a ResourceManager. Applies only to job.
  7. -libjars <comma seperated list of jars>- Specify comma separated jar files to include in the classpath. Applies only to job.

GenericOptionParser with ToolRunner example

In the post Using Avro File With Hadoop MapReduce there is an example of using an Avro file with MapReduce. In that example the Avro schema is inlined within the code.

Here the same example is written by passing that schema file (saleschema.avsc) as a command line argument.

saleschema.avsc


{
"type": "record",
"name": "SalesRecord",
"doc" : "Sales Records",
"fields":
[
{"name":"item", "type": "string"},
{"name":"totalsales", "type": "int"}
]
}

MapReduce code


import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvroMR extends Configured implements Tool{

//Mapper
public static class ItemMapper extends Mapper<LongWritable, Text, AvroKey<Text>,
AvroValue<GenericRecord>>{
private Text item = new Text();
private GenericRecord record;
@Override
protected void setup(Context context)
throws IOException, InterruptedException {
// Getting the file passed as arg in command line
Schema SALES_SCHEMA = new Schema.Parser().parse(new File("saleschema.avsc"));
record = new GenericData.Record(SALES_SCHEMA);
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
//splitting record
String[] salesArr = value.toString().split("\t");
item.set(salesArr[0]);
record.put("item", salesArr[0]);
record.put("totalsales", Integer.parseInt(salesArr[1]));
context.write(new AvroKey<Text>(item), new AvroValue<GenericRecord>(record));
}
}

// Reducer
public static class SalesReducer extends Reducer<AvroKey<Text>, AvroValue<GenericRecord>,
AvroKey<GenericRecord>, NullWritable>{
Schema SALES_SCHEMA;
@Override
protected void setup(Context context) throws IOException, InterruptedException {
// Getting the file passed as arg in command line
SALES_SCHEMA = new Schema.Parser().parse(new File("saleschema.avsc"));
}
public void reduce(AvroKey<Text> key, Iterable<AvroValue<GenericRecord>> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (AvroValue<GenericRecord> value : values) {
GenericRecord record = value.datum();
sum += (Integer)record.get("totalsales");
}
GenericRecord record = new GenericData.Record(SALES_SCHEMA);
record.put("item", key.datum());
record.put("totalsales", sum);
context.write(new AvroKey<GenericRecord>(record), NullWritable.get());
}
}

public static void main(String[] args) throws Exception{
int exitFlag = ToolRunner.run(new AvroMR(), args);
System.exit(exitFlag);
}

@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = Job.getInstance(conf, "AvroMR");
job.setJarByClass(getClass());
job.setMapperClass(ItemMapper.class);
job.setReducerClass(SalesReducer.class);
AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
// Schema file needed here also
Schema SALES_SCHEMA = new Schema.Parser().parse(
new File("/home/netjs/saleschema.avsc"));
AvroJob.setMapOutputValueSchema(job, SALES_SCHEMA);
AvroJob.setOutputKeySchema(job, SALES_SCHEMA);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(AvroKeyOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}
Run this Hadoop MapReduce program with the schema file passed as a command line argument.

hadoop jar /home/netjs/netjshadoop.jar org.netjs.AvroMR -files /home/netjs/saleschema.avsc /test/input/sales.txt /test/out/sale

Here the location of the schema file in the local file system is passed as a command line argument.

You can see the content of the Avro output file using the avro-tools jar.


hadoop jar /PATH_TO_JAR/avro-tools-1.8.2.jar tojson /test/out/sale/part-r-00000.avro

{"item":"Item1","totalsales":1158}
{"item":"Item2","totalsales":642}
{"item":"Item3","totalsales":1507}

That's all for this topic ToolRunner and GenericOptionsParser in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. How to Handle Missing And Under Replicated Blocks in HDFS
  2. What is SafeMode in Hadoop
  3. How to Compress MapReduce Job Output in Hadoop
  4. How to Write a Map Only Job in Hadoop MapReduce
  5. How to Check Hadoop MapReduce Logs

You may also like -

>>>Go to Hadoop Framework Page

What Are Counters in Hadoop MapReduce


If you have run a MapReduce job you would have seen a lot of counters displayed on the console after the MapReduce job finished (you can also check the counters using the UI while the job is running). These counters in Hadoop MapReduce give a lot of statistical information about the executed job. Apart from giving you information about the tasks, these counters also help you in diagnosing problems in the MapReduce job and improving MapReduce performance.

For example you get information about spilled records and memory usage, which gives you an indication of the performance of your MapReduce job.

Types of counters in Hadoop

There are 2 types of Counters in Hadoop MapReduce.
  1. Built-In Counters
  2. User-Defined Counters or Custom counters

Built-In Counters in MapReduce

Hadoop Framework has some built-in counters which give information pertaining to-
  1. File system like bytes read, bytes written.
  2. MapReduce job like launched map and reduce tasks
  3. MapReduce task like map input records, combiner output records.

These built-in counters are grouped based on the type of information they provide and represented by Enum classes in Hadoop framework. Following is the list of the Counter groups and the corresponding Enum class names.

  1. File System Counters – org.apache.hadoop.mapreduce.FileSystemCounter
  2. Job Counters– org.apache.hadoop.mapreduce.JobCounter
  3. Map-Reduce Framework Counters– org.apache.hadoop.mapreduce.TaskCounter
  4. File Input Format Counters– org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
  5. File Output Format Counters– org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
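
If you want to read one of these built-in counters programmatically instead of from the console, you can get it from the Job object after the job completes. Below is a minimal sketch; the job setup reuses the predefined TokenCounterMapper and IntSumReducer just to have something runnable, and the input and output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CounterReadDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "CounterReadDemo");
        job.setJarByClass(CounterReadDemo.class);
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        if (job.waitForCompletion(true)) {
            // Built-in counter from the Map-Reduce Framework group
            Counter mapInput = job.getCounters().findCounter(TaskCounter.MAP_INPUT_RECORDS);
            System.out.println(mapInput.getDisplayName() + " = " + mapInput.getValue());
        }
    }
}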

File System Counters in MapReduce

File system counters will be repeated for each type of file system, prefixed with the file system for each entry. As example FILE: Number of bytes read, HDFS: Number of bytes read.

  • Number of bytes read- Displays the number of bytes read by the file system for both Map and Reduce tasks.
  • Number of bytes written- Displays the number of bytes written by the file system for both Map and Reduce tasks.
  • Number of read operations- Displays the number of read operations by both Map and Reduce tasks.
  • Number of large read operations- Displays the number of large read operations (example: traversing the directory tree) for both Map and Reduce tasks.
  • Number of write operations- Displays the number of write operations by both Map and Reduce tasks.

Job Counters in MapReduce

These counters give information about the whole job not at the task level.
  • Launched map tasks- Displays the number of launched map tasks.
  • Launched reduce tasks- Displays the number of launched reduce tasks.
  • Launched uber tasks- Displays the number of tasks launched as uber tasks.
  • Data-local map tasks- Displays the number of mappers run on the same node where the input data they have to process resides.
  • Rack-local map tasks- Displays the number of mappers run on a node on the same rack where the input data they have to process resides.
  • Map in uber tasks- Displays the number of maps run as uber tasks.
  • Reduce in uber tasks- Displays the number of reducers run as uber tasks.
  • Total time spent by all map tasks- Total time in milliseconds running all the launched map tasks.
  • Total time spent by all reduce tasks- Total time in milliseconds running all the launched reduce tasks.
  • Failed map tasks- Displays the number of map tasks that failed.
  • Failed reduce tasks- Displays the number of reduce tasks that failed.
  • Failed uber tasks- Displays the number of uber tasks that failed.
  • Killed map tasks- Displays the number of killed map tasks.
  • Killed reduce tasks- Displays the number of killed reduce tasks.

Map-Reduce Framework Counters

These counters collect information about the running task.

  • Map input records– Displays the number of records processed by all the maps in the MR job.
  • Map output records– Displays the number of output records produced by all the maps in the MR job.
  • Map skipped records– Displays the number of records skipped by all the maps.
  • Map output bytes– Displays the number of bytes produced by all the maps in the MR job.
  • Map output materialized bytes– Displays the Map output bytes written to the disk.
  • Reduce input groups– Displays the number of key groups processed by all the Reducers.
  • Reduce shuffle bytes– Displays the number of bytes of Map output copied to Reducers in shuffle process.
  • Reduce input records– Displays the number of input records processed by all the Reducers.
  • Reduce output records– Displays the number of output records produced by all the Reducers.
  • Reduce skipped records– Displays the number of records skipped by Reducer.
  • Input split bytes– Displays the data about input split objects in bytes.
  • Combine input records– Displays the number of input records processed by combiner.
  • Combine output records– Displays the number of output records produced by combiner.
  • Spilled Records– Displays the number of records spilled to the disk by all the map and reduce tasks.
  • Shuffled Maps– Displays the number of map output files transferred during shuffle process to nodes where reducers are running.
  • Failed Shuffles– Displays the number of map output files failed during shuffle.
  • Merged Map outputs– Displays the number of map outputs merged after map output is transferred.
  • GC time elapsed– Displays the garbage collection time in milliseconds.
  • CPU time spent– Displays the CPU processing time spent in milliseconds.
  • Physical memory snapshot– Displays the total physical memory used in bytes.
  • Virtual memory snapshot– Displays the total virtual memory used in bytes.
  • Total committed heap usage– Displays the total amount of heap memory available in bytes.

File Input Format Counters in MapReduce

  • Bytes Read– Displays the bytes read by Map tasks using the specified Input format.

File Output Format Counters in MapReduce

  • Bytes Written– Displays the bytes written by Map and reduce tasks using the specified Output format.

User defined counters in MapReduce

You can also create user defined counters in Hadoop using a Java enum. The name of the enum becomes the counter group's name whereas each field in the enum is a counter name.

You can increment these counters in the mapper or reducer based on some logic that will help you with debugging. User defined counters are also aggregated across all the mappers or reducers by the Hadoop framework and displayed as a single unit.

User defined counter example

Suppose you have data in the following format and in some records the sales data is missing.

Item1 345 zone-1
Item1 zone-2
Item3 654 zone-2
Item2 231 zone-3

Now you want to determine the number of records where the sales data is missing, to get a picture of how much skew the missing fields introduce into your analysis.

MapReduce code


import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;


public class SalesCalc extends Configured implements Tool {
enum Sales {
SALES_DATA_MISSING
}
// Mapper
public static class SalesMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
private Text item = new Text();
IntWritable sales = new IntWritable();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// Splitting the line on tab
String[] salesArr = value.toString().split("\t");
item.set(salesArr[0]);

if(salesArr.length > 1 && salesArr[1] != null && !salesArr[1].trim().equals("")) {
sales.set(Integer.parseInt(salesArr[1]));
}else {
// incrementing counter
context.getCounter(Sales.SALES_DATA_MISSING).increment(1);
sales.set(0);
}

context.write(item, sales);
}
}

// Reducer
public static class TotalSalesReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}

public static void main(String[] args) throws Exception {
int exitFlag = ToolRunner.run(new SalesCalc(), args);
System.exit(exitFlag);
}

@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = Job.getInstance(conf, "SalesCalc");
job.setJarByClass(getClass());
job.setMapperClass(SalesMapper.class);
job.setReducerClass(TotalSalesReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}

In the counters displayed for the MapReduce job you can see the user defined counter showing the number of records where the sales figure is missing.


org.netjs.SalesCalc$Sales
SALES_DATA_MISSING=4
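
If you want to use this value in the driver rather than just reading it from the console, the counter can be fetched from the Job after completion. A minimal sketch, reusing the Sales enum and the job object from the SalesCalc driver above -


import org.apache.hadoop.mapreduce.Counter;

// In SalesCalc.run(), in place of the plain return statement
boolean success = job.waitForCompletion(true);
if (success) {
  // Total across all map tasks, aggregated by the framework
  Counter missing = job.getCounters().findCounter(Sales.SALES_DATA_MISSING);
  System.out.println(missing.getDisplayName() + " = " + missing.getValue());
}
return success ? 0 : 1;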

That's all for this topic What Are Counters in Hadoop MapReduce. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. ToolRunner and GenericOptionsParser in Hadoop
  2. Chaining MapReduce Job in Hadoop
  3. Predefined Mapper And Reducer Classes in Hadoop
  4. Input Splits in Hadoop
  5. Data Locality in Hadoop

You may also like -

>>>Go to Hadoop Framework Page

Using Combiner to Improve MapReduce Performance in Hadoop


In this post we’ll see what is combiner in Hadoop and how combiner helps in speeding up the shuffle and sort phase in Hadoop MapReduce.

What is combiner in Hadoop

Generally in a MapReduce job, data is transformed in the map phase and aggregated in the reduce phase. By specifying a combiner function you can also aggregate data at the map phase.

You can specify a combiner in your MapReduce driver using the following statement -


job.setCombinerClass(COMBINER_CLASS.class);

Note that specifying combiner in your MapReduce job is optional.

How combiner helps in improving MapReduce performance

Once the map tasks start producing output, that data has to be stored in memory, partitioned as per the number of reducers, sorted on keys and then spilled to the disk.

Once the map task is done, the data partitions have to be sent to the reducers (on different nodes) working on those specific partitions. As you can see, this whole shuffle and sort process consumes memory, involves disk I/O and transfers data across the network.

If you specify a combiner function, it is run on the in-memory map output before it is written to disk, so there is less data to write to the disk (reducing I/O), which in turn means less data is transferred to the reducer nodes (reducing network bandwidth).

For example, suppose you have sales data for several items and you are trying to find the maximum sales figure per item. For Item1, assume the following (key, value) pairs are the output of Map-1 and Map-2.

Map-1
(Item1, 300)
(Item1, 250)
(Item1, 340)
(Item1, 450)
Map-2
(Item1, 120)
(Item1, 540)
(Item1, 290)
(Item1, 300)
Then the reduce function, which gets the data for this key (Item1), will receive all these (key, value) pairs as input after the shuffle phase.

[Item1,(300,250,340,450,120,540,290,300)]

Resulting in final output - (Item1, 540)

If you are using a combiner in the MapReduce job and the reducer class itself is used as the combiner class, then the combiner will be called on each map's output.

Map-1 Combiner output

      (Item1, 450) 

Map-2 Combiner output

      (Item1, 540)

Input to Reducer - [Item1, (450, 540)]

Resulting in final output - (Item1, 540)

So you can see that by using a combiner the map output is reduced, which means less data is written to disk and less data is transferred to the reducer nodes.

How to write a Combiner function

For writing a combiner class you need to extend Reducer and implement the reduce method, just like you do when writing a reducer. In fact, in many cases the reducer itself can be used as the combiner.

The input and output key/value types of the combiner must be the same as the output key/value types of the mapper (which are also the input types of the reducer).

Combiner in Hadoop

It is not always possible to use the reducer as the combiner class though; the classic example of this constraint is the calculation of an average.

For example, suppose two maps produce the following values for the same key

Map-1 – (1, 4, 7) and Map-2 – (8, 9)

Then the reduce function will calculate the average as – (1+4+7+8+9)/5 = 29/5 = 5.8

whereas with a combiner that also calculates an average per map output

Map-1 – (1+4+7)/3 = 12/3 = 4

Map-2 – (8+9)/2 = 17/2 = 8.5

the average calculated at the reduce side will be – (4+8.5)/2 = 12.5/2 = 6.25, which is not the correct value of 5.8. That is why an averaging reducer cannot simply be reused as the combiner; a combiner-friendly way of averaging is sketched below.
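
A common way to make an average combiner-friendly is to pass a partial sum together with a count from the map side, so that both the combiner and the reducer only add partial sums and counts, and the division happens once at the very end. The sketch below assumes a hypothetical mapper that emits, for every record, the item as key and a Text value of the form "sales,1"; the class names are illustrative and not part of any example in this post.


import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// These nested classes would sit inside a driver class, next to the mapper.

// Combiner: adds up "sum,count" pairs; input and output types stay
// (Text, Text) so they match the assumed mapper output types
public static class AvgCombiner extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    long count = 0;
    for (Text val : values) {
      String[] parts = val.toString().split(",");
      sum += Long.parseLong(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    context.write(key, new Text(sum + "," + count));
  }
}

// Reducer: adds the remaining partial sums and counts, then divides once
public static class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    long count = 0;
    for (Text val : values) {
      String[] parts = val.toString().split(",");
      sum += Long.parseLong(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    context.write(key, new DoubleWritable((double) sum / count));
  }
}

Because adding partial sums and counts is associative, the result is the same whether the combiner runs zero, one or several times, which is exactly the property an operation needs in order to be used safely in a combiner.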

Combiner with MapReduce example

Here is an example where a combiner is specified while calculating the maximum sales figure per item.


import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxSales extends Configured implements Tool{
// Mapper
public static class MaxSalesMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
private Text item = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// Splitting the line on tab
String[] stringArr = value.toString().split("\t");
item.set(stringArr[0]);
Integer sales = Integer.parseInt(stringArr[1]);
context.write(item, new IntWritable(sales));
}
}

// Reducer
public static class MaxSalesReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int maxSalesValue = Integer.MIN_VALUE;
for(IntWritable val : values) {
maxSalesValue = Math.max(maxSalesValue, val.get());
}
result.set(maxSalesValue);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
int exitFlag = ToolRunner.run(new MaxSales(), args);
System.exit(exitFlag);

}

@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = Job.getInstance(conf, "MaxSales");
job.setJarByClass(getClass());
job.setMapperClass(MaxSalesMapper.class);
// Specifying combiner class
job.setCombinerClass(MaxSalesReducer.class);
job.setReducerClass(MaxSalesReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}

In the counters displayed for the MapReduce job you can see the reduction in the number of records passed to the reducer.

Map input records=21
Map output records=21
Map output bytes=225
Map output materialized bytes=57
Input split bytes=103
Combine input records=21
Combine output records=4
Reduce input groups=4
Reduce shuffle bytes=57
Reduce input records=4
Reduce output records=4

For comparison, here are the counters when the same MapReduce job is run without a combiner class.

Map input records=21
Map output records=21
Map output bytes=225
Map output materialized bytes=273
Input split bytes=103
Combine input records=0
Combine output records=0
Reduce input groups=4
Reduce shuffle bytes=273
Reduce input records=21
Reduce output records=4
Spilled Records=42

That's all for this topic Using Combiner to Improve MapReduce Performance in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. Chaining MapReduce Job in Hadoop
  2. What Are Counters in Hadoop MapReduce
  3. MapReduce Flow in YARN
  4. How to Check Hadoop MapReduce Logs
  5. Speculative Execution in Hadoop

You may also like -

>>>Go to Hadoop Framework Page
