This is an age-old problem that most projects have to solve.
History: It falls under flow-based programming: https://en.wikipedia.org/wiki/Flow-based_programming
Our focus is moving data from system A to system B: extraction and loading only, with little attention to transformations.
Option 0: Hand coding in Python / Java / Perl, etc.
This is good for small data sets and also for POCs.
Not recommended for production unless you also handle failover, job management, job scheduling, etc.
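To make Option 0 concrete, here is a minimal sketch of a hand-coded extract-and-load job in Python. It is only an illustration under assumptions: both systems are stood in for by in-memory SQLite databases, and the table name `customers`, its columns, and the helper names `copy_table` and `demo` are hypothetical, not taken from any real project.

```python
import sqlite3


def copy_table(source, target, table, batch_size=500):
    """Copy (id, name) rows of `table` from source to target in batches.

    Returns the number of rows copied. The table name is interpolated
    into the SQL, so it must come from trusted code, not user input.
    """
    cursor = source.execute(f"SELECT id, name FROM {table}")
    copied = 0
    while True:
        rows = cursor.fetchmany(batch_size)  # extract one batch
        if not rows:
            break
        target.executemany(
            f"INSERT INTO {table} (id, name) VALUES (?, ?)", rows
        )  # load the batch
        target.commit()  # commit per batch so a crash loses at most one batch
        copied += len(rows)
    return copied


def demo():
    """Build two throwaway databases (hypothetical schema) and copy between them."""
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for conn in (source, target):
        conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    source.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, f"name-{i}") for i in range(1, 1201)],
    )
    source.commit()
    copied = copy_table(source, target, "customers")
    return copied, target
```

Note what the sketch leaves out: retries, restart after failure, scheduling and monitoring. That gap is exactly why this approach is fine for a POC but risky in production.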
Option 1: If the system is heavy and needs a robust solution, it is better to go with Apache NiFi.
The US National Security Agency open-sourced its NiagaraFiles (NiFi) data-flow software.
How to enable security for NiFi?
How to write Java code for NiFi and other languages?
Other directory with date suffix examples
Commercial support available:
Externalizing variables is possible.
It is easy to move configurations from QA to Prod.
We can slim down the system to minimize its footprint.
NiFi supports Hadoop HDFS.
But Storm's objective is different.
Option 2: Use the streaming API of Apache Spark
Sqoop Vs Flume
Option 3: If you are using CDAP, it is better to use Hydrator to generate the JSON and use it.
A bit more study is required around metrics, management and tracking of these jobs.
Better to stay away from the CDAP stack. There is not much public adoption, and questions on their forums go unanswered. If we call them, they ask us to buy their support/consulting hours. There is nothing wrong with that, but we can't afford it.
Their poor support is visible in their user groups.
Option 4: Pentaho Kettle
It is not ready for Big Data as of March 2017.
Good for small Java enterprise projects (coding against the Kettle API is required). Used in the past.
http://javadoc.pentaho.com/kettle/ – Java documentation quality is not good.
Option 5: Commercial products
http://www.alteryx.com/ is a good product, and it works well with https://www.tableau.com/ (BI/Analytics).
Option 6: Spring Batch
If we want to minimize the number of servers and want a minimal solution, Spring Batch is a good choice.
But it needs ongoing maintenance whenever the Spring / Java version changes.
Spring Integration: http://docs.spring.io/spring-integration/reference/html/ftp.html
Spring batch partitioning: https://keyholesoftware.com/2013/12/09/spring-batch-partitioning/
Spring Batch Reference: http://docs.spring.io/spring-batch/reference/html/index.html
Spring Batch UI: http://docs.spring.io/spring-batch-admin/reference/reference.xhtml
Use Apache NiFi as much as possible. It works well in production and is also quick for POCs.
As of March 11, 2017: https://groups.google.com/forum/#!topic/cdap-user/hiuUP3jIxNs
CDAP Hydrator is not in a position to compete with Apache NiFi