Software Selection and Evaluation

Quantitative Methods for Software Selection and Evaluation
ftp://ftp.cert.org/public/documents/06.reports/pdf/06tn026.pdf

A Process for COTS (commercial off-the-shelf) Software Product Evaluation
http://www.sei.cmu.edu/reports/03tr017.pdf

Text Processing

Text Processing Architecture

OpenText Search Server
http://www.opentext.com/what-we-do/industries/legal/legal-content-management-edocs/opentext-search-server-edocs-edition

Noggle
https://www.noggle.online/knowledgebase/cognitive-search-engine/

Cognitive Search Is The AI Version Of Enterprise Search
http://blogs.forrester.com/mike_gualtieri/17-06-12-cognitive_search_is_the_ai_version_of_enterprise_search

https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html

MicroServices

Micro Services are a quick way to serve UI needs.

https://jaxenter.com/microservices-trends-2017-survey-133265.html
Micro Services Comparison

Python and Flask
https://stackoverflow.com/questions/10938360/how-many-concurrent-requests-does-a-single-flask-process-receive

Micro Services – Performance Comparison
https://cdelmas.github.io/2016/06/20/Performance-of-Microservices-frameworks.html

References:
http://microservices.io/
https://apigee.com/about/blog/cto-musings/api-best-practices-microservices
https://www.mulesoft.com/webinars/api/microservices-architecture

Address the following when choosing Micro Services

Domain Driven Design

Performance
Security
Concurrency
Availability of Engineers
Easy to install/maintain/monitor (Dev Ops)
Easy to develop (Developers)
Session handling
Testing
Debugging
Logging

Commercial Support when needed
Future of Project
License
Support in Amazon AWS and Microsoft Azure
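
One way to weigh the points above against each other is a simple weighted scoring matrix. The sketch below is only an illustration: the criteria names, the weights and the scores are hypothetical placeholders, not measurements of any real framework.

```python
# Hypothetical weights (they sum to 1.0) for a few of the criteria listed above.
WEIGHTS = {
    "performance": 0.3,
    "security": 0.3,
    "ease_of_development": 0.2,
    "commercial_support": 0.2,
}


def weighted_score(scores, weights=WEIGHTS):
    """Return the weighted sum of per-criterion scores (each on a 1-5 scale)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


def rank(candidates, weights=WEIGHTS):
    """Return candidate names sorted by weighted score, best first."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


# Made-up scores for two placeholder frameworks, for illustration only.
candidates = {
    "framework_a": {"performance": 4, "security": 3,
                    "ease_of_development": 5, "commercial_support": 2},
    "framework_b": {"performance": 3, "security": 5,
                    "ease_of_development": 3, "commercial_support": 4},
}

for name in rank(candidates):
    print(name, round(weighted_score(candidates[name]), 2))
```

The weights are where the team's priorities go. Changing them and re-running the ranking shows how sensitive the choice is to those priorities.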

Moving data from system A to system B

This is an age-old problem that has to be solved in the majority of projects.

History: It comes under Flow based programming: https://en.wikipedia.org/wiki/Flow-based_programming

Scope:
Our focus is to move data from system A to system B: extraction and loading only, with little attention to transformations.

———————————
Option 0: Hand coding in Python / Java / Perl, etc.
This is good for small data sets. It is also good for a POC.
It is not recommended for production unless failover, job management, job scheduling, etc. are also handled.
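
A minimal sketch of what such hand coding might look like in Python, assuming both systems are reachable as SQL databases. Two in-memory SQLite databases stand in for system A and system B here; the table name, the columns and the copy_table helper are made up for illustration.

```python
import sqlite3


def copy_table(source, target, table, batch_size=1000):
    """Copy all rows of `table` from source to target in batches; return the row count.

    `table` is interpolated into the SQL text, so it must come from trusted code.
    """
    cursor = source.execute(f"SELECT * FROM {table}")
    placeholders = ", ".join("?" for _ in cursor.description)
    insert_sql = f"INSERT INTO {table} VALUES ({placeholders})"
    copied = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        with target:  # commits the batch, or rolls it back if the insert fails
            target.executemany(insert_sql, rows)
        copied += len(rows)
    return copied


# Two in-memory databases stand in for system A and system B.
system_a = sqlite3.connect(":memory:")
system_b = sqlite3.connect(":memory:")
for db in (system_a, system_b):
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
system_a.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol")],
)
system_a.commit()

copied = copy_table(system_a, system_b, "customers", batch_size=2)
print("rows copied:", copied)
```

This covers only extraction and loading. There is no retry, no scheduling and no resume after a failure, which is exactly what the options below provide.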

———————————

Option 1: If the system is heavy and needs a robust solution, it is better to go with Apache NiFi
https://nifi.apache.org/

The US National Security Agency open-sourced its NiagaraFiles, or NiFi, data-flow software.
https://en.wikipedia.org/wiki/Apache_NiFi

How to enable security for NiFi?
http://ijokarumawak.github.io/nifi/2016/11/15/nifi-auth/

How to write Java code for NiFi and other languages?
https://community.hortonworks.com/questions/75977/run-java-code-in-apache-nifi.html

Other directory with date suffix examples
https://community.hortonworks.com/questions/44215/is-there-a-processor-in-nifi-that-can-create-many.html

Commercial support available:
https://hortonworks.com/apache/nifi/

Versioning available:
https://community.hortonworks.com/questions/61475/nifi-workflow-version-control-deployment.html

Variables can be externalized.
It is easy to move configurations from QA to Prod.

We can slim down the system to minimize its footprint
https://community.hortonworks.com/articles/32605/running-nifi-on-raspberry-pi-best-practices.html

NiFi supports Hadoop HDFS
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.PutHDFS/

Alternatives:
http://storm.apache.org/index.html
But Storm's objective is different.

———————————

Option 2: Use streaming API of Apache Spark
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Sqoop Vs Flume
https://www.dezyre.com/article/sqoop-vs-flume-battle-of-the-hadoop-etl-tools-/176

———————————

Option 3: If you are using CDAP, it is better to use Hydrator to generate JSON and use it.
A bit more study is required around metrics, management and tracking of these jobs.
http://docs.cask.co/cdap/4.1.0/en/developers-manual/pipelines/developing-pipelines.html

https://github.com/cdap-guides/cdap-etl-guide
http://blog.cask.co/2016/06/bringing-relational-data-into-data-lakes/

It is better to stay away from the CDAP stack. There is not much public acceptance, and questions on their forums get no response. If we call them, they ask us to buy their support/consulting hours. There is nothing wrong with this, but we can’t afford it.
http://cask.co/support/

Their poor support can be seen in their user group
https://groups.google.com/forum/#!forum/cdap-user

———————————

Option 4: Pentaho Kettle
http://wiki.pentaho.com/display/BAD/Kettle+on+Spark
It is not ready for Big Data as of March 2017
Good for small Java enterprise projects (coding required with the Kettle API). Used in the past.
http://javadoc.pentaho.com/kettle/ – Java documentation quality is not good.
https://community.hortonworks.com/questions/24014/what-is-the-difference-between-nifi-and-kettle.html

———————————

Option 5: Commercial products

https://www.talend.com/
http://www.robertomarchetto.com/talend_studio_vs_kettle_pentao_pdi_comparison
https://streamsets.com/

http://www.alteryx.com/ is a good product and it has better support for https://www.tableau.com/ (BI/Analytics)

———————————

Option 6: Spring Batch

If we want to minimize the number of servers and want a minimal solution, Spring Batch is a good one.
But it needs continuous maintenance whenever the Spring / Java version changes.

Spring Integration: http://docs.spring.io/spring-integration/reference/html/ftp.html
Spring batch partitioning: https://keyholesoftware.com/2013/12/09/spring-batch-partitioning/
Spring Batch Reference: http://docs.spring.io/spring-batch/reference/html/index.html
Spring Batch UI: http://docs.spring.io/spring-batch-admin/reference/reference.xhtml

———————————

Conclusion:
Use Apache NiFi as much as possible. It works well in production and is also quick for POCs.

As of March 11, 2017: https://groups.google.com/forum/#!topic/cdap-user/hiuUP3jIxNs
CDAP Hydrator is not in a position to compete with Apache NiFi
-0-

Interconnection Oriented Architecture – IOA from Equinix

http://www.equinix.com/

Shorten the Distance between your applications and data, and the people (Customers, Employees and Partners).
Localize Traffic and Services across all the locations and markets you need to reach and regionalize services globally.
Integrate and Deliver via Ecosystem Exchanges leveraging multiple clouds & SaaS providers to increase your rate of change while interconnecting with the swarm of digital partners.
Locate Data and Analytics Adjacent to improve response times and distributed scale while reducing the amount of data traversing the networks.

For more information, please download the PDF copy from their site.
http://www.equinix.com/resources/whitepapers/ioa-playbook/

It is a very interesting read, and we should make use of their experience.

Micro Service Vs Web Service

Service: Serving somebody. It takes inputs, processes them and gives a response.
First, there were Servlets.
For UI purposes, JSP came into the picture.

When Core Java and JSP were mixed, we built WAR files to deploy independent applications.
Later, we deployed servlets / REST services as independent components.

Micro Service: Independently deployable components.
We did the same in the past. This has been simplified by recent technologies like Spring Boot, Akka, etc.
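
To make "independently deployable" concrete, here is a minimal sketch of a service that runs as its own process and exposes a single endpoint. It uses only the Python standard library rather than Spring Boot or Akka. The /health path, the handler name and the helper functions are assumptions made for this example.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Takes a request, processes it and gives a JSON response."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_service(port=0):
    """Start the service on a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_health(server):
    """Call the /health endpoint and return (HTTP status, decoded JSON body)."""
    host, port = server.server_address[:2]
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/health")
        response = conn.getresponse()
        return response.status, json.loads(response.read())
    finally:
        conn.close()
```

Calling start_service() and then call_health(server) exercises the whole request/response cycle. Packaging, configuration and monitoring are what frameworks like Spring Boot add on top of this.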

Reads:
http://www.tatvasoft.com/blog/the-difference-between-micro-services-and-web-services/
http://stackoverflow.com/questions/27054162/what-are-rest-restful-soa-and-microservices-in-simple-terms
https://projects.spring.io/spring-boot/
https://spring.io/blog/2015/07/14/microservices-with-spring

ACID vs BASE databases

Relational databases strictly follow the ACID (Atomicity, Consistency, Isolation, Durability) properties, while NoSQL databases follow the BASE (Basically Available, Soft state, Eventual consistency) principles.

We need an RDBMS for transactional purposes.
It is better to use BASE stores for searches, aggregations, recommendations, etc.
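
The transactional side can be illustrated with a small sketch of atomicity, using SQLite from the Python standard library as a stand-in RDBMS. The accounts table, the CHECK constraint and the transfer function are made up for this example. A transfer either applies both updates or neither.

```python
import sqlite3


def transfer(conn, from_id, to_id, amount):
    """Move `amount` between two accounts atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, to_id),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, from_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit above
        # was rolled back together with the failed debit.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)         # enough funds: both updates are committed
rejected = transfer(conn, 1, 2, 500)  # overdraft: both updates are rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(ok, rejected, balances)
```

An eventually consistent store gives up this all-or-nothing guarantee in exchange for availability, which is acceptable for searches and recommendations but not for a transfer like this one.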

When not to use Cassandra
http://stackoverflow.com/questions/2634955/when-not-to-use-cassandra