Hortonworks Data Platform
Hive Modeling / Hive Queries
Spark Scala API (Scaladoc)
Spark Java API (Javadoc)
Spark Python API (Sphinx)
Spark R API (Roxygen2)
Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus instead on their application logic. Apache Twill allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads.
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
An easy to use, powerful, and reliable system to process and distribute data.
Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications, whether on laptops, data center VMs, or the cloud.
Used to package PY/C++ into docker.
Digital Asset Management (DAM)
PRISM – https://polimetlase.wordpress.com/2017/03/10/categorize-and-search-documents/
Apache Kafka: A Distributed Streaming Platform.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.