HBase, HDFS and Hive

References
HBase with Java API: https://dzone.com/articles/handling-big-data-hbase-part-4
HBase web site, http://hbase.apache.org/
HBase wiki, http://wiki.apache.org/hadoop/Hbase
HBase Reference Guide http://hbase.apache.org/book/book.html
HBase: The Definitive Guide, http://bit.ly/hbase-definitive-guide
Google Bigtable Paper, http://labs.google.com/papers/bigtable.html
Hadoop web site, http://hadoop.apache.org/
Hadoop: The Definitive Guide, http://bit.ly/hadoop-definitive-guide
Fallacies of Distributed Computing, http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
HBase lightning talk slides, http://www.slideshare.net/scottleber/hbase-lightningtalk
Sample code, https://github.com/sleberknight/basic-hbase-examples

————-

Datawarehouse implementation using Hadoop+Hbase+Hive+SpringBatch – Part 1

————-

Hive Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual

————-

What is Hive? Hive is a data warehousing infrastructure based on Hadoop.
What is HBase? It is a distributed, versioned, column-oriented NoSQL data store, modeled after Google's Bigtable. It is used to host very large tables: billions of rows times millions of columns.
What is Hadoop? Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware, using the map-reduce programming paradigm.
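The map-reduce idea can be illustrated without Hadoop at all. Below is a minimal word-count sketch in plain Java streams (the classic MapReduce example); this is just the paradigm, not actual Hadoop API code:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    public static void main(String[] args) {
        String[] lines = { "big data", "big tables" };

        // "Map" phase: split each line into words (conceptually (word, 1) pairs).
        // "Reduce" phase: sum the counts per word.
        Map<String, Long> counts = Arrays.stream(lines)
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts.get("big"));   // 2
        System.out.println(counts.get("data"));  // 1
    }
}
```

Hadoop does the same thing, but distributes the map and reduce phases across a cluster and handles failures between them.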

HBase, Hive and HDFS

Reference: http://blog.nbostech.com/2013/03/hadoop-hive-hbase-installation-on-mac-os-x/

Learn this to solve Big Problems

https://www.coursera.org
Hadoop Platform and Application Framework
by University of California, San Diego
————-
https://university.mongodb.com/courses/MongoDB/
M101J: MongoDB for Java Developers

http://orientdb.com/docs/3.0.x/

————-
https://www.elastic.co/
https://polimetlase.wordpress.com/?s=elasticsearch

————-
CDAP
http://cask.co/products/cdap/

Hortonworks Data Platform
https://hortonworks.com/

————-

Hive Modeling / Hive Queries
https://hive.apache.org/

HDFS
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
————-
MapReduce
https://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

http://spark.apache.org/docs/latest/
Spark Scala API (Scaladoc)
Spark Java API (Javadoc)
Spark Python API (Sphinx)
Spark R API (Roxygen2)

http://twill.apache.org/
Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus instead on their application logic. Apache Twill allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads.
————-
https://tika.apache.org/
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

https://nifi.apache.org/
An easy to use, powerful, and reliable system to process and distribute data.

————-
https://www.docker.com/
Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications, whether on laptops, data center VMs, or the cloud.

https://nixos.org
Used to package Python/C++ applications into Docker images.

————-
Domain Knowledge:
Digital Asset Management (DAM)
https://polimetlase.wordpress.com/2017/03/20/digital-asset-management/
PRISM – https://polimetlase.wordpress.com/2017/03/10/categorize-and-search-documents/

————-

Apache Kafka: A Distributed Streaming Platform.
https://kafka.apache.org/

https://flume.apache.org/
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
————-

git
jira
wiki
————-

Free Articles to Test a Big Data System

Problem Statement: We want to build knowledge graphs, search repositories, data classification, etc. on a big data platform.
How do we get test data?

https://www.plos.org/

160,000+ peer-reviewed articles are free to access, reuse and redistribute.
Anytime, Anywhere.

We can use these as test data where required.

We need to create an account on plos.org to browse and download articles.

http://journals.plos.org/plosmedicine/article/file?id=10.1371/journal.pmed.1002097&type=manuscript
This shows nicely how they organize documents. We can download it and use it.

————
Programmatic Access to Articles

PLOS articles can be accessed programmatically through our API, via PubMed Central, or using Europe PMC’s RESTful Web Service and SOAP Web Service. Detailed information about our Search API, including examples, is available at http://api.plos.org/solr/faq/. If you have any questions or require assistance with our API, please contact webmaster@plos.org.

http://journals.plos.org/plosone/s/help-using-this-site
————-
9,356 Journals
6,790 searchable at Article level
129 Countries
2,457,588 Articles

Huge collection available
https://doaj.org/api/v1/docs

These APIs are not working.

————–
*****
https://www.ncbi.nlm.nih.gov/pubmed

————–

HDFS Notes

HDFS Architecture:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

HDFS Command Guide:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

HDFS is not POSIX compliant.
The Portable Operating System Interface (POSIX) is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems.

HDFS User Guide:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

Java API: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html

C/libhdfs: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/LibHdfs.html

WebHDFS API: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

Common HDFS Commands:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0-Win/bk_HDP_Install_Win/content/ref-d4ba8d91-cfe7-4655-8181-0168cc6d2681.1.html

WebHDFS REST API
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html

CDAP – DataSets Design Issues

How to write group by and order by query in CDAP?

Working Hive Query:
SELECT it, count(result) FROM (
  SELECT from_unixtime(insert_time, 'yyyy-MM-dd') it, result
  FROM default.dataset_table1) t1
GROUP BY it SORT BY it

In Oracle we have ORDER BY. Hive supports ORDER BY too (total ordering through a single reducer), but SORT BY only sorts within each reducer and scales better.

The from_unixtime() function takes seconds, not milliseconds.
While writing data into datasets, we need to use:
Instant.now().getEpochSecond();

This won't work with from_unixtime:
System.currentTimeMillis();
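To see why getEpochSecond() pairs with from_unixtime while currentTimeMillis() does not, here is a small plain-JDK sketch; the yyyy-MM-dd formatting mirrors the Hive call, assuming a UTC session time zone:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EpochSecondsDemo {
    public static void main(String[] args) {
        Instant now = Instant.now();

        // What Instant.now().getEpochSecond() returns: seconds since the epoch.
        long seconds = now.getEpochSecond();
        // What System.currentTimeMillis() returns: milliseconds since the epoch.
        long millis = now.toEpochMilli();

        // The millisecond value is 1000x larger; from_unixtime would read it
        // as a date tens of thousands of years in the future.
        System.out.println(millis / 1000 == seconds);  // true

        // Rough Java equivalent of from_unixtime(seconds, 'yyyy-MM-dd'):
        String day = DateTimeFormatter.ofPattern("yyyy-MM-dd")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochSecond(seconds));
        System.out.println(day.length());  // 10, e.g. "2017-06-15"
    }
}
```

Dividing a millisecond timestamp by 1000 recovers the seconds value, which is one way to repair data that was written with currentTimeMillis().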

———————————

http://docs.cask.co/cdap/4.2.0-SNAPSHOT/en/developers-manual/data-exploration/tables.html
Column values must be of a primitive type. A primitive type is one of boolean, int, long, float, double, bytes, or string.
Column names must be valid Hive column names. This means they cannot be reserved keywords such as drop. Please refer to the Hive language manual for more information about Hive.

Data types are from Avro. Data is stored in Hive, so it supports Hive queries.

This imposes constraints on how to design datasets and how to write queries. It also impacts query performance, because of date and time conversions.

———————————
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries