Dealing with Files in Hadoop

Use Case: We have 1 million files to process and must provide an option to download them.

Hadoop is meant to bring the processing to the data. We can store the processed file content or metadata in HBase to support easy search. After a successful search, the user will want to see the original document; at that point the file can be downloaded directly from NAS.

HDFS: This is not meant to store many small files. It is designed for large files; the default block size is 128 MB (64 MB in Hadoop 1.x). It can be configured to hold small files, but that is not its intended use.

HBase: The default block size is 64 KB. We can tune it, but HBase is not meant to store proprietary data formats.

NAS: Network Attached Storage makes it easy to store and retrieve the original files when the jobs are not MapReduce in nature.
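
The search-then-download flow described above can be sketched as follows. This is a minimal illustration, not a Hadoop or HBase client: a plain in-memory map stands in for the HBase metadata table, a local directory stands in for the NAS mount, and the names `DocumentLocator`, `register`, and `download` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: look up metadata first, then fetch the original file from storage. */
public class DocumentLocator {

	// Stand-in for the HBase table: document id -> path relative to the NAS mount.
	private final Map<String, String> metadata = new HashMap<>();
	// Stand-in for the NAS mount point (an assumption; any readable directory works).
	private final Path nasRoot;

	public DocumentLocator(Path nasRoot) {
		this.nasRoot = nasRoot;
	}

	/** Records where the original file for a document id lives. */
	public void register(String docId, String relativePath) {
		metadata.put(docId, relativePath);
	}

	/** Returns the original file's bytes, or empty if the id is unknown. */
	public Optional<byte[]> download(String docId) throws IOException {
		String relativePath = metadata.get(docId);
		if (relativePath == null) {
			return Optional.empty();
		}
		return Optional.of(Files.readAllBytes(nasRoot.resolve(relativePath)));
	}

	public static void main(String[] args) throws IOException {
		// A temporary directory plays the role of the NAS share for this demo.
		Path root = Files.createTempDirectory("nas");
		Files.write(root.resolve("report.txt"), "original document".getBytes(StandardCharsets.UTF_8));

		DocumentLocator locator = new DocumentLocator(root);
		locator.register("doc-1", "report.txt");

		byte[] content = locator.download("doc-1").orElseThrow(IllegalStateException::new);
		System.out.println(new String(content, StandardCharsets.UTF_8));
	}
}
```

In a real deployment the map lookup would be an HBase read and `nasRoot` would point at the mounted share; the point of the sketch is only that the search index holds a pointer to the file, not the file itself.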

JSON – Java

https://jsonformatter.curiousconcept.com/

Feed the JSON object returned by your URL into the following page.
It generates a Java class.

https://timboudreau.com/blog/json/read

Use this Java class to build a mock service.
Note: This tool has issues when there are nested elements.
———————————–

This is a more sophisticated way to convert JSON to Java classes.

http://www.jsonschema2pojo.org/

Note: Before you give JSON as input, find the null values and replace them with sample values so that they are picked up and become fields. Otherwise it will cause trouble when real values appear at run time.

——————————–

https://www.mkyong.com/java/jackson-2-convert-java-object-to-from-json/

—————–

JSON Path

https://github.com/json-path/JsonPath

This is similar to XPath.
Very useful when we need only a few elements.

——————-

Text Processing

Text Processing Architecture

OpenText Search Server
http://www.opentext.com/what-we-do/industries/legal/legal-content-management-edocs/opentext-search-server-edocs-edition

Noggle
https://www.noggle.online/knowledgebase/cognitive-search-engine/

http://blogs.forrester.com/mike_gualtieri/17-06-12-cognitive_search_is_the_ai_version_of_enterprise_search
Cognitive Search Is The AI Version Of Enterprise Search

https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html

Java – Selenium – Read Web Page Content

Reading pages with an HTTP client causes problems: firewalls detect it and close the connections.
The best route is to mimic a browser.
Tested on macOS 10.12.4.

package com.bible;

import java.io.File;
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

//Chrome Driver: https://chromedriver.storage.googleapis.com/index.html?path=2.29/

/**
 * Opens a page in Chrome through Selenium and prints its HTML source.
 */
public class ReadPassage {

	public static void main(String[] args) {
		// Point Selenium at the chromedriver binary shipped under resources/.
		File file = new File("resources/chromedriver");
		String absolutePath = file.getAbsolutePath();
		System.out.println("Chrome Driver Path==>" + absolutePath);
		System.setProperty("webdriver.chrome.driver", absolutePath);

		String BROWSER_URL = "http://usccb.org/bible/readings/061017.cfm";

		WebDriver driver = new ChromeDriver();
		try {
			driver.manage().window().maximize();
			driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
			driver.get(BROWSER_URL);

			// Wait up to 30 seconds for this element, so the page has loaded before we read it.
			new WebDriverWait(driver, 30)
					.until(ExpectedConditions.presenceOfElementLocated(By.id("readingssignup")));

			String source = driver.getPageSource();
			System.out.println(source);
		} finally {
			// quit() closes every window and stops the chromedriver process, even on failure.
			driver.quit();
		}
	}

}
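
Printing the page source is enough for a quick check; to keep the content for later processing it can be written to a file instead. The sketch below uses only the JDK and is independent of Selenium. The class name `PageSaver`, the method `save`, and the output file name `passage.html` are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: writes page source (for example from driver.getPageSource()) to a file. */
public class PageSaver {

	/** Writes the text to the target file as UTF-8, replacing any existing content. */
	public static void save(String pageSource, Path target) throws IOException {
		// try-with-resources closes the writer even if the write fails.
		try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
			writer.write(pageSource);
		}
	}

	public static void main(String[] args) throws IOException {
		// Placeholder text; in the program above this would be driver.getPageSource().
		save("<html><body>sample</body></html>", Paths.get("passage.html"));
	}
}
```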

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.bible</groupId>
	<artifactId>my-bible</artifactId>
	<packaging>jar</packaging>
	<version>1.0-SNAPSHOT</version>
	<name>my-bible</name>
	<url>http://maven.apache.org</url>
	<dependencies>

		<!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
		<dependency>
			<groupId>org.apache.httpcomponents</groupId>
			<artifactId>httpclient</artifactId>
			<version>4.5.3</version>
		</dependency>

		<!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java -->
		<dependency>
			<groupId>org.seleniumhq.selenium</groupId>
			<artifactId>selenium-java</artifactId>
			<version>3.3.1</version>
		</dependency>

		<!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-chrome-driver -->
		<dependency>
			<groupId>org.seleniumhq.selenium</groupId>
			<artifactId>selenium-chrome-driver</artifactId>
			<version>3.3.1</version>
		</dependency>



		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>3.8.1</version>
			<scope>test</scope>
		</dependency>
	</dependencies>
</project>

MicroServices

Micro Services are a quick way to serve UI needs.

https://jaxenter.com/microservices-trends-2017-survey-133265.html
Micro Services Comparison

Python and Flask
https://stackoverflow.com/questions/10938360/how-many-concurrent-requests-does-a-single-flask-process-receive

Micro Services – Performance Comparison
https://cdelmas.github.io/2016/06/20/Performance-of-Microservices-frameworks.html

References:
http://microservices.io/
https://apigee.com/about/blog/cto-musings/api-best-practices-microservices
https://www.mulesoft.com/webinars/api/microservices-architecture

Address the following when choosing a Micro Services framework

Domain Driven Design

Performance
Security
Concurrency
Availability of Engineers
Easy to install/maintain/monitor (Dev Ops)
Easy to develop (Developers)
Session handling
Testing
Debugging
Logging

Commercial Support when needed
Future of Project
License
Support in Amazon AWS and Microsoft Azure