BigData – PDF Extraction

Option 1: Extract text in transit, as part of the transformation step.
http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html

Option 2: Extract text at rest, from files already stored in HDFS.
https://community.hortonworks.com/questions/85102/nifi-sftp-mutiple-files-into-mutiple-hdfs-director.html

The Apache Tika and Apache PDFBox libraries will be used for the extraction.
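
As a rough illustration of the at-rest option, here is a minimal Java sketch that walks a directory, picks out the PDF files, and hands each one to a pluggable extractor. Everything named here (TextExtractor, PdfBatchExtractor, extractAll) is hypothetical and made up for this example, and a local directory stands in for HDFS. The extractor is left as a hook so that Tika or PDFBox can be plugged in; the Tika one-liner in the comment assumes Tika is on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/**
 * Hypothetical hook for the actual PDF-to-text step.
 * With Apache Tika on the classpath, one possible implementation is:
 *   TextExtractor tika = in -> new org.apache.tika.Tika().parseToString(in);
 */
interface TextExtractor {
    String extract(InputStream in) throws Exception;
}

/** Hypothetical batch helper: a local directory stands in for an HDFS folder. */
final class PdfBatchExtractor {

    /** Returns extracted text keyed by file name, for every *.pdf directly under dir. */
    static Map<String, String> extractAll(Path dir, TextExtractor extractor) throws IOException {
        List<Path> pdfs;
        try (Stream<Path> files = Files.list(dir)) {
            pdfs = files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                    .sorted() // deterministic processing order
                    .collect(Collectors.toList());
        }

        Map<String, String> textByFile = new LinkedHashMap<>();
        for (Path pdf : pdfs) {
            try (InputStream in = Files.newInputStream(pdf)) {
                textByFile.put(pdf.getFileName().toString(), extractor.extract(in));
            } catch (Exception e) {
                // Report which file failed instead of losing that context.
                throw new IOException("Failed to extract text from " + pdf, e);
            }
        }
        return textByFile;
    }
}
```

Keeping the extractor behind a small interface means the same loop works whether the text comes from Tika's facade or from PDFBox directly, and it can be exercised with a stub before either library is wired in.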

-o-
