A few weeks ago, I was demonstrating how to use a generic, Talend-based data offloading framework that I developed late last year. The framework allows a data team to download data from source data stores such as Oracle, MySQL, and MS SQL Server, perform various operations on it, and then upload it to cloud platforms such as Amazon Web Services (AWS) S3 buckets, AWS Redshift, and Azure data stores (in addition to HDFS and Hive). The data offloading activity is managed as a sequence of atomic tasks that are executed as a batch, one after the other. For example, to offload an Oracle table to Redshift (via an S3 bucket), we need to create a batch of the following atomic tasks: download the table from Oracle to a file, upload the file to S3, and copy it from S3 into Redshift.
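To make the batch-of-atomic-tasks idea concrete, here is a minimal sketch in plain Java (the language Talend jobs compile down to) of how such a pipeline might be modeled. The Task interface, the Batch runner, and the table, file, and bucket names are illustrative assumptions for this post, not components of the actual framework.

```java
import java.util.ArrayList;
import java.util.List;

// One atomic unit of work in an offloading batch (illustrative only).
interface Task {
    void execute() throws Exception;
}

// Runs its tasks strictly in the order they were added, stopping at the first failure.
class Batch {
    private final List<Task> tasks = new ArrayList<>();

    Batch add(Task task) {
        tasks.add(task);
        return this;
    }

    void run() throws Exception {
        for (Task task : tasks) {
            task.execute();
        }
    }
}

public class OracleToRedshiftOffload {
    public static void main(String[] args) throws Exception {
        // Oracle table -> local file -> S3 -> Redshift, expressed as three atomic tasks.
        // The table name, file path, and bucket are placeholders for this sketch.
        new Batch()
            .add(() -> System.out.println("download ORDERS from Oracle to /tmp/orders.csv"))
            .add(() -> System.out.println("upload /tmp/orders.csv to s3://staging-bucket/orders/"))
            .add(() -> System.out.println("COPY orders FROM s3://staging-bucket/orders/ into Redshift"))
            .run();
    }
}
```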
This approach makes it easy to add or remove tasks. For example, a task that archives the data can easily be added to the batch, assuming the archive functionality is available in the framework, as sketched below.
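Continuing the same illustrative sketch, an archive step would simply be one more task appended to the batch; the archive destination shown here is a placeholder:

```java
// Hypothetical extension of the sketch above: archive the downloaded file
// once it has been copied into Redshift. Assumes an archive task already
// exists in the framework; the destination path is a placeholder.
Batch batch = new Batch()
    .add(() -> System.out.println("download ORDERS from Oracle to /tmp/orders.csv"))
    .add(() -> System.out.println("upload /tmp/orders.csv to s3://staging-bucket/orders/"))
    .add(() -> System.out.println("COPY orders FROM s3://staging-bucket/orders/ into Redshift"))
    .add(() -> System.out.println("archive /tmp/orders.csv to s3://archive-bucket/orders/"));
batch.run();
```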