Thursday, November 11, 2021

How to Hive on GCP Using Google DataProc and Cloud Storage: Part 1

Google Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning. This includes the Hadoop ecosystem: HDFS, the MapReduce processing framework, and a number of applications such as Hive, Mahout, Pig, Spark, and Hue that are built on top of Hadoop. Hive provides a SQL-like interface for querying data stored in the various databases and file systems that integrate with Hadoop. Queries submitted via Hive are converted into MapReduce jobs that access the stored data; the results are then aggregated and returned to the user or application.
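To make the query-to-MapReduce idea concrete, here is a minimal pure-Python sketch of how an aggregate query could be broken into map, shuffle, and reduce steps. It is only an illustration, not how Hive itself is implemented: the sample rows, the column names (taxi_type, fare), and the function names are all assumptions made up for this example, and real Hive execution plans are considerably more involved. The query being imitated is roughly SELECT taxi_type, COUNT(*), SUM(fare) FROM trips GROUP BY taxi_type.

```python
from collections import defaultdict

# Hypothetical sample rows (taxi_type, fare); not taken from the real dataset.
TRIPS = [
    ("yellow", 12.5),
    ("green", 8.0),
    ("yellow", 20.0),
    ("green", 15.5),
    ("yellow", 7.5),
]


def map_phase(rows):
    """Emit one (key, value) pair per input row, as a mapper would."""
    for taxi_type, fare in rows:
        # The value carries a count of 1 and the fare for later summing.
        yield taxi_type, (1, fare)


def shuffle_phase(pairs):
    """Group all emitted values by key, imitating the shuffle/sort step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate the values for each key into (trip_count, total_fare)."""
    results = {}
    for key, values in grouped.items():
        trip_count = sum(count for count, _ in values)
        total_fare = sum(fare for _, fare in values)
        results[key] = (trip_count, total_fare)
    return results


def run_query(rows):
    """Chain the three phases, like a single GROUP BY job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for taxi_type, (trips, total) in sorted(run_query(TRIPS).items()):
        print(taxi_type, trips, total)
```

On a real cluster the map and reduce steps would run in parallel across many nodes, with the shuffle moving data between them; the single-process version above only shows the shape of the computation.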

For this exercise, we will use New York City's yellow and green taxi trip data for 2019. Yellow taxis are the only vehicles licensed to pick up street-hailing passengers anywhere in NYC, while green taxis provide street-hail and prearranged service in northern Manhattan (above E 96th St and W 110th St) and in the outer boroughs. The dataset is available at the city portal.



from DZone.com Feed https://ift.tt/3wCEVkf
