New York City Taxi
The New York City Taxi & Limousine Commission provides a public data set about taxi rides in New York City from 2009 to 2019. This set is best used as a timeseries dataset.
Description of data
The dataset contains all information related to the time and location of pickup and drop-off as well as information on the drivers’ revenue. The data is organized by year and month and stored in HDFS in uncompressed type from year 2010 to 2013.
- The data takes about
116 GB
in text CSV format. - The storage path of the data in HDFS is
/data/nyc_taxi
Schema of trip data
The trip
dataset contains informations about taxi trip in New York City.
Each trip is represented by multiples fields:
Fields | Types | Descriptions |
---|---|---|
vendor_id | String | Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT), implemented as part of the Technology Passenger Enhancements Project |
pickup_datetime | DateTime | the start time of the trip, mm-dd-yyyy hh24:mm:ss EDT |
dropoff_datetime | DateTime | the end time of the trip, mm-dd-yyyy hh24:mm:ss EDT |
passenger_count | Short | the number of passengers on the trip, default value is one |
trip_distance | Float | the trip distance measured by the taximeter in miles. |
pickup_longitude and pickup_latitude | Float | the GPS coordinates at the start of the trip |
rate_code | Short | the taximeter rate |
store_and_fwd_flag | Short | this flag indicates whether the trip record was held in vehicle memory before sending to the vendor (1= store and forward trip, 0= not a store and forward trip) |
dropoff_longitude and dropoff_latitude | Float | the GPS coordinates at the end of the trip |
trip_time_in_secs | Float | the trip time measured by the taximeter in seconds |
payment_type | String | the Cash or credit card. |
fare_amount | Float | the meter fare, it should include the Newark surcharge, in USD |
surcharge | Float | the extra fees, such as rush hour and overnight surcharges, in USD |
mta_tax | Float | the metropolitan commuter transportation mobility tax, in USD |
tip_amount | Float | the tip amount, in USD |
tolls_amount | Float | the total price paid for tolls, summed across all tolls for the trip, in USD |
total_amount | Float | all charges that are presented to the passenger at time of fare payment (includes tip for non-cash trips), in USD |
Hive Tables
The data from this original site being not standard enough, we have used a modified subset of this data from the University of Illinois Data Bank which generated a standardized dataset from 2010 to 2013 in which they split the trips and fares in two tables. This dataset is described here. We used these in order to create Hive tables of exploitable data.
Hive supports multiple file formats, the data can be stored in any type depending to the use case and the purpose of analysis. Its default file format being ORC, our data is saved in this format to take advantage of the optimized performance and its subsequent use in BI or data analysis. Likewise, AVRO is included as another storage format that could be further used in stream processing.
The Hive database datasets
contains the following tables:
taxi_trip_orc
taxi_fare_orc
taxi_trip_avro
taxi_fare_avro
The tables are available in the ORC and Avro formats.
Feel free to query these to explore the NYC Taxi dataset.
Example query
0: jdbc:hive2://zoo-1.au.adaltas.cloud:2181,z> select year, count(*) tot from datasets.taxi_trip_orc group by year order by tot desc;
+-----------------------------+------------+
| year | tot |
+-----------------------------+------------+
| 2012 | 178544324 |
| 2011 | 176897199 |
| 2013 | 173179759 |
| 2010 | 169001153 |
+-----------------------------+------------+