Adaltas Cloud Academy
Sign out >

New York City Taxi

The New York City Taxi & Limousine Commission provides a public data set about taxi rides in New York City from 2009 to 2019. This set is best used as a timeseries dataset.

Description of data

The dataset contains all information related to the time and location of pickup and drop-off as well as information on the drivers’ revenue. The data is organized by year and month and stored in HDFS in uncompressed type from year 2010 to 2013.

  • The data takes about 116 GB in text CSV format.
  • The storage path of the data in HDFS is /data/nyc_taxi

Schema of trip data

The trip dataset contains informations about taxi trip in New York City. Each trip is represented by multiples fields:

Fields Types Descriptions
vendor_id String Verifone Transportation Systems (VTS), or Mobile Knowledge Systems Inc (CMT), implemented as part of the Technology Passenger Enhancements Project
pickup_datetime DateTime the start time of the trip, mm-dd-yyyy hh24:mm:ss EDT
dropoff_datetime DateTime the end time of the trip, mm-dd-yyyy hh24:mm:ss EDT
passenger_count Short the number of passengers on the trip, default value is one
trip_distance Float the trip distance measured by the taximeter in miles.
pickup_longitude and pickup_latitude Float the GPS coordinates at the start of the trip
rate_code Short the taximeter rate
store_and_fwd_flag Short this flag indicates whether the trip record was held in vehicle memory before sending to the vendor (1= store and forward trip, 0= not a store and forward trip)
dropoff_longitude and dropoff_latitude Float the GPS coordinates at the end of the trip
trip_time_in_secs Float the trip time measured by the taximeter in seconds
payment_type String the Cash or credit card.
fare_amount Float the meter fare, it should include the Newark surcharge, in USD
surcharge Float the extra fees, such as rush hour and overnight surcharges, in USD
mta_tax Float the metropolitan commuter transportation mobility tax, in USD
tip_amount Float the tip amount, in USD
tolls_amount Float the total price paid for tolls, summed across all tolls for the trip, in USD
total_amount Float all charges that are presented to the passenger at time of fare payment (includes tip for non-cash trips), in USD

Hive Tables

The data from this original site being not standard enough, we have used a modified subset of this data from the University of Illinois Data Bank which generated a standardized dataset from 2010 to 2013 in which they split the trips and fares in two tables. This dataset is described here. We used these in order to create Hive tables of exploitable data.

Hive supports multiple file formats, the data can be stored in any type depending to the use case and the purpose of analysis. Its default file format being ORC, our data is saved in this format to take advantage of the optimized performance and its subsequent use in BI or data analysis. Likewise, AVRO is included as another storage format that could be further used in stream processing.

The Hive database datasets contains the following tables:

  • taxi_trip_orc
  • taxi_fare_orc
  • taxi_trip_avro
  • taxi_fare_avro

The tables are available in the ORC and Avro formats.

Feel free to query these to explore the NYC Taxi dataset.

Example query

0: jdbc:hive2://zoo-1.au.adaltas.cloud:2181,z> select year, count(*) tot from datasets.taxi_trip_orc group by year order by tot desc;
+-----------------------------+------------+
|            year             |    tot     |
+-----------------------------+------------+
| 2012                        | 178544324  |
| 2011                        | 176897199  |
| 2013                        | 173179759  |
| 2010                        | 169001153  |
+-----------------------------+------------+

Additional information

TLC Trip Record Data official page