Concepts and Terms

Sample Data

  • KDD Cup 1999 Data
    • This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

Distributions

Hortonworks Data Platform (HDP)

Cloudera Distribution for Hadoop (CDH)

Hadoop

  • http://hadoop.apache.org/
  • Desc. : an open-source software framework for reliable, scalable, distributed computing.
  • License : Apache

References

Readings

Tips and Tricks

Hadoop process command sample

/usr/java/default/bin/java
-Dproc_datanode -Xmx1024m
-Djava.net.preferIPv4Stack=true
-Djava.net.preferIPv4Stack=true
-Djava.net.preferIPv4Stack=true
-Dhadoop.log.dir=/var/log/hadoop/hdfs
-Dhadoop.log.file=hadoop.log
-Dhadoop.home.dir=/usr/lib/hadoop
-Dhadoop.id.str=hdfs
-Dhadoop.root.logger=INFO,console
-Djava.library.path=:/usr/lib/hadoop/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml
-Djava.net.preferIPv4Stack=true
-Dhadoop.log.dir=/var/log/hadoop/hdfs
-Dhadoop.log.file=hadoop-hdfs-datanode-bigdata01.log
-Dhadoop.home.dir=/usr/lib/hadoop
-Dhadoop.id.str=hdfs
-Dhadoop.root.logger=INFO,RFA
-Djava.library.path=:/usr/lib/hadoop/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml
-Djava.net.preferIPv4Stack=true -server -Xmx1024m
-Dhadoop.security.logger=ERROR,DRFAS
-Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
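
The duplicated flags (e.g. -Djava.net.preferIPv4Stack) appear because each nested wrapper script appends its own defaults to the JVM options. To capture such a command line from a running node yourself, something like the following sketch works; the pgrep pattern relies on the -Dproc_datanode flag shown above.

# A sketch: print the JVM arguments of a running DataNode, one per line.
ps -o args= -p "$(pgrep -f proc_datanode | head -n 1)" | tr ' ' '\n'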

HBase

Hive

Administration

Properties for authentication or authorization

  • hive.server2.authentication
    • default : NONE
    • options : NONE, LDAP, KERBEROS, CUSTOM, PAM, NOSASL
  • hive.security.authenticator.manager
    • default : org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator
    • options
      • org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator
      • org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
  • hive.server2.enable.doAs
    • default : true
    • options : true, false
  • hive.security.authorization.enabled
    • default : false
    • options : true, false
  • hive.security.authorization.manager
    • default : org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider
    • options
      • org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider
      • org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
      • org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory

Configure authorization mode

  • Default Hive Authorization
    • config
      • hive.security.authorization.enabled=true
      • hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider
      • hive.security.authorization.createtable.owner.grants=ALL
  • Storage Based Authorization in the Metastore Server
    • config
      • hive.security.authorization.enabled=true
      • hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
      • hive.security.metastore.authenticator.manager=org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator
      • hive.security.metastore.authorization.auth.reads=true
  • SQL Standards Based Authorization
    • config
      • hive.server2.enable.doAs=false
      • hive.users.in.admin.role=hive,hdfs
      • hive.security.authorization.enabled=true
      • hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
      • hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
      • hive.security.metastore.authorization.auth.reads=true
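
For example, to try SQL Standards Based Authorization without editing hive-site.xml, the properties can be overridden when starting HiveServer2 from a shell. A minimal sketch, not a production setup:

# A sketch: start a test HiveServer2 with SQL Standards Based Authorization.
hive --service hiveserver2 \
  --hiveconf hive.server2.enable.doAs=false \
  --hiveconf hive.users.in.admin.role=hive,hdfs \
  --hiveconf hive.security.authorization.enabled=true \
  --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory \
  --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator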

Connectivity

Session timeout

  • Hive 0.14 introduced configuration properties to control idle session and operation timeouts: hive.server2.idle.session.timeout, hive.server2.idle.operation.timeout, and hive.server2.session.check.interval. Their effective values can be checked from beeline, as in the sketch below.
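
A minimal check of the effective value from beeline; the host name is a placeholder:

# A sketch; <hive-server> is a placeholder for your HiveServer2 host.
beeline -u "jdbc:hive2://<hive-server>:10000/default" -n hive \
  -e 'SET hive.server2.idle.session.timeout;'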

Tips and Tricks

Get all configuration properties or each property respectively

To get all configuration properties defined at the Hive server, run the SET -v command.

hive> SET -v;

Outside the Hive shell, you can use the -e option:

# hive -e 'set -v'

To get a specific configuration property, run SET property-name as follows.

hive> SET hive.server2.authentication;

Collect JAR archives to set up a JDBC connection with HiveServer2

You can collect the necessary JAR archives with the following shell script, which uses Maven's dependency plugin.

#!/bin/bash
declare -a artifacts=(
   commons-configuration:commons-configuration:1.10
   commons-logging:commons-logging:1.2
   com.google.guava:guava:17.0
   org.apache.hadoop:hadoop-auth:2.4.1
   org.apache.hadoop:hadoop-common:2.4.1
   org.apache.hadoop:hadoop-hdfs:2.4.1
   org.apache.hadoop:hadoop-mapreduce-client-core:2.4.1
   org.spark-project.hive:hive-common:0.13.1
   org.spark-project.hive:hive-exec:0.13.1
   org.spark-project.hive:hive-jdbc:0.13.1
   org.spark-project.hive:hive-metastore:0.13.1
   org.spark-project.hive:hive-service:0.13.1
   org.spark-project.hive:hive-shims:0.13.1
   org.apache.thrift:libfb303:0.9.1
   org.apache.thrift:libthrift:0.9.1
   org.apache.httpcomponents:httpcore:4.3.3
   org.apache.httpcomponents:httpclient:4.3.6
   log4j:log4j:1.2.17
   org.slf4j:slf4j-api:1.7.6
   org.slf4j:slf4j-log4j12:1.7.6)

for item in "${artifacts[@]}"; do
   mvn org.apache.maven.plugins:maven-dependency-plugin:2.9:get \
   -Dartifact=$item \
   -DremoteRepositories=central2::::http://repo2.maven.org/maven2/ \
   -Dtransitive=false -Ddest=.
done

For Windows, you can use the following batch script instead of the above shell script. Note the call before mvn: without it, the Maven batch script would take over control after the first iteration.

@echo off
set artifacts=(commons-configuration:commons-configuration:1.10 ^
   commons-logging:commons-logging:1.2 ^
   com.google.guava:guava:17.0 ^
   org.apache.hadoop:hadoop-auth:2.4.1 ^
   org.apache.hadoop:hadoop-common:2.4.1 ^
   org.apache.hadoop:hadoop-hdfs:2.4.1 ^
   org.apache.hadoop:hadoop-mapreduce-client-core:2.4.1 ^
   org.spark-project.hive:hive-common:0.13.1 ^
   org.spark-project.hive:hive-exec:0.13.1 ^
   org.spark-project.hive:hive-jdbc:0.13.1 ^
   org.spark-project.hive:hive-metastore:0.13.1 ^
   org.spark-project.hive:hive-service:0.13.1 ^
   org.spark-project.hive:hive-shims:0.13.1 ^
   org.apache.thrift:libfb303:0.9.1 ^
   org.apache.thrift:libthrift:0.9.1 ^
   org.apache.httpcomponents:httpcore:4.3.3 ^
   org.apache.httpcomponents:httpclient:4.3.6 ^
   log4j:log4j:1.2.17 ^
   org.slf4j:slf4j-api:1.7.6 ^
   org.slf4j:slf4j-log4j12:1.7.6)

for %%i in %artifacts% do (
  call mvn org.apache.maven.plugins:maven-dependency-plugin:2.9:get ^
  -Dartifact=%%i ^
  -DremoteRepositories=central2::::http://repo2.maven.org/maven2/ ^
  -Dtransitive=false -Ddest=.
)

Pig

  • http://pig.apache.org/
  • Desc. : a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  • License : Apache License

Flume

  • https://cwiki.apache.org/FLUME/
  • Desc. : a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
  • License : Apache License
  • Written in : Java

Kafka

References

Readings

Kafka uses a very large number of files. At the same time, Kafka uses a large number of sockets to communicate with the clients. All of this requires a relatively high number of available file descriptors.
Sadly, many modern Linux distributions ship with a paltry 1,024 file descriptors allowed per process. This is far too low for even a small Kafka node, let alone one that hosts hundreds of partitions.
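
A sketch for checking and raising the limit on a broker host; the kafka user name and the chosen limit of 100000 are assumptions:

# Check the soft limit for the current shell.
ulimit -n
# Persist a higher limit for the broker user, e.g. in /etc/security/limits.conf:
#   kafka  soft  nofile  100000
#   kafka  hard  nofile  100000
# Verify against a running broker.
grep 'Max open files' "/proc/$(pgrep -f kafka.Kafka | head -n 1)/limits"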

Associates

kafka-docker

Kafka Manager

kafka-manager-docker

Sqoop

Sqoop Execution Sequence

  • As of Sqoop 1.4.4 in HDP 1.6


  1. user
    • executes /usr/bin/sqoop
  2. /usr/bin/sqoop
    • sets SQOOP_HOME
    • executes /usr/lib/sqoop/bin/sqoop
  3. /usr/lib/sqoop/bin/sqoop
    • executes /usr/lib/sqoop/bin/configure-sqoop
    • executes /usr/bin/hadoop with org.apache.sqoop.Sqoop
  4. /usr/lib/sqoop/bin/configure-sqoop
    • executes /etc/sqoop/conf.dist/sqoop-env.sh if exists
    • exports SQOOP_CLASSPATH, SQOOP_CONF_DIR, SQOOP_JAR_DIR, HADOOP_CLASSPATH, HBASE_HOME, HIVE_CONF_DIR, and others
  5. /etc/sqoop/conf.dist/sqoop-env.sh
    • exports HADOOP_HOME, HBASE_HOME, HIVE_HOME, SQOOP_USER_CLASSPATH, and others
  6. /usr/bin/hadoop
    • executes /usr/lib/hadoop/bin/hadoop
  7. /usr/lib/hadoop/bin/hadoop
    • executes /usr/lib/hadoop/libexec/hadoop-config.sh
    • sets HADOOP_OPTS
    • executes java
  8. /usr/lib/hadoop/libexec/hadoop-config.sh
    • executes /etc/hadoop/conf.empty/hadoop-env.sh
    • builds CLASSPATH
    • adds -Dhadoop.log.dir, -Dhadoop.log.file, -Dhadoop.home.dir, -Dhadoop.id.str, -Dhadoop.root.logger into HADOOP_OPTS
  9. /etc/hadoop/conf.empty/hadoop-env.sh
    • exports HADOOP_HEAPSIZE, HADOOP_NAMENODE_OPTS, HADOOP_CLIENT_OPTS, HADOOP_CLASSPATH, JAVA_LIBRARY_PATH, and others
    • sets HADOOP_JOBTRACKER_OPTS, HADOOP_TASKTRACKER_OPTS, and others


/usr/bin/sqoop ---> /usr/lib/sqoop/bin/sqoop
                       |
                       +--> /usr/lib/sqoop/bin/configure-sqoop ---> /etc/sqoop/conf.dist/sqoop-env.sh
                       |              
                       +--> /usr/bin/hadoop org.apache.sqoop.Sqoop
                              |
                              +--> /usr/lib/hadoop/bin/hadoop
                                      |
                                      +--> /usr/lib/hadoop/libexec/hadoop-config.sh ---> /etc/hadoop/conf.empty/hadoop-env.sh
                                      |
                                      +--> java org.apache.sqoop.Sqoop
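
To observe this chain on a live node, you can trace the top-level wrapper. A rough sketch, assuming the HDP-style paths above; bash -x traces only the outermost script, and the grep pattern is an assumption:

# Trace which scripts the sqoop wrapper sources and executes.
bash -x /usr/bin/sqoop version 2>&1 | grep -E '^\++ (exec|source|\.) '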

Tips and Tricks

Sqoop process command-line sample

/usr/java/default/bin/java -Xmx1024m
-Djava.net.preferIPv4Stack=true
-Dhadoop.log.dir=/var/log/hadoop/oozie
-Dhadoop.log.file=hadoop.log
-Dhadoop.home.dir=/usr/lib/hadoop
-Dhadoop.id.str=oozie
-Dhadoop.root.logger=INFO,console
-Djava.library.path=:/usr/lib/hadoop/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml
-Djava.net.preferIPv4Stack=true -Xmx1024m
-Dhadoop.root.logger=INFO,DRFA
-Dhadoop.log.dir=/var/log/hadoop/oozie
-Dhadoop.log.file=hadoop.log
-Dhadoop.security.logger=INFO,NullAppender org.apache.sqoop.Sqoop eval
--connect jdbc:hsqldb:hsql://bigdata01/sample
--username flight --password flight_passwd
--query delete from flight.delay_stat_by_carrier
--verbose

Customizing Sqoop logging

Sqoop eventually invokes Hadoop, so logging is delegated to Hadoop. Hadoop's logging is based on Log4j and can be configured on the command line through the Java system properties hadoop.root.logger, hadoop.log.dir, and hadoop.log.file. These system properties can be passed in via the HADOOP_CLIENT_OPTS environment variable, which the Hadoop client script picks up.

So you can run Sqoop with customized Log4j logging, as in the following example.


$ export HADOOP_CLIENT_OPTS="-Dhadoop.root.logger=INFO,DRFA -Dhadoop.log.dir=/var/log/hadoop/${USER} -Dhadoop.log.file=hadoop.log"
$ sqoop eval  --verbose \
> --connect jdbc:hsqldb:hsql://localhost/sample \
> --username flight --password flight_passwd \
> --query "delete from flight.delay_stat_by_carrier"


  • Hadoop's default log4j.properties file (/etc/hadoop/conf/log4j.properties) defines hadoop.root.logger, hadoop.log.dir, and hadoop.log.file, which can be overridden by system properties.
  • Hadoop's default log4j.properties also defines a few appenders named console, DRFA (DailyRollingFileAppender), DRFAS (DailyRollingFileAppender), DRFAAUDIT (DailyRollingFileAppender), RFA (RollingFileAppender), NullAppender, and others.
  • The HADOOP_CLIENT_OPTS variable is defined recursively in hadoop-env.sh and appended to HADOOP_OPTS in /usr/lib/hadoop/bin/hadoop.
    • In hadoop-env.sh
...
export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m $HADOOP_CLIENT_OPTS"
...
    • In /usr/lib/hadoop/bin/hadoop
...
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
...
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
...

Oozie

Tips and Tricks

Identify available action extensions and versions

  • The oozie.service.SchemaService.wf.ext.schemas property in the oozie-site.xml file (/etc/oozie/conf/oozie-site.xml) defines the available action extensions and versions of the current Oozie installation; it can be read as shown below.
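
A sketch for extracting the list, assuming xmllint is installed and the oozie-site.xml path above:

# Print one schema file name per line.
xmllint --xpath \
  "//property[name='oozie.service.SchemaService.wf.ext.schemas']/value/text()" \
  /etc/oozie/conf/oozie-site.xml | tr ',' '\n'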

Associates

VizOozie

ZooKeeper

Tips and Tricks

Batch script to run ZooInspector

Create the following run.bat file in the %ZOO_HOME%\contrib\ZooInspector directory.

@echo off
cd /D %~dp0
setLocal EnableDelayedExpansion
set CLASSPATH=
for %%G in ("%~dp0\*.jar" "%~dp0\..\..\*.jar" "%~dp0\..\..\lib\*.jar") do (
  set CLASSPATH=!CLASSPATH!;%%G
)

%JAVA_HOME%\bin\java -cp "%CLASSPATH%" org.apache.zookeeper.inspector.ZooInspector

With Java SE 6 or higher, you can use a much simpler script, since the JVM expands wildcard classpath entries by itself.

@echo off
cd /D %~dp0
setLocal EnableDelayedExpansion
set CLASSPATH=%~dp0\*;%~dp0\..\..\*;%~dp0\..\..\lib\*

%JAVA_HOME%\bin\java -cp "%CLASSPATH%" org.apache.zookeeper.inspector.ZooInspector

Using Logback instead of Log4j with ZooKeeper

ZooKeeper uses Log4j internally for logging. To replace Log4j with Logback:

  • Remove the Log4j-related JAR files from the classpath.
  • Place the slf4j-api.jar, logback-core.jar, logback-classic.jar, and log4j-over-slf4j.jar files on the classpath.
  • Add the -Dzookeeper.jmx.log4j.disable=true option to the JVM startup command.
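
A launch-command sketch under those instructions; the JAR versions and the conf directory are assumptions:

# Start a standalone ZooKeeper server logging through Logback.
java -cp "zookeeper-3.4.6.jar:slf4j-api-1.7.6.jar:log4j-over-slf4j-1.7.6.jar:logback-core-1.1.2.jar:logback-classic-1.1.2.jar:conf" \
  -Dzookeeper.jmx.log4j.disable=true \
  org.apache.zookeeper.server.quorum.QuorumPeerMain conf/zoo.cfg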

Associates

Exhibitor

Storm

Tips and Tricks

JVM arguments for worker process

The following arguments were captured with VisualVM. The memory, debugging, and JMX options are specified via worker.childopts in storm.yaml; the others are added by the supervisor.

-Xmx256m
-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=26711
-Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.port=16711
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dlog.level.root=INFO
-Djava.library.path=storm-local\supervisor\stormdist\CalcTopology-16-1440641838\resources\Windows_Server_2012_R2-x86;storm-local\supervisor\stormdist\CalcTopology-16-1440641838\resources;/usr/local/lib:/opt/local/lib:/usr/lib
-Dlogfile.name=CalcTopology-16-1440641838-worker-6711.log
-Dstorm.home=C:\opt\apache-storm-0.10.0-beta1
-Dstorm.conf.file=
-Dstorm.options=
-Dstorm.log.dir=C:\opt\apache-storm-0.10.0-beta1\logs
-Dlogging.sensitivity=S3
-Dlog4j.configurationFile=C:\opt\apache-storm-0.10.0-beta1\log4j2\worker.xml
-Dstorm.id=CalcTopology-16-1440641838
-Dworker.id=0d3f826d-7550-458c-8c0a-dd716f5c0ec8
-Dworker.port=6711
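
As an illustration, the memory and JMX options above could come from a worker.childopts entry like the following sketch; the values are assumptions mirroring the dump, and Storm substitutes %ID% with the worker port (so worker port 6711 yields JMX port 16711):

# A sketch; appends a worker.childopts entry to storm.yaml.
cat >> conf/storm.yaml <<'EOF'
worker.childopts: "-Xmx256m -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=1%ID% -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
EOF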

R

  • http://www.r-project.org/
  • Desc. : a free programming language and software environment for statistical computing and graphics.
  • License : GPL-2
  • Written in : C

R Packages

Associates

RHadoop

  • References

StatET

Ambari

Associates

Ambari Shell

Hue

Prototyping

Airline on-time performance

Guideline

  • Naming Convention
    • Hive
      • database name :=
      • table name := [a-z][_a-z0-9]{2,14}
      • column name := [a-z][_a-z0-9]{0,29}
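
A quick sanity check of a candidate name against the table-name pattern, as a bash sketch:

# Validate a Hive table name against the convention above.
name="flight_by_year"
if [[ "$name" =~ ^[a-z][_a-z0-9]{2,14}$ ]]; then
  echo "ok: $name"
else
  echo "invalid: $name" >&2
fi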

Environment

  • OS account
    • hdfs, mapred, hbase, hive, sqoop, oozie
  • OS groups
    • hadoop : hdfs, mapred, hbase, hive, sqoop, oozie
    • hdfs : hive
  • Raw data on OS file system
    • 644 hdfs hdfs /home/hdfs/samples/flight/rawdata
  • Raw data on Hadoop
    • base directory : 775 hdfs hdfs /samples/flight/raw/
    • 775 hdfs hdfs /samples/flight/raw/flights/data.csv
    • 775 hdfs hdfs /samples/flight/raw/airports/data.csv
    • 775 hdfs hdfs /samples/flight/raw/carriers/data.csv
    • 775 hdfs hdfs /samples/flight/raw/planes/data.csv
  • Processed data on Hadoop
    • base directory : 755 hdfs hdfs /samples/flight/
  • Hive database and tables
    • database : flight
    • tables
      • flight (external for /samples/flight/raw/flights/)
      • airport (external for /samples/flight/raw/airports/)
      • carrier (external for /samples/flight/raw/carriers/)
      • plane (external for /samples/flight/raw/planes/)
  • JDBC connection
    • Driver : org.apache.hive.jdbc.HiveDriver
    • URL : jdbc:hive2://_ip-for-hive-server_:_port-for-hive-server_/default
    • User : hive
    • Password : _password-for-hive-user_
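
The JDBC settings above map to a beeline invocation like the following sketch; the bracketed values stand for the placeholders in the list:

# A sketch; replace the bracketed placeholders with your cluster's values.
beeline -u "jdbc:hive2://<ip-for-hive-server>:<port-for-hive-server>/default" \
  -d org.apache.hive.jdbc.HiveDriver -n hive -p "<password-for-hive-user>"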

Scripts

Data ingress into HDFS

su - hdfs -c 'for i in {1987..2008}; do wget -nd -nH -P ~/samples/flight/rawdata http://stat-computing.org/dataexpo/2009/$i.csv.bz2; done'
su - hdfs -c "wget -nd -nH -P ~/samples/flight/rawdata http://stat-computing.org/dataexpo/2009/airports.csv"
su - hdfs -c "wget -nd -nH -P ~/samples/flight/rawdata http://stat-computing.org/dataexpo/2009/carriers.csv"
su - hdfs -c "wget -nd -nH -P ~/samples/flight/rawdata http://stat-computing.org/dataexpo/2009/plane-data.csv"

su - hdfs -c "bzip2 -d ~/samples/flight/rawdata/*.csv.bz2"

# create each year's file in HDFS
su - hdfs -c "hdfs dfs -mkdir -p /samples/flight/raw/"
for i in {1987..2008}; do (su - hdfs -c "sed '1d' ~/samples/flight/rawdata/$i.csv | hdfs dfs -put - /samples/flight/raw/$i.csv"); done

for i in {1987..2008}; do (su - hdfs -c "hdfs dfs -cat /samples/flight/raw/$i.csv | wc -l; wc -l ~/samples/flight/rawdata/$i.csv"); done

# create all-in-one file in HDFS
su - hdfs -c 'sed "1d" ~/samples/flight/rawdata/1987.csv | hdfs dfs -put - /samples/flight/raw/flights.csv'
su - hdfs -c 'for i in {1988..2008}; do (sed "1d" ~/samples/flight/rawdata/$i.csv | hdfs dfs -appendToFile - /samples/flight/raw/flights.csv); done'

su - hdfs -c 'hdfs dfs -cat /samples/flight/raw/flights.csv | wc -l'
su - hdfs -c 'wc -l ~/samples/flight/rawdata/19??.csv ~/samples/flight/rawdata/20??.csv | tail -n 1'

# remove double quotation marks around string data in the files for airports and carriers
su - hdfs -c 'sed -i "s/\"//g" ~/samples/flight/rawdata/airports.csv'
su - hdfs -c 'sed -i "s/\"//g" ~/samples/flight/rawdata/carriers.csv'

# create files for airports, carriers, planes
su - hdfs -c 'sed "1d" ~/samples/flight/rawdata/airports.csv | hdfs dfs -put - /samples/flight/raw/airports/data.csv'
hdfs dfs -cat /samples/flight/raw/airports/data.csv | wc -l   # 3376 is expected

su - hdfs -c 'sed "1d" ~/samples/flight/rawdata/carriers.csv | hdfs dfs -put - /samples/flight/raw/carriers/data.csv'
hdfs dfs -cat /samples/flight/raw/carriers/data.csv | wc -l   # 1491 is expected

su - hdfs -c 'sed "1d" ~/samples/flight/rawdata/plane-data.csv | hdfs dfs -put - /samples/flight/raw/planes/data.csv'
hdfs dfs -cat /samples/flight/raw/planes/data.csv | wc -l   # 5029 is expected

# change permissions for samples
su - hdfs -c 'hdfs dfs -chmod -R 775 /samples/flight/raw'
Hive schema and initial data for the tables
-- check current Hive properties
SET -v;

-- create the database
CREATE DATABASE IF NOT EXISTS flight;

-- create the tables
CREATE EXTERNAL TABLE IF NOT EXISTS flight.flight(
   year INT COMMENT '1987 ~ 2008',
   month INT COMMENT '1 ~ 12',
   day_of_month INT COMMENT '1 ~ 31',
   day_of_week INT COMMENT '1(Monday) ~ 7(Sunday)',
   dep_time INT COMMENT 'actual departure time (local, hhmm)',
   crs_dep_time INT COMMENT 'scheduled departure time (local, hhmm)',
   arr_time INT COMMENT 'actual arrival time (local, hhmm)',
   crs_arr_time INT COMMENT 'scheduled arrival time (local, hhmm)',
   unique_carrier STRING COMMENT 'unique carrier code',
   flight_num INT COMMENT 'flight number',
   tail_num STRING COMMENT 'plane tail number',
   actual_elapsed INT COMMENT 'actual elapsed time in minutes',
   crs_elapsed INT COMMENT 'CRS elapsed time in minutes',
   air_time INT,
   arr_delay INT COMMENT 'arrival delay in minutes',
   dep_delay INT COMMENT 'departure delay in minutes',
   origin STRING COMMENT 'origin IATA airport code',
   dest STRING COMMENT 'destination IATA airport code',
   distance INT COMMENT 'in miles',
   taxi_in INT COMMENT 'taxi-in time in minutes',
   taxi_out INT COMMENT 'taxi-out time in minutes',
   cancelled INT COMMENT '1 = yes, 0 = no',
   cancellation_code STRING COMMENT 'A = carrier, B = weather, C = NAS, D = security',
   diverted INT COMMENT '1 = yes, 0 = no',
   carrier_delay STRING COMMENT 'number in minutes or NA',
   weather_delay STRING COMMENT 'number in minutes or NA',
   nas_delay STRING COMMENT 'number in minutes or NA',
   security_delay STRING COMMENT 'number in minutes or NA',
   late_aircraft_delay STRING COMMENT 'number in minutes or NA')
COMMENT 'The data consists of flight arrival and departure details for all commercial flights within the USA, from Oct. 1987 to Apr. 2008'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/samples/flight/raw/flights/';

CREATE EXTERNAL TABLE IF NOT EXISTS flight.airport(
   iata STRING COMMENT 'the international airport abbreviation code',
   name STRING COMMENT 'name of the airport',
   city STRING COMMENT 'city in which the airport is located',
   state STRING COMMENT 'state in which the airport is located',
   country STRING COMMENT 'country in which the airport is located',
   latitude DOUBLE COMMENT 'the latitude of the airport',
   longitude DOUBLE COMMENT 'the longitude of the airport')
COMMENT 'The locations of US airports'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/samples/flight/raw/airports/';

CREATE EXTERNAL TABLE IF NOT EXISTS flight.carrier(
   code STRING COMMENT 'the carrier code',
   `desc` STRING COMMENT 'description for the carrier')
COMMENT 'Carrier codes with full names'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/samples/flight/raw/carriers/';

CREATE EXTERNAL TABLE IF NOT EXISTS flight.plane(
   tailnum STRING COMMENT 'the registration number of the plane, consisting of a leading \'N\' and the N-number',
   type STRING COMMENT 'one of Co-Owner, Corporation, Foreign Corporation, Individual, Co-Owned, Partnership',
   manufacturer STRING COMMENT 'name of the aircraft manufacturer',
   issue_date STRING COMMENT 'certificate issue date in MM/DD/YYYY format',
   model STRING COMMENT 'name of the aircraft model',
   status STRING,
   aircraft_type STRING COMMENT 'type of aircraft : Balloon, Fixed Wing Multi-Engine, Fixed Wing Single-Engine, Rotorcraft, ...',
   engine_type STRING COMMENT 'type of engine : 4 Cycle, None, Reciprocating, Turbo-Fan, Turbo-Jet, Turbo-Prop, Turbo-Shaft, ...',
   year INT COMMENT 'year manufactured (not sure)')
COMMENT 'Details for airplanes. For more on the schema, refer to http://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/media/ardata.pdf'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/samples/flight/raw/planes/';

-- create a year-partitioned table for the flight data,
-- de-normalizing some columns from the carrier, plane, and airport tables
CREATE TABLE IF NOT EXISTS flight.flight_by_year(
   month INT COMMENT '1 ~ 12',
   day_of_month INT COMMENT '1 ~ 31',
   day_of_week INT COMMENT '1(Monday) ~ 7(Sunday)',
   dep_time INT COMMENT 'actual departure time (local, hhmm)',
   crs_dep_time INT COMMENT 'scheduled departure time (local, hhmm)',
   arr_time INT COMMENT 'actual arrival time (local, hhmm)',
   crs_arr_time INT COMMENT 'scheduled arrival time (local, hhmm)',
   unique_carrier STRING COMMENT 'unique carrier code',
   carrier_name STRING COMMENT 'name of the carrier',
   flight_num INT COMMENT 'flight number',
   tail_num STRING COMMENT 'registration number of the plane',
   plane_mfr STRING COMMENT 'manufacturer of the plane',
   plane_model STRING COMMENT 'name of the aircraft model',
   plane_type STRING COMMENT 'type of aircraft : Balloon, Fixed Wing Multi-Engine, Fixed Wing Single-Engine, Rotorcraft, ...',
   plane_issue_date STRING COMMENT 'certificate issue date of the plane in MM/DD/YYYY format',
   plane_year INT COMMENT 'manufactured year of the plane (not sure)',
   actual_elapsed INT COMMENT 'actual elapsed time in minutes',
   crs_elapsed INT COMMENT 'CRS elapsed time in minutes',
   air_time INT,
   arr_delay INT COMMENT 'arrival delay in minutes',
   dep_delay INT COMMENT 'departure delay in minutes',
   origin STRING COMMENT 'origin IATA airport code',
   origin_city STRING COMMENT 'city of the origin airport',
   origin_state STRING COMMENT 'state of the origin airport',
   origin_country STRING COMMENT 'country of the origin airport',
   origin_lat STRING COMMENT 'latitude of the origin airport',
   origin_long STRING COMMENT 'longitude of the origin airport',
   dest STRING COMMENT 'destination IATA airport code',
   dest_city STRING COMMENT 'city of the destination airport',
   dest_state STRING COMMENT 'state of the destination airport',
   dest_country STRING COMMENT 'country of the destination airport',
   dest_lat STRING COMMENT 'latitude of the destination airport',
   dest_long STRING COMMENT 'longitude of the destination airport',
   distance INT COMMENT 'in miles',
   taxi_in INT COMMENT 'taxi-in time in minutes',
   taxi_out INT COMMENT 'taxi-out time in minutes',
   cancelled INT COMMENT '1 = yes, 0 = no',
   cancellation_code STRING COMMENT 'A = carrier, B = weather, C = NAS, D = security',
   diverted INT COMMENT '1 = yes, 0 = no',
   carrier_delay STRING COMMENT 'number in minutes or NA',
   weather_delay STRING COMMENT 'number in minutes or NA',
   nas_delay STRING COMMENT 'number in minutes or NA',
   security_delay STRING COMMENT 'number in minutes or NA',
   late_aircraft_delay STRING COMMENT 'number in minutes or NA')
COMMENT 'Data for flight arrival and departure details partitioned by year of the flight'
PARTITIONED BY (year INT COMMENT '1987 ~ 2008')
STORED AS TEXTFILE;

-- check the integrity of tail numbers between flight.flight and flight.plane
-- 0 is expected if the values were consistent, but the actual count is not zero.
SELECT /*+ MAPJOIN(b) */ COUNT(*)
FROM flight.flight a
WHERE NOT EXISTS (SELECT b.tailnum FROM flight.plane b WHERE b.tailnum = a.tail_num);

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE flight.flight_by_year
PARTITION (year)
-- Hive 0.13 and higher also supports the implicit (comma-separated) join notation,
-- but explicit LEFT OUTER JOINs are used here.
SELECT /*+ MAPJOIN(b, c, d, e) */ a.month,
   a.day_of_month,
   a.day_of_week,
   a.dep_time,
   a.crs_dep_time,
   a.arr_time,
   a.crs_arr_time,
   a.unique_carrier,
   b.`desc`,
   a.flight_num,
   a.tail_num,
   c.manufacturer,
   c.model,
   c.aircraft_type,
   c.issue_date,
   c.year,
   a.actual_elapsed,
   a.crs_elapsed,
   a.air_time,
   a.arr_delay,
   a.dep_delay,
   a.origin,
   d.city,
   d.state,
   d.country,
   d.latitude,
   d.longitude,
   a.dest,
   e.city,
   e.state,
   e.country,
   e.latitude,
   e.longitude,
   a.distance,
   a.taxi_in,
   a.taxi_out,
   a.cancelled,
   a.cancellation_code,
   a.diverted,
   a.carrier_delay,
   a.weather_delay,
   a.nas_delay,
   a.security_delay,
   a.late_aircraft_delay,
   a.year
FROM flight.flight a LEFT OUTER JOIN flight.carrier b ON (a.unique_carrier = b.code)
   LEFT OUTER JOIN flight.plane c ON (a.tail_num = c.tailnum)
   LEFT OUTER JOIN flight.airport d ON (a.origin = d.iata)
   LEFT OUTER JOIN flight.airport e ON (a.dest = e.iata);

SET hive.execution.engine = tez;
SELECT a.year, a.month, a.carrier_name, COUNT(a.flight_num) AS flights,
   COUNT(IF(a.dep_delay > 0, TRUE, NULL)) AS dep_delay_cnt,
   AVG(a.dep_delay) AS dep_delay_avg,
   MAX(a.dep_delay) AS dep_delay_max,
   COUNT(IF(a.arr_delay > 0, TRUE, NULL)) AS arr_delay_cnt,
   AVG(a.arr_delay) AS arr_delay_avg,
   MAX(a.arr_delay) AS arr_delay_max
FROM flight.flight_by_year a
WHERE a.cancelled = 0   -- not cancelled
   AND a.diverted = 0   -- not diverted
   AND a.dep_delay IS NOT NULL
   AND a.arr_delay IS NOT NULL
GROUP BY a.year, a.month, a.carrier_name;
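
Assuming the statements above are saved to a file (the name flight_schema.hql is an assumption), they can be run non-interactively and the resulting partitions verified:

# Run the schema and load statements in batch mode.
hive -f flight_schema.hql

# Verify that dynamic partitioning produced one partition per year.
hive -e 'SHOW PARTITIONS flight.flight_by_year;'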