Apache Chukwa Programmers Guide
At the core of Apache Chukwa is a flexible system for collecting and processing monitoring data, particularly log files. This document describes how to use the collected data. (For an overview of the Apache Chukwa data model and collection pipeline, see the Design Guide.)
In particular, this document discusses the Apache Chukwa archive file formats, the demux and archiving MapReduce jobs, and the layout of Apache Chukwa storage directories.
Agent REST API
The Apache Chukwa Agent offers a programmable API for controlling Agent adaptors, which collect data from remote sources or set up a listening port for an incoming data stream. A usage guide and examples are documented in the Agent REST API doc.
Demux
A key use for Apache Chukwa is processing arriving data in parallel. The most common way to do this is with the Apache Chukwa Demux framework. As data flows through Chukwa, the demux parsers are often the first user-defined functions to process it.
By default, Apache Chukwa uses TsProcessor. This parser tries to extract the timestamp of the real log statement from the log entry using the ISO 8601 date format. If that fails, it uses the time at which the chunk was written to disk (agent timestamp).
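The parse-then-fall-back behavior described above can be sketched as follows. This is an illustration only, not the TsProcessor source: the class name, the method name, the date pattern "yyyy-MM-dd HH:mm:ss,SSS", the UTC time zone, and the assumption that the timestamp sits at the very start of the entry are all hypothetical choices made for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical sketch; not part of the Apache Chukwa code base.
class TimestampFallbackSketch {
    // Assumed ISO 8601-style layout at the start of each log entry.
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss,SSS";
    // Number of characters a timestamp in that layout occupies.
    private static final int PREFIX_LENGTH = 23;

    // Returns the timestamp parsed from the entry, or fallbackMillis
    // (standing in for the agent timestamp) when parsing fails.
    static long resolveTimestamp(String logEntry, long fallbackMillis) {
        if (logEntry == null || logEntry.length() < PREFIX_LENGTH) {
            return fallbackMillis;
        }
        SimpleDateFormat format = new SimpleDateFormat(PATTERN);
        format.setLenient(false); // reject impossible dates such as month 13
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption, for determinism
        try {
            return format.parse(logEntry.substring(0, PREFIX_LENGTH)).getTime();
        } catch (ParseException e) {
            return fallbackMillis;
        }
    }

    public static void main(String[] args) {
        long agentTimestamp = System.currentTimeMillis();
        System.out.println(resolveTimestamp("2024-01-15 10:30:00,123 INFO started", agentTimestamp));
        System.out.println(resolveTimestamp("entry without a leading timestamp", agentTimestamp));
    }
}
```

The first call prints the time parsed from the entry; the second prints the fallback value, because the entry does not begin with a parsable date.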
Demux Data To HBase
Demux parsers are configured in $CHUKWA_HOME/etc/chukwa/chukwa-demux-conf.xml. See the Pipeline configuration guide. HBaseWriter is not a real MapReduce job. It is designed to reuse Demux parsers for extraction and transformation.
There are some limitations to consider before implementing a Demux parser that loads data into HBase. In a MapReduce job, multiple values can be merged and grouped into a key/value pair during the shuffle/combine and merge phases. Demux in HBaseWriter does not support this kind of aggregation, because the data is not merged in memory but sent directly to HBase. HBase takes on the role of merging values into a record by primary key. Therefore, HBaseWriter does not invoke the Demux reducer parser.
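To make the "HBase merges values by primary key" point concrete, here is a minimal in-memory sketch. It does not use the HBase client API: the class name, the nested map standing in for a table, and the row key and column names are all hypothetical. It only shows that independent writes sharing a row key end up in one record without any reducer step.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a table: row key -> (column -> value).
// This is not the HBase client API.
class RowKeyMergeSketch {
    private final Map<String, Map<String, String>> rows = new TreeMap<>();

    // Each put is independent; values that share a row key accumulate
    // in the same record, which is the merging role the store plays.
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Returns the merged record for a row key, or an empty map if absent.
    Map<String, String> get(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    public static void main(String[] args) {
        RowKeyMergeSketch table = new RowKeyMergeSketch();
        // Two parser outputs emitted separately for the same (made-up) row key.
        table.put("1700000000000-host1", "cpu:user", "12.5");
        table.put("1700000000000-host1", "cpu:system", "3.1");
        System.out.println(table.get("1700000000000-host1"));
    }
}
```

Because the store performs the merge, a parser written for this path should emit self-contained records rather than rely on a later reduce phase to combine them.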
To write a demux parser that works with HBaseWriter, two pieces of information must be encoded in the parser. First, the name of the HBase table that stores the data, which is encoded in the Demux parser by annotation. Second, the name of the column family that stores the data, which is encoded in the ReducerType of the Demux Reducer parser.
Example of Demux mapper parser
@Tables(annotations={
  @Table(name="SystemMetrics",columnFamily="cpu")
})
public class SystemMetrics extends AbstractProcessor {
  @Override
  protected void parse(String recordEntry,
      OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
      Reporter reporter) throws Throwable {
    ...
    // The last argument is the reduce type, which names the column family.
    buildGenericRecord(record, null, cal.getTimeInMillis(), "cpu");
    output.collect(key, record);
  }
}
In this example, the data collected by the SystemMetrics parser is stored in the "SystemMetrics" HBase table, in the "cpu" column family.
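For readers unfamiliar with annotation-driven configuration, the sketch below shows how a writer could, in principle, discover the table name and column family from a parser class through reflection. The Table and Tables annotations declared here are local stand-ins that mirror the attribute names used in the example above; they are not the Chukwa annotation classes, and the describe helper is hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical sketch; the annotations below are local stand-ins,
// not the annotation types shipped with Apache Chukwa.
class TableAnnotationSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Table {
        String name();
        String columnFamily();
    }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Tables {
        Table[] annotations();
    }

    // Placeholder parser class carrying the same annotation values as the example.
    @Tables(annotations = { @Table(name = "SystemMetrics", columnFamily = "cpu") })
    static class SystemMetricsParser { }

    // Returns "table:columnFamily" pairs joined by commas,
    // or an empty string when the class is not annotated.
    static String describe(Class<?> parserClass) {
        Tables tables = parserClass.getAnnotation(Tables.class);
        if (tables == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Table t : tables.annotations()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(t.name()).append(':').append(t.columnFamily());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(SystemMetricsParser.class)); // prints "SystemMetrics:cpu"
    }
}
```

The runtime retention policy is what makes the annotation values visible to reflection; with the default class-level retention, getAnnotation would return null.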
HICC REST API
The HICC visualization API offers a simple API for composing dashboards and charting widgets. The data visualization API lets end users interact with data in its final product format; it is designed to display and summarize data for human interaction. A HICC usage guide and examples are documented in the HICC REST API doc.