This project has retired. For details please refer to its Attic page.

Data Model

Apache Chukwa Adaptors emit data in Chunks. A Chunk is a sequence of bytes with some associated metadata. Several of the metadata fields are set automatically by the Agent or the Adaptor. Two require user intervention: cluster name and datatype. The cluster name is specified in conf/chukwa-agent-conf.xml and is global to each Agent process. The datatype describes the expected format of the data collected by an Adaptor instance and is specified when that instance is started.

The following table lists the Chunk metadata fields:

Field | Meaning | Source
Source | Hostname where Chunk was generated | Automatic
Cluster | Cluster host is associated with | Specified by user in agent config
Datatype | Format of output | Specified by user when Adaptor started
Sequence ID | Offset of Chunk in stream | Automatic, initial offset specified when Adaptor started
Name | Name of data source | Automatic, chosen by Adaptor
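
As an illustration, the fields in the table can be modelled as a small record. This is a Python sketch, not Chukwa's own Java Chunk class: the class name `ChunkSketch`, its attribute names, and all sample values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSketch:
    """Hypothetical model of a Chunk: payload bytes plus the metadata fields."""
    source: str    # hostname where the Chunk was generated (automatic)
    cluster: str   # cluster name from the agent config (user-specified)
    datatype: str  # expected data format (user-specified at Adaptor start)
    seq_id: int    # stream offset just past the last byte of this Chunk
    name: str      # name of the data source, chosen by the Adaptor
    data: bytes    # the payload itself


# Made-up example values, for illustration only.
chunk = ChunkSketch(
    source="host1.example.com",
    cluster="demo",
    datatype="SysLog",
    seq_id=100,
    name="/var/log/messages",
    data=b"x" * 100,
)
```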

Conceptually, each Adaptor emits a semi-infinite stream of bytes, numbered starting from zero. The sequence ID of a Chunk is the total number of bytes the Adaptor has sent so far, including the current Chunk. So if an Adaptor emits a Chunk containing the first 100 bytes of a file, the sequence ID of that Chunk will be 100, and the Chunk containing the second hundred bytes will have sequence ID 200. This may seem a little peculiar, but it resembles TCP, where sequence numbers also count bytes in the stream rather than individual packets.
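
The arithmetic can be sketched in a few lines of Python. The function name `sequence_ids` is hypothetical and is not part of Chukwa; it only reproduces the numbering rule described above.

```python
from typing import Iterable, List


def sequence_ids(chunk_sizes: Iterable[int], initial_offset: int = 0) -> List[int]:
    """Return the sequence ID of each chunk, given the chunk sizes in bytes.

    Each ID is the running byte count including the chunk itself.
    """
    ids = []
    offset = initial_offset
    for size in chunk_sizes:
        offset += size      # count the bytes of the current chunk
        ids.append(offset)  # the ID points just past the chunk's last byte
    return ids


# Two chunks of 100 bytes each, starting from offset zero.
print(sequence_ids([100, 100]))  # [100, 200]
```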

Adaptors take a sequence ID as a parameter so that they can resume correctly after a crash without sending redundant data. When starting Adaptors, it's usually safe to specify 0 as the ID, but it's sometimes useful to specify something else. For instance, a nonzero starting ID lets you tail only the second half of a file.
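
A minimal sketch of resuming from a given offset, assuming a seekable byte stream. The generator `read_chunks` and its parameters are hypothetical and only illustrate the idea; real Adaptors are Java classes with their own interfaces.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def read_chunks(stream: BinaryIO, start_offset: int = 0,
                chunk_size: int = 100) -> Iterator[Tuple[int, bytes]]:
    """Yield (sequence_id, data) pairs, skipping bytes before start_offset."""
    stream.seek(start_offset)  # bytes before this offset were already sent
    offset = start_offset
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        offset += len(data)    # the sequence ID includes the current chunk
        yield offset, data


# A 200-byte "file"; resuming at offset 100 emits only the second half.
log = io.BytesIO(b"a" * 100 + b"b" * 100)
resumed = list(read_chunks(log, start_offset=100))
```

Here `resumed` holds a single chunk with sequence ID 200, so the first hundred bytes are never sent again.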

HBase Schema

Metrics

The chukwa table stores time series data.

Row Key

Field | Day | Metric MD5 | Source MD5
Size (bytes) | 2 | 6 | 6

The row key is 14 bytes long. The first 2 bytes are the day of the year. The next 6 bytes are the MD5 signature of the metric name. The last 6 bytes are the MD5 signature of the data source. This arrangement helps Apache Chukwa partition data evenly across regions based on time.

This arrangement provides a good condensed store for data of the same day for the same source.
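
The 2 + 6 + 6 byte layout can be sketched as follows. The text does not say how the day is encoded or which 6 bytes of the 16-byte MD5 digest are kept, so this sketch assumes a big-endian unsigned 16-bit day of year, UTF-8 encoded names, and the first 6 digest bytes. The function `build_row_key` and the sample metric and host names are hypothetical, and the real encoding in Chukwa may differ.

```python
import hashlib
import struct
from datetime import date


def build_row_key(day: date, metric: str, source: str) -> bytes:
    """Build a 14-byte key: 2-byte day + 6-byte metric hash + 6-byte source hash."""
    day_of_year = day.timetuple().tm_yday
    # Assumption: day of year stored as a big-endian unsigned 16-bit integer.
    day_part = struct.pack(">H", day_of_year)
    # Assumption: keep the first 6 bytes of each 16-byte MD5 digest.
    metric_part = hashlib.md5(metric.encode("utf-8")).digest()[:6]
    source_part = hashlib.md5(source.encode("utf-8")).digest()[:6]
    return day_part + metric_part + source_part


# Made-up metric and host names, for illustration only.
key = build_row_key(date(2015, 2, 1), "SystemMetrics.cpu.user", "host1.example.com")
```

Because the day comes first, all keys for one day share a 2-byte prefix, and the fixed-width hashes keep every key at exactly 14 bytes regardless of name length.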

Column Family

The chukwa table has the following column families:

Column Family | Description
t | Time series data. The column name is the timestamp; the value is a string.
a | Annotations: string tags associated with the time series data.

Metadata

The chukwa_metadata table is designed to store point lookup data, for example the small amount of data that describes the metric name mapping for the chukwa table. It is also used to store JSON blobs of dashboard data.

Row Key

Row Key | Description
[Metrics Group] | Metrics group name. Loading this row key fetches all metric names in the group.
chart_meta | All charts are stored in this row.
dashboard_meta | All dashboards are stored in this row.
widget_meta | All widgets are stored in this row.

Special Row

chart_meta | Cell contains the rendering options and metric series name in a JSON blob
dashboard_meta | Cell describes one dashboard view
widget_meta | Cell describes the title and URL of a dashboard widget

Column Family

Column Family | Description
k | Key. Holds a fixed structure describing the key type and the MD5 signature of the key used in the chukwa table.
c | Column for storing the JSON blobs of the special rows. It holds dashboard, chart, and widget metadata.

The k column family currently supports the following key types:

Type | Description
metric | This key is a metric name.
source | This key is a source name.