Apache Chukwa -

Data Model

Apache Chukwa Adaptors emit data in Chunks. A Chunk is a sequence of bytes, with some metadata. Several of these are set automatically by the Agent or Adaptors. Two of them require user intervention: cluster name and datatype. Cluster name is specified in conf/chukwa-agent-conf.xml, and is global to each Agent process. Datatype describes the expected format of the data collected by an Adaptor instance, and it is specified when that instance is started.

The following table lists the Chunk metadata fields:

Field	Meaning	Source
Source	Hostname where Chunk was generated	Automatic
Cluster	Cluster host is associated with	Specified by user in agent config
Datatype	Format of output	Specified by user when Adaptor started
Sequence ID	Offset of Chunk in stream	Automatic, initial offset specified when Adaptor started
Name	Name of data source	Automatic, chosen by Adaptor

Conceptually, each Adaptor emits a semi-infinite stream of bytes, numbered starting from zero. The sequence ID specifies how many bytes each Adaptor has sent, including the current chunk. So if an adaptor emits a chunk containing the first 100 bytes from a file, the sequenceID of that Chunk will be 100. And the second hundred bytes will have sequence ID 200. This may seem a little peculiar, but it's actually the same way that TCP sequence numbers work.

Adaptors need to take sequence ID as a parameter so that they can resume correctly after a crash, and not send redundant data. When starting adaptors, it's usually save to specify 0 as an ID, but it's sometimes useful to specify something else. For instance, it lets you do things like only tail the second half of a file.

HBase Schema

Metrics

chukwa table stores time series data.

Row Key

	Day	Metric MD5	Source MD5
Size	2	6	6

Row key is composed of 14 bytes data. First 2 bytes are day of the year. The next 6 bytes are md5 signature of metrics name. The last 6 bytes are md5 signature of data source. This arrangement helps Apache Chukwa to partition data evenly across regions base on time.

This arrangement provides a good condensed store for data of the same day for the same source.

Column Family

The column family format for Apache Chukwa table are:

Column Family	Description
t	Time series data. Column name is timestamp. Value is a string
a	Annotation, string tags associated with time series data.

Metadata

chukwa_metadata table is designed to store point lookup data. For example, small amount of data to describe the metric name mapping for chukwa table. It is also used to store JSON blob of dashboard data.

Row Key

Row Key	Description
[Metrics Group]	Metrics Group Name, this allows to fetch all metrics name from the group can be fetched from loading the row key.
chart_meta	All charts are stored in this row.
dashboard_meta	All dashboard are stored in this row.
widget_meta	All widgets are stored in this row.

Special Row

chart_meta	Cell contains the rendering option and metric series name in a JSON blob
dashboard_meta	Cell describes one dashboard view
widget_meta	Cell describes title and URL of a dashboard widget

Column Family

Column Family	Description
k	Key, associated with a fixed structure for describing key types and md5 signature of the key used in chukwa table.
c	column for storing JSON blob for special rows. This column is used to store dashboard, chart, and widget metadata.

Key Types for k column Family, the current supported key types are:

Type	Description
metric	This key is a metric name.
source	This key is a source name.

Table of Contents

Miscellaneous

Data Model

HBase Schema

Metrics

Row Key

Column Family

Metadata

Row Key

Special Row

Column Family