Data Model
Apache Chukwa Adaptors emit data in Chunks. A Chunk is a sequence of bytes with some metadata. Several of the metadata fields are set automatically by the Agent or Adaptors. Two of them require user intervention: cluster name and datatype. Cluster name is specified in conf/chukwa-agent-conf.xml and is global to each Agent process. Datatype describes the expected format of the data collected by an Adaptor instance, and it is specified when that instance is started.
The following table lists the Chunk metadata fields:
Field | Meaning | Source |
Source | Hostname where Chunk was generated | Automatic |
Cluster | Cluster host is associated with | Specified by user in agent config |
Datatype | Format of output | Specified by user when Adaptor started |
Sequence ID | Offset of Chunk in stream | Automatic, initial offset specified when Adaptor started |
Name | Name of data source | Automatic, chosen by Adaptor |
Conceptually, each Adaptor emits a semi-infinite stream of bytes, numbered starting from zero. The sequence ID specifies how many bytes the Adaptor has sent, including the current Chunk. So if an Adaptor emits a Chunk containing the first 100 bytes from a file, the sequence ID of that Chunk will be 100, and the Chunk carrying the second hundred bytes will have sequence ID 200. This may seem a little peculiar, but it is similar to TCP, where sequence numbers count bytes rather than packets.
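The sequence ID arithmetic described above can be sketched in a few lines of Python. This is an illustration only, not Chukwa's Adaptor code; the function name `chunk_stream` and its parameters are hypothetical.

```python
from typing import Iterator, Tuple


def chunk_stream(data: bytes, chunk_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (sequence_id, payload) pairs for a byte stream.

    The sequence ID is the total number of bytes emitted so far,
    including the payload it is attached to.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    sent = 0  # bytes emitted before the current chunk
    while sent < len(data):
        payload = data[sent:sent + chunk_size]
        sent += len(payload)
        yield sent, payload


# A 250-byte stream split into chunks of at most 100 bytes.
ids = [seq_id for seq_id, _ in chunk_stream(bytes(250), 100)]
print(ids)  # [100, 200, 250]
```

Note that the ID marks the end of each chunk, so the offset where a chunk begins is its sequence ID minus its payload length.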
Adaptors take a sequence ID as a parameter so that they can resume correctly after a crash without sending redundant data. When starting Adaptors, it's usually safe to specify 0 as the ID, but it's sometimes useful to specify something else. For instance, a nonzero starting ID lets you tail only the second half of a file.
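A minimal sketch of resuming from a given sequence ID, assuming a plain file as the data source. The helper `read_from_offset` is hypothetical and does not correspond to any Chukwa API; it only shows how a starting ID maps to a byte offset in the file.

```python
from typing import Iterator, Tuple


def read_from_offset(path: str, start_id: int,
                     chunk_size: int = 100) -> Iterator[Tuple[int, bytes]]:
    """Yield (sequence_id, payload) pairs for the bytes after start_id.

    start_id is the number of bytes already sent, so reading resumes
    at that byte offset and earlier data is not sent again.
    """
    with open(path, "rb") as f:
        f.seek(start_id)
        seq_id = start_id
        while True:
            payload = f.read(chunk_size)
            if not payload:
                break
            seq_id += len(payload)
            yield seq_id, payload
```

Under these assumptions, passing `os.path.getsize(path) // 2` as `start_id` reads only the second half of the file, and passing the last sequence ID recorded before a crash resumes where the previous run stopped.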
HBase Schema
Metrics
The chukwa table stores time series data.
Row Key
Component | Day | Metric MD5 | Source MD5 |
Size (bytes) | 2 | 6 | 6 |
The row key is 14 bytes long. The first 2 bytes are the day of the year. The next 6 bytes are derived from the MD5 signature of the metric name. The last 6 bytes are derived from the MD5 signature of the data source. This arrangement helps Apache Chukwa partition data evenly across regions based on time.
It also stores data from the same day and the same source compactly, close together.
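The 14-byte layout can be sketched as follows. The text does not say how the day is encoded or which 6 bytes of each MD5 digest are kept, so this sketch assumes a big-endian unsigned 16-bit day of year and the first 6 bytes of each digest. Both choices, and the names `md5_prefix` and `build_row_key`, are assumptions for illustration and may differ from what Chukwa actually writes.

```python
import hashlib
import struct
from datetime import date


def md5_prefix(text: str, length: int = 6) -> bytes:
    """Return the first `length` bytes of the MD5 digest (assumed truncation)."""
    return hashlib.md5(text.encode("utf-8")).digest()[:length]


def build_row_key(day: date, metric: str, source: str) -> bytes:
    """Build a 14-byte key: 2-byte day of year + 6-byte metric + 6-byte source."""
    day_of_year = day.timetuple().tm_yday  # 1..366
    # Assumption: day of year packed as a big-endian unsigned 16-bit integer.
    return struct.pack(">H", day_of_year) + md5_prefix(metric) + md5_prefix(source)


key = build_row_key(date(2024, 2, 1), "cpu.user", "host1.example.com")
print(len(key))  # 14
```

With this layout, all keys for one day share the same 2-byte prefix, and all keys for one day and one metric share the same 8-byte prefix.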
Metadata
The chukwa_metadata table is designed to store point lookup data, for example the small amount of data that describes the metric name mapping for the chukwa table. It is also used to store JSON blobs of dashboard data.
Row Key
Row Key | Description |
[Metrics Group] | Metrics group name. All metric names in the group can be fetched by loading this row key. |
chart_meta | All charts are stored in this row. |
dashboard_meta | All dashboards are stored in this row. |
widget_meta | All widgets are stored in this row. |
Special Row
chart_meta | Cell contains the rendering option and metric series name in a JSON blob |
dashboard_meta | Cell describes one dashboard view |
widget_meta | Cell describes title and URL of a dashboard widget |
Column Family
Column Family | Description |
k | Key. Associated with a fixed structure that describes the key type and the MD5 signature of the key used in the chukwa table. |
c | Column for storing JSON blobs for special rows. It is used to store dashboard, chart, and widget metadata. |
Key types for the k column family. The currently supported key types are:
Type | Description |
metric | This key is a metric name. |
source | This key is a source name. |