HDFS File System Structure
The general layout of the Chukwa filesystem is as follows.
/chukwa/
   archivesProcessing/
   dataSinkArchives/
   demuxProcessing/
   finalArchives/
   logs/
   postProcess/
   repos/
   rolling/
   temp/
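As a minimal sketch, the layout can be captured in a small lookup so that scripts refer to these directories by name instead of by hand-typed strings. The constant and function names here (CHUKWA_ROOT, SUBDIRS, chukwa_dir) are hypothetical and are not part of Chukwa; only the directory names come from the layout itself.

```python
import posixpath

# Root and subdirectory names are taken from the layout; everything else
# (constant names, the helper) is illustrative only.
CHUKWA_ROOT = "/chukwa"

SUBDIRS = (
    "archivesProcessing",
    "dataSinkArchives",
    "demuxProcessing",
    "finalArchives",
    "logs",
    "postProcess",
    "repos",
    "rolling",
    "temp",
)


def chukwa_dir(name, root=CHUKWA_ROOT):
    """Return the HDFS-style path of a top-level Chukwa directory."""
    if name not in SUBDIRS:
        raise ValueError(f"unknown Chukwa directory: {name!r}")
    # posixpath keeps forward slashes regardless of the local OS.
    return posixpath.join(root, name)
```

Rejecting unknown names early is a design choice of this sketch: a misspelled directory fails immediately rather than silently pointing at a path that does not exist.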
Raw Log Collection and Aggregation Workflow
What data is stored where is best described by stepping through the Chukwa workflow.
- Collectors write chunks to logs/*.chukwa files until a 64MB chunk size is reached or a given time interval has passed.
  - logs/*.chukwa
- Collectors close chunks and rename them to *.done.
  - from: logs/*.chukwa
  - to: logs/*.done
- DemuxManager checks for *.done files every 20 seconds.
  - If *.done files exist, moves files in place for demux processing:
    - from: logs/*.done
    - to: demuxProcessing/mrInput
  - The Demux MapReduce job is run on the data in demuxProcessing/mrInput.
  - If demux is successful within 3 attempts, archives the completed files:
    - from: demuxProcessing/mrOutput
    - to: dataSinkArchives/[yyyyMMdd]/*/*.done
  - Otherwise moves the completed files to an error folder:
    - from: demuxProcessing/mrOutput
    - to: dataSinkArchives/InError/[yyyyMMdd]/*/*.done
- PostProcessManager wakes up every few minutes and aggregates, orders, and deduplicates record files.
  - from: postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
  - to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt
- HourlyChukwaRecordRolling runs M/R jobs at 16 minutes past the hour to roll 5-minute logs up into hourly files.
  - from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt
  - to: temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]
  - to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt
  - leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/
- DailyChukwaRecordRolling runs M/R jobs at 1:30 AM to roll hourly logs up into daily files.
  - from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt
  - to: temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]
  - to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt
  - leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/
- Every half hour or so, ChukwaArchiveManager aggregates and removes dataSinkArchives data using M/R.
  - from: dataSinkArchives/[yyyyMMdd]/*/*.done
  - to: archivesProcessing/mrInput
  - to: archivesProcessing/mrOutput
  - to: finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*
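The path templates in the workflow above can be expanded mechanically. The following is a minimal sketch of that expansion, not Chukwa code: the function names are hypothetical, paths are relative to /chukwa/, and it rests on two assumptions that the text does not state outright, namely that [yyyyMMdd] and [HH] are a zero-padded date and 24-hour value, and that [N] is an integer part index.

```python
from datetime import datetime


def demux_archive_dir(ts, succeeded):
    """Directory that receives *.done files after a demux run.

    Follows the two 'to:' targets of the DemuxManager step: the dated
    archive on success, the InError folder otherwise.
    """
    day = ts.strftime("%Y%m%d")
    base = "dataSinkArchives" if succeeded else "dataSinkArchives/InError"
    return f"{base}/{day}"


def hourly_done_path(cluster, data_type, ts, part=0):
    """Expand the HourlyDone template; `part` stands in for [N] (assumed int)."""
    day = ts.strftime("%Y%m%d")
    hour = ts.strftime("%H")  # assumed zero-padded, 00-23
    return (
        f"repos/{cluster}/{data_type}/{day}/{hour}/"
        f"{data_type}_HourlyDone_{day}_{hour}.{part}.evt"
    )


def daily_done_path(cluster, data_type, ts, part=0):
    """Expand the DailyDone template; `part` stands in for [N] (assumed int)."""
    day = ts.strftime("%Y%m%d")
    return f"repos/{cluster}/{data_type}/{day}/{data_type}_DailyDone_{day}.{part}.evt"
```

For example, with a hypothetical cluster named demo, a data type SysLog, and a timestamp of 2010-03-05 07:42, hourly_done_path yields repos/demo/SysLog/20100305/07/SysLog_HourlyDone_20100305_07.0.evt.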