So today I received a request to help debug fair scheduler performance on one of our large Hadoop clusters. Normally this is the point where I would point to installing a tool like Cloudera Manager, but we do not have CM running anywhere within our environment. So I took a quick look around GitHub to see if anybody has written any scripts to monitor the fair scheduler allocations and found nothing. We currently have monitoring on the Hadoop / HDFS level, but are lacking visibility at the individual scheduler / pool level. Questions arise such as this that I cannot answer without a cool made-up story.
- My job ran very slow last night, can you take a look at the cluster to see if anything is wrong?
- This set of Oozie jobs normally takes 2 hours, but over the past few days it has been taking 4 hours easy run, why is that?
- Can you check the network? Looks like my jobs have been running very slowly.
Now at this point, I check the regular cluster health with our fleet of monitoring tools based on Graphite / OpenTSDB. Looking at HDFS health, I see no instances of failed datanodes, 100 mbit links, errors in the log files, et al. While looking at basic hadoop-metrics from our logging context, I see that the cluster is almost always 100% utilized on map slots / reduce slots. The next obvious question is ‘Who is running jobs stealing my resources?’. Before I can start to create fair-share policies, and pick winners and losers of precious cluster resources via preemption I need to know demand. I need to know how many jobs are running in each pool, and how many slots they require to finish there tasks. I would also love to monitor preemption requests to see when pools start killing other tasks to meet their fair-share or min-share. I set out to create a simple tool to query the http://jobtracker:50030/scheduler?advanced web interface on a timer, and send this metadata to Graphite / OpenTSDB on a real-time basis. I could then create visualizations to see demand and allocation to help craft fair share policies. It’s not perfect as it does not look at individual job performance (Input splits, min / max task completion time) but is really helpful on a high level.
I wanted something quick and easy to get done, so I took about 2 hours to create a Ruby Mechanize script to screen scrape the jobtracker fair-scheduler page. I then turn the output into a KeyValue format that I can use with OpenTSDB / Graphite. I did not want to create a lot of unique keys due to issues with Graphite creating a Whisper database per unique point. So capturing job-name, task-id is unacceptable. I instead aggregate metrics by user, or pool. In the case of our cluster job pool names are users. So I aggreagte the metrics by pool name, then send off to our visualiation system for further planning.
I created a project above if you would like to hack on the code. Currently I am using Diamond https://github.com/BrightcoveOS/Diamond to schedule checks via the UserScriptsCollector for ruby programs. It wants Key + Value, and fills in the date for you based on your scheduler. I can then send the metrics to both OpenTSDB + Graphite with the same system via Handlers. Below is some graphs of the things you can do with this information.
As you can see by the graph, it details a nice break down of slots utilization by fair scheduler pool. In this image, 3 different pools are racing for resources as the cluster is 100% utilized. You can dive into other metrics such as total tasks scheduled (map / reduce) vs. resources available at that time. Let me know if you find this tool helpful. Pull requests are always welcome.