A Blog by Jeremy Carroll

Hadoop Fair Scheduler Monitoring - Part 2

In my first blog post I went over setting up a simple fair scheduler monitoring script that uses mechanize to scrape stats from the JobTracker page. Now that all the metrics are in our Graphite system (I chose Graphite for its nice front-end visualizations), I want to show how we can put these metrics to work.

The goal was to measure requests for resources (map slots / reduce slots) across the different job queues. With that I can find the parts of the day where there is a lot of contention for resources and see whether a job can be rescheduled to a time when there are more slots available. Another option would be to change the fair scheduler minShare, preemption, and weight settings to help shape the SLA of running jobs. Before we start setting these policies it would be nice to know map / reduce slot demand over a timeline, and to look at ‘running’ map / reduce slots to see what resources the scheduler actually gave out during those periods. Here are some questions the metrics can answer for us.

  • During what times of the day is the cluster heavily over-subscribed?
    • Is the cluster using more map or reduce slots during these timeframes?
  • What queues / users are requesting resources?
    • How many jobs are they running?
    • How many Map / Reduce slots are they requesting?
    • How fast are they completing map / reduce tasks?
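
As a minimal sketch of how the first question could be answered from the raw numbers: assuming the demand samples have already been pulled out of Graphite as (unix timestamp, slots requested) pairs, and that the total slot capacity of the cluster is known, the samples can be bucketed by hour of day and the hours whose average demand exceeds capacity flagged. The function name, the sample format, and the capacity argument are all hypothetical and not part of the original script.

```python
from collections import defaultdict
from datetime import datetime, timezone


def oversubscribed_hours(samples, capacity):
    """Return {hour_of_day: average demand} for hours whose average demand exceeds capacity.

    samples  -- iterable of (unix_timestamp, slots_requested) pairs (assumed format)
    capacity -- total slots the cluster can run at once (assumed to be known)
    """
    by_hour = defaultdict(list)
    for ts, demand in samples:
        if demand is None:  # tolerate gaps in the collected series
            continue
        # Hours are computed in UTC; shift if the cluster's day is in another zone.
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        by_hour[hour].append(demand)

    averages = {hour: sum(vals) / len(vals) for hour, vals in by_hour.items()}
    return {hour: avg for hour, avg in sorted(averages.items()) if avg > capacity}
```

Running the same function once on map demand and once on reduce demand would also give a rough answer to the second question.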

I put together a series of visualizations to help debug an oversubscribed Hadoop cluster. This information was invaluable in determining the winners and losers of cluster resources. It does not help debug individual MapReduce jobs. For that you would have to look at individual job performance: spilling to disk, not using a combiner or secondary sort, sending too much uncompressed data over the wire, etc. But it does help determine overall cluster utilization.

Demand The image here represents map + reduce slots requested by running jobs over a 24 hour period. As jobs complete, the total number drops. When you start to see solid straight lines (plateaus) in the image, it means that no jobs have been added to or removed from the scheduler for this pool. The current jobs are just taking a while to complete. It is a nice way to get a big picture of how many tasks the cluster is running, by pool.
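
The plateaus can also be picked out programmatically rather than by eye. The following is only a sketch under the assumption that one pool's demand samples are available as a plain list in time order; `find_plateaus` and `min_length` are hypothetical names.

```python
def find_plateaus(values, min_length=3):
    """Return (start_index, end_index) pairs for runs of identical consecutive values.

    values     -- demand samples for one pool, in time order (assumed format)
    min_length -- shortest run, in samples, that counts as a plateau
    """
    plateaus = []
    start = 0
    for i in range(1, len(values) + 1):
        # Close the current run at the end of the list or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_length:
                plateaus.append((start, i - 1))
            start = i
    return plateaus
```

A long run at zero just means the pool is idle, so in practice it is the non-zero plateaus that point at slow jobs.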

Jobs This graph shows how many jobs are currently in the fair scheduler by pool. It is easy to see bursts of activity from end users. The primary purpose of this graph is to help tune the maxRunningJobs parameter per queue. It’s also nice to see the day / night patterns of activity during processing cycles, and to see whether somebody can shift a job by a couple of hours to take advantage of lower utilization periods.

Map Slots This metric shows the number of map slots running at any given time. It’s great for determining which pools are taking the majority of the resources. It also shows whether long running jobs are holding the majority of the queue for long periods of time. With preemption policies enabled, you can also see slots getting killed to free up space for a pool to meet its minShare / fairShare.

Map Speed This is one of my favorite graphs from this tool. You can see the total number of map slots requested versus the number of map tasks completed. Basically shows you job progress at a high level, and how fast the tasks are completing. Since this does not take into account new jobs entering and adding tasks, it’s a good high level gut check.
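
To put a number on "how fast", the completed-task counter can be turned into a rate. This is a rough sketch that assumes the samples are (unix timestamp, maps completed) pairs in time order; the function name and sample format are hypothetical. Because the counter drops when jobs leave the scheduler, a decrease is treated here as zero progress, which is a simplification.

```python
def completion_rate(samples):
    """Return a list of (timestamp, tasks_completed_per_minute) tuples.

    samples -- list of (unix_timestamp, maps_completed) pairs in time order (assumed format)
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed_minutes = (t1 - t0) / 60
        if elapsed_minutes <= 0:
            continue  # skip duplicate or out-of-order samples
        # The counter falls when jobs leave the scheduler; count that as no progress.
        rates.append((t1, max(c1 - c0, 0) / elapsed_minutes))
    return rates
```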

Maps Outstanding This is the same data as the previous graph, but it uses Graphite’s diffSeries function to subtract maps completed from the total map slots requested. That gives you the number of map tasks remaining to be processed.
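
For anyone who wants to pull the same series out of Graphite rather than look at it in the UI, here is a sketch that builds a render URL around `diffSeries`. The host name and the metric paths are assumptions made up for illustration; substitute whatever names your collection script writes into Graphite.

```python
from urllib.parse import urlencode


def outstanding_maps_url(graphite_host, pool, window="-24h"):
    """Build a Graphite render URL for total map tasks minus completed map tasks.

    The metric paths below are hypothetical placeholders, not the names
    used by the original monitoring script.
    """
    total = f"hadoop.fairscheduler.{pool}.maps_total"
    completed = f"hadoop.fairscheduler.{pool}.maps_completed"
    params = {
        "target": f"diffSeries({total},{completed})",  # first series minus the second
        "from": window,      # relative time window, e.g. the last 24 hours
        "format": "json",    # ask for datapoints instead of a rendered image
    }
    return f"http://{graphite_host}/render?{urlencode(params)}"
```

Swapping the two metric paths for their reduce-side equivalents would give the outstanding reduces in the same way.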

Queue For most of our jobs the map side completes quickly, but the job then blocks for long periods of time waiting for data to be shuffled across the network. So we usually have contention on reduce slots, not map slots. Using Graphite’s ability to draw a series on a second Y axis, I can look at the number of jobs in a queue against the number of reduces (completed + total). Flat lines tell me there is a long running reduce task, or a lot of intermediate output.

Reduce Slots I currently have a minShare policy on a queue that has a strict SLA. It was nice to see that the share was being met as shown by the blue line drawn. This queue always gets its slots via preemption. You can also see we have long periods of contention as 100% of the reduce slots are taken for hours at a time. Using this graph I am currently tuning the cluster to give up map slots in favor of reduce slots in addition to increasing the block size as map tasks are completing very quickly.

Queue Info This is the same graph as on the map side: reduces completed subtracted from the total number of reduce slots requested, which gives the number of outstanding reduces. Cross referencing this with our other network graphs shows a lot of intermediate data going across the network.

All in all I have had good success with a very simple metrics collection tool to help me identify some bottlenecks on my cluster setup. This is what I have come up with so far.

  • Increase the number of reduce slots. Checking other metrics shows we have network bandwidth + disk IOPS to spare.
  • Increase the amount of work done by a map task as they are completing very quickly.
  • Shuffle some job start times to take advantage of idle resources.

Let me know if anybody has a chance to use the tool to identify issues with their cluster. I’d love to hear some feedback.