Why HDFS Data Becomes Unbalanced
Data stored in HDFS can become unbalanced for any of the following reasons:
- Addition of Datanodes - When new datanodes are added to a cluster, newly created blocks are written to these datanodes from time to time. The existing blocks are not moved to them without using the HDFS Balancer. 
- Behavior of the Client - In some cases, a client application might not write data uniformly across the datanode machines. A client application might be skewed in writing data, and might always write to some particular machines but not others. HBase is an example of such a client application. In other cases, the client application is not skewed by design, for example, MapReduce or YARN jobs. - The data is skewed so that some of the jobs write significantly more data than others. When a Datanode receives the data directly from the client, it stores a copy to its local storage for preserving data locality. The datanodes receiving more data generally have higher storage utilization. 
- Block Allocation in HDFS - HDFS uses a constraint satisfaction algorithm to allocate file blocks. Once the constraints are satisfied, HDFS allocates a block by randomly selecting a storage device from the candidate set uniformly. For large clusters, the blocks are essentially allocated randomly in a uniform distribution, provided that the client applications write data to HDFS uniformly across the datanode machines. Uniform random allocation might not result in a uniform data distribution because of randomness. This is generally not a problem when the cluster has sufficient space. The problem becomes serious when the cluster is nearly full. 

