Reason behind using Manhattan distance
I think it comes down to noise rejection. For me the key is how a small change spread across many dimensions is weighted compared to a big change in a single dimension, as you demonstrated above. You can also think of it as the difference between the L1 and L2 distances and how each would affect a minimization problem. L1 is less sensitive to large noise: its gradient with respect to each coordinate has constant magnitude, so an outlier pulls no harder than any other error, and it is about as robust as you can get while still keeping the search space convex. The (squared) L2 distance has a gradient that grows with the error, so dimensions with more error are effectively weighted more heavily, and their contribution to the total grows quadratically rather than just in proportion to the error.
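To make the weighting concrete, here is a small sketch in plain Python. The vectors and the helper names (`l1`, `l2`, `l1_grad`, `sq_l2_grad`) are made up for illustration and are not taken from the question; the numbers are chosen so that the "spread" and "spike" changes have the same L2 distance.

```python
def l1(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    """Euclidean (L2) distance: square root of the sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def l1_grad(errors):
    """Per-coordinate gradient of sum(|e|): just the sign of each error."""
    return [(e > 0) - (e < 0) for e in errors]

def sq_l2_grad(errors):
    """Per-coordinate gradient of sum(e**2): grows linearly with each error."""
    return [2 * e for e in errors]

# Illustrative data (an assumption, not from the original post).
origin = [0.0] * 16
spread = [1.0] * 16          # small change in every one of 16 dimensions
spike = [4.0] + [0.0] * 15   # one big change in a single dimension

l1_spread, l1_spike = l1(origin, spread), l1(origin, spike)
l2_spread, l2_spike = l2(origin, spread), l2(origin, spike)

print("L1:", l1_spread, l1_spike)   # 16.0 vs 4.0
print("L2:", l2_spread, l2_spike)   # 4.0 vs 4.0

# Two small errors and one outlier.
errors = [0.5, -0.5, 10.0]
print("L1 gradient:        ", l1_grad(errors))     # [1, -1, 1]
print("squared-L2 gradient:", sq_l2_grad(errors))  # [1.0, -1.0, 20.0]
```

With these numbers L2 rates the single spike the same as the change spread over all dimensions, while L1 rates the spike four times lower. The gradients show the same thing from the optimization side: under L1 the outlier pulls with the same strength as the small errors, under squared L2 it dominates the update.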