UPDATE – 2/8/2012
Edit – 2/9/2012 (changed the RAM to reflect an upgrade to 16GB. I incorrectly stated 18GB)
We purchased a new, more powerful SAN (an HP P2000) and migrated our SQL Server over to it. I was excited to test this again on the new SAN to see if a faster disk subsystem made any improvement. This time I took a 60GB file from one LUN and copied it to a compressed folder on another LUN. We passed the 36GB mark (previously the maximum size file we could copy in this scenario) with no sign of the Dirty Pages getting close to the threshold. We passed 40GB… we passed 50GB… and just as I was about to stop the process, I noticed that the Dirty Pages began to rise, and rise quickly. We got to 57GB this time before we hit the same issues as before. Discussing this with a co-worker, I finally saw the light. We had not only changed our SAN but also increased the server's memory from 6GB to 16GB. This issue appears to be directly tied to the amount of memory on the server: when compressing and decompressing files, the work happens in memory, and when the cache fills up… that's it. You need enough memory on the machine to handle the compression/decompression, period. So our new threshold is 57GB with 16GB of memory.
In short, we had an issue that was caused by copying a large file to a compressed folder on a different LUN on our MSA1000 SAN. Basically the Windows Internal Cache Manager, which is a subsystem of the Memory Manager, gets filled up faster than the data can be written to disk. Since the Memory Manager/Cache Manager are GLOBAL to the system, this caused EVERYTHING on the system to come to a crawl while the uncommitted writes attempted to complete.
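The interaction above can be sketched as a toy model: a fast producer dirties cache pages quicker than a slow "lazy writer" can flush them, and once the dirty count crosses a threshold, new writes stall until flushing catches up. This is only an illustration of the mechanism, not Windows internals; the page counts and rates are made up.

```python
# Toy model of cache write throttling: a fast copy dirties pages, a slower
# lazy writer flushes them, and once the dirty count crosses the threshold
# every new write must wait.  All numbers here are illustrative only.

DIRTY_PAGE_THRESHOLD = 1000   # hypothetical CcDirtyPageThreshold analogue
PRODUCE_RATE = 50             # pages dirtied per tick (the fast file copy)
FLUSH_RATE = 10               # pages flushed per tick (the slow destination)

def simulate(ticks):
    dirty = 0
    throttled_ticks = 0
    for _ in range(ticks):
        if dirty >= DIRTY_PAGE_THRESHOLD:
            # Writes are throttled: only the lazy writer makes progress.
            throttled_ticks += 1
        else:
            dirty += PRODUCE_RATE
        dirty = max(0, dirty - FLUSH_RATE)
    return dirty, throttled_ticks

dirty, throttled = simulate(100)
print(f"dirty pages: {dirty}, ticks spent throttled: {throttled}")
```

Because the produce rate outruns the flush rate, the dirty count climbs steadily, hits the threshold, and the system spends most of its remaining time throttled, which is exactly what everything else on the server experiences as a crawl.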
This Microsoft article describes the issue that we had but unfortunately the fix doesn’t work in our situation. The issue that is described in the article is related to copying a large file from a fast disk to a slower disk. In our case, the source disk and destination disk are both the exact same speed. What was making the writes slower at the new location was the compression. Even though the file is compressed at the source location, during the copy, the file is decompressed and then re-compressed. Nothing I could find directly addressed this scenario.
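The extra work on the copy path can be shown with an analogy using Python's zlib. To be clear, NTFS compression uses LZNT1, not zlib, and this is not the Windows code path; it is just the same shape of operation: the compressed source is fully decompressed into memory and then recompressed before anything lands at the destination.

```python
import zlib

# Analogy only: NTFS uses LZNT1, not zlib, but the copy path has the same
# shape -- the source is decompressed into memory and the result is
# recompressed before it hits the destination disk.

original = b"highly compressible payload " * 10_000
stored_at_source = zlib.compress(original)     # file as it sits on the source LUN

# "Copying" to another compressed folder: decompress, then recompress.
in_memory = zlib.decompress(stored_at_source)  # the full-size data the cache holds
stored_at_dest = zlib.compress(in_memory)

assert in_memory == original
print(f"uncompressed in memory: {len(in_memory):,} bytes; "
      f"compressed on disk: {len(stored_at_dest):,} bytes")
```

Note that the in-memory copy is the *uncompressed* size, which is why a large file can swamp the cache even when it is small on disk at both ends.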
To prove this issue was happening, I installed the Debugging Tools for Microsoft Windows on the affected server and ran !defwrites as the article suggested. I started the copy and kept running !defwrites every few minutes, and sure enough CcTotalDirtyPages continued to increase until it hit the threshold, as you can see below. At that point the throttling kicked in and began slowing everything down so that the outstanding writes could attempt to finish.
*** Cache Write Throttle Analysis ***
CcTotalDirtyPages: 764841 ( 3059364 Kb)
CcDirtyPageThreshold: 764834 ( 3059336 Kb)
MmAvailablePages: 413392 ( 1653568 Kb)
MmThrottleTop: 450 ( 1800 Kb)
MmThrottleBottom: 80 ( 320 Kb)
MmModifiedPageListHead.Total: 401989 ( 1607956 Kb)
CcTotalDirtyPages >= CcDirtyPageThreshold, writes throttled
Check these thread(s): CcWriteBehind(LazyWriter)
Check critical workqueue for the lazy writer, !exqueue 16
Cc Deferred Write list: (CcDeferredWrites)
File: fffffadf6c02ff40 Event: fffffadf5d0a7548
File: fffffadf6c0c2600 Event: fffffadf5cefc548
File: fffffadf6c715370 Event: fffffadf5b819708
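The arithmetic in that output is easy to sanity-check. The page counts are copied straight from the !defwrites output above; the 4 KB page size is an assumption (it is the standard size on x86/x64 Windows):

```python
PAGE_KB = 4  # assumed page size on x86/x64 Windows

cc_total_dirty_pages = 764841     # from the !defwrites output
cc_dirty_page_threshold = 764834  # from the !defwrites output

print(cc_total_dirty_pages * PAGE_KB)       # Kb figure for CcTotalDirtyPages
print(cc_dirty_page_threshold * PAGE_KB)    # Kb figure for CcDirtyPageThreshold

# The throttling condition reported on the last line of the analysis:
print(cc_total_dirty_pages >= cc_dirty_page_threshold)
```

Both products match the Kb figures in the output (3059364 and 3059336), and the dirty-page total is just past the threshold, which is why the last line reports "writes throttled".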
The workaround in our case was to ensure that compression was turned off on any destination folder containing files over about 36GB. Testing showed that anything larger caused insufficient-resources errors during the write operation (found by running Process Monitor) and eventually brought the server to its knees.
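To keep that workaround honest, something like the following sketch can flag any file over the size cutoff sitting under a folder you intend to compress. The 36GB cutoff is the one we measured on this hardware; the example path in the comment is hypothetical.

```python
import os

SIZE_CUTOFF = 36 * 1024**3  # ~36GB -- the limit we measured on this server

def files_over_cutoff(root, cutoff=SIZE_CUTOFF):
    """Return (path, size) for every file under root larger than cutoff."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is inaccessible; skip it
            if size > cutoff:
                hits.append((path, size))
    return hits

# Example usage (hypothetical path):
# for path, size in files_over_cutoff(r"D:\CompressedShare"):
#     print(f"{path}: {size / 1024**3:.1f} GB -- exclude from compression")
```

Anything this reports should either be moved out of the compressed folder or have compression turned off for its folder before the next big copy.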
Following are some other related links that I found while researching this problem.