Tuesday, June 6, 2023

Walk through of a bug in the Volume Shadow Copy driver - volsnap.sys

Background

An issue was detected with shadow copies when using Arsenal Image Mounter as part of an analysis of a disk image. Intensive debugging followed and after some time it was clear that the bug was in volsnap.sys and Arsenal Image Mounter had absolutely nothing to do about the issue. The main challenge during the investigation was that the issue could not be reliably reproduced, and only in rare cases until it was further narrowed down.

What

The bug in the Volume Shadow Copy driver is one that anyone dealing with shadow copies should be aware of. It is important because the bug will delete the shadow copies on the volume and during a system analysis this may affect your work if not detected. The bug has only been observed on Windows 11 in the current version 10.0.22621.1 of the driver. For completeness here is the full set of versions I have tested: 

The bug was reported to Microsoft on May 11 2023.

Have made a GitHub project at Volsnap Bug Content with some content to accompany this text. References to it are made where applicable.

Symptom

After bringing a volume online there are no shadow copies visible. The volume must be mounted writable. Two error messages are written to the event log and the volume shadow copies are deleted.

Error 1


Event ID: 15
Message: The shadow copies of volume E: were aborted because of insufficient paged heap.




Error 2


Event ID: 27
Message: The shadow copies of volume E: were aborted during detection because a critical control file could not be opened.





Both of the two example events have been exported to xml with all details included and are available as part of the content provided at Volsnap Bug Content. We will further down see how the exact (and slightly confusing) error messages are constructed.

Due to the non deterministic nature of the bug it seems to happen randomly when a volume is brought online. That means the bug may be triggered the first time a volume is mounted, or it may happen on any later re-mount of the volume. See the end for how to reproduce.

Workaround

When the target volume is natively connected and mounted there is no real workaround since the shadow copies on the volume will be invalidated on the source volume when the bug is triggered. However, users of Arsenal Image Mounter that are working with disk images can mount an image in write temporary mode. If the bug is triggered, the image can be dismounted and then remounted which will effectively reset the volume back to the original state.

Deep dive

During the initial phase when a volume with shadow copies is brought online, the volsnap!VspActivateSnapshots is called and volsnap!VspCreateSnapshotExtension function is called once for every shadow copy. One of the things this function is responsible for is to call the kernel to create the device such as \Device\HarddiskVolumeShadowCopy236. 

Relevant section seen in IDA:



Then it will create the symbolic link such as \GLOBAL??\HarddiskVolumeShadowCopy236.

Relevant section seen in IDA:



Following this in WinDbg I've verified both the kernel calls  nt!IoCreateDevice and nt!IoCreateSymbolicLink will succeed with NTSTATUS 0 (STATUS_SUCCESS).

Towards the end of volsnap!VspCreateSnapshotExtension the worker thread is started after calling volsnap!VspCreateWorkerThread which ultimately prepares all the low level stuff for the shadow copy to be fully working.

First error in ZwOpenFile

 # Child-SP          RetAddr           Call Site
00 ffff9e88`31efb940 fffff801`5f66c2d5 volsnap!VspBypassIoExtension+0xd6
01 ffff9e88`31efbac0 fffff801`5f697777 volsnap!VspBypassIoExtensionDisable+0x9
02 ffff9e88`31efbaf0 fffff801`5f670a8c volsnap!VspSetIgnorableBlocksInBitmapWorker+0x1b7
03 ffff9e88`31efbbb0 fffff801`59c0dc67 volsnap!VspWorkerThread+0xdc
04 ffff9e88`31efbbf0 fffff801`59e2e304 nt!PspSystemThreadStartup+0x57
05 ffff9e88`31efbc40 00000000`00000000 nt!KiStartSystemThread+0x34

The first error happen in volsnap!VspBypassIoExtension when calling nt!ZwOpenFile to create a handle to the \Device\HarddiskVolumeShadowCopy236 which result in rax=00000000c000000e. This is strange since I verified the GLOBAL directory has correctly been updated with the symbolic link at the time of this call. The bad NTSTATUS is handled in volsnap!VspSetIgnorableBlocksInBitmapWorker which prepares the content for the event log entry.


The root cause of this first error is unknown, however since the unaffected versions of volsnap.sys does not contain any of the three VspBypassIo* functions it might be a clue:
  • VspBypassIoCommon
  • VspBypassIoExtension
  • VspBypassIoExtensionDisable
Side note: 
In the absence of an error in volsnap!VspBypassIoExtension a call is made to volsnap!VspBypassIoCommon which essentially calls ZwFsControlFile using the same handle and control code 90448h. The control code 90448h resolves to this mysterious entry in ntifs.h of WDK for version 10.0.22621.0:



Due to this first error, a larger structure (VS_VOLUME_EXTENSION->VS_FILTER_EXTENSION ?) containing lots of details about the volume and shadow copies will be out of sync with regards to to the expected number of handles. This is what ultimately causes the second error.

Second error in ZwOpenFile

 # Child-SP          RetAddr           Call Site
00 ffff9e88`32f645d0 fffff801`5f67becc volsnap!VspOpenFilesAndValidateSnapshots+0x290
01 ffff9e88`32f646e0 fffff801`59c162c5 volsnap!VspAdjustBitmap+0x4cc
02 ffff9e88`32f64a00 fffff801`59c0dc67 nt!ExpWorkerThread+0x155
03 ffff9e88`32f64bf0 fffff801`59e2e304 nt!PspSystemThreadStartup+0x57
04 ffff9e88`32f64c40 00000000`00000000 nt!KiStartSystemThread+0x34

The second error happens in nt!ZwOpenFile inside volsnap!VspOpenFilesAndValidateSnapshots while trying to create a handle to shadow copy file such as;
\Device\HarddiskVolume47\System Volume Information\{9b623041-e719-11ed-8832-80fa5b5eb89e}{3808876b-c176-4e48-b7ae-04046e6cc752}

Now, this seems to be a consequence of a series of unfortunate events.

1.
After the first error is detected in volsnap!VspSetIgnorableBlocksInBitmapWorker a call is made to volsnap!VspDestroyAllSnapshots. Because of the nature of the first error, this call will fail.

2.
Because volsnap!VspDestroyAllSnapshots fails, the code takes a jump here;

Then jumps to:

This is the starting point for the stack of the thread for the second error as shown above.

3.
While in volsnap!VspOpenFilesAndValidateSnapshots and iterating the structure, calls are being done to nt!ObfReferenceObject on every shadow copy's object such as HarddiskVolumeShadowCopy36. Now, due to the first error the expected number of shadow copies are out of sync, and when there are more than 1 shadow copy the function will eventually try to associate multiple (shadow copy) files on the file system to the same object.  When this scenario is reached, the nt!ZwOpenFile fails with rax=00000000c0000034.

The specific error is eventually detected (though it seems too late) and the code reaches this location where the detail for the event log entry is prepared here:

PAGELK:00000001C003BF2F                 mov     rax, 0AD00011381h
4.
Then, the code ends up calling volsnap!VspDestroyAllSnapshots a second time when most of it is already in an awful state.

5.
The combined result of all the above issues, is that volsnap!VspDestroyAllSnapshots will leave ghost shadow copy files on the file system that can't be deleted.


In the above screenshot the shadow copy master file and the oldest shadow copy has been deleted, leaving the nine most recent shadow copy files as hidden ghost files.

The only positive thing for the users in the scenario where the workaround is not possible, is that some of the data may be recoverable with some effort.

Decoding the event log entry details


Error 1

PAGELK:00000001C00477A3                 mov     rax, 5F0001211Ah
First we split up the 5F0001211A value into 5F-0001-211A and then we can can give the individual pieces some interpretation (decimal conversion and text mapping):
  • 5F = 95 (SourceTag)
  • 0001 = 1 (SourceFileID)
  • 211A = 8474 (SourceLine)
Here is the relevant section of the exported event:


The NTSTATUS value 0xC000000E (STATUS_NO_SUCH_DEVICE) which produces the error message "The shadow copies of volume E: were aborted because of insufficient paged heap." makes little sense here and is just confusing.

Error 2

PAGELK:00000001C003BF2F                 mov     rax, 0AD00011381h
We do the same and split AD00011381 into AD-0001-1381 and then we can give the individual pieces the following interpretation:
  • AD = 173 (SourceTag)
  • 0001 = 1 (SourceFileID)
  • 1381 = 4993 (SourceLine)
Here is the relevant section of the exported event:


The NTSTATUS 0xC0000034 (STATUS_OBJECT_NAME_NOT_FOUND) which produces the error message "The shadow copies of volume E: were aborted during detection because a critical control file could not be opened." makes some more sense here since a handle to the shadow copy file could not be created.

With this information you have some good hints on how to locate such event log error messages in the code disassembly.

Reproducing the bug

At Volsnap Bug Content a few handcrafted vhd's have been prepared with a minimal file system and shadow copies containing just small single files to keep it simple. Execute the included powershell script and pass arguments for the vhd file and vsc count. The script will enter a loop to mount and dismount the vhd. Eventually the bug will get triggered with the correct version of the driver and the script will stop when the vsc count has become unexpected.