Windows Subystem For Linux – File Reads

I’ve read a bit about the way that Microsoft has implemented the Windows Subsystem for Linux (WSL). A one sentence summary: all system calls in a WSL process are fulfilled by a special kernel driver called a pico provider. I wanted to flesh out this knowledge from the perspective of a minifilter sitting on the file system stack.

For this research, I am using the following Windows 10 RS2 preview build.
Windows 10 Kernel Version 15007 MP (2 procs) Free x64
Built by: 15007.1000.amd64fre.rs_prerelease.170107-1846

Installation

Here is an article on how to get bash up and running on Windows 10 RS1 and later. One major point that I initially missed is that this is only supported on 64-bit versions of Windows.

Mostly Normal

Once everything is up and running, the bash process appears in Task Manager as normal.

bash_task_manager

First, let’s use Procmon to verify that WSL does use the normal file system stack.

$ read < "a.txt"

bash_read_procmon

That’s a good start but I had two questions after seeing the above output:

  1. Why is the process name missing?
  2. Why is there a CREATE with no permissions requested?

Pico Processes

I’m using v3.31 of Procmon and it’s not able to show the process name for a WSL process. We know that WSL processes are implemented as pico processes so this makes some sense. Even WinDBG gets confused:

0: kd> !process 0n440
Searching for Process with Cid == 1b8
PROCESS ffffa90bbef49080
 SessionId: 1 Cid: 01b8 Peb: 00000000 ParentCid: 074c
 DirBase: 79ab9000 ObjectTable: ffffc00afb7b7540 HandleCount: 0.
 Image: System Process
 VadRoot ffffa90bc07b7960 Vads 370 Clone ffffa90bc0d530f0 Private 88. Modified 924. Locked 1.
 DeviceMap 0000000000000000
 Token ffffc00afa9f0960
 ElapsedTime 00:39:46.612
 UserTime 00:00:00.000
 KernelTime 00:00:00.000
 …
 
 THREAD ffffa90bbd889080 Cid 01b8.1b24 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
 ffffa90bbeef81e0 NotificationEvent
 ffffa90bc0409bd0 NotificationEvent
 Not impersonating
 Owning Process ffffa90bbef49080 Image: System Process
 Attached Process N/A Image: N/A
 Wait Start TickCount 299722 Ticks: 2439 (0:00:00:38.109)
 Context Switch Count 1553 IdealProcessor: 1 
 UserTime 00:00:00.031
 KernelTime 00:00:00.906
 Stack Init ffffd400f8378c90 Current ffffd400f8378310
 Base ffffd400f8379000 Limit ffffd400f8373000 Call 0000000000000000
 Priority 8 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
 Child-SP RetAddr Call Site
 ffffd400`f8378350 fffff801`7fa77867 nt!KiSwapContext+0x76
 ffffd400`f8378490 fffff801`7fa76f90 nt!KiSwapThread+0x477
 ffffd400`f8378540 fffff801`7fa76583 nt!KiCommitThreadWait+0x160
 ffffd400`f83785e0 fffff80b`b452e1b6 nt!KeWaitForMultipleObjects+0x203
 ffffd400`f83786c0 fffff80b`b452e379 LXCORE!LxpThreadWait+0x126
 ffffd400`f8378770 fffff80b`b44c6d38 LXCORE!LxpThreadWaitEx+0x49
 ffffd400`f83787d0 fffff80b`b45480e4 LXCORE!LxpDevTerminalRead+0x170
 ffffd400`f8378860 fffff80b`b45251e1 LXCORE!VfsFileRead+0x1b4
 ffffd400`f8378900 fffff80b`b451e22c LXCORE!LxpSyscall_READ+0x101
 ffffd400`f83789d0 fffff80b`b453be76 LXCORE!LxpSysDispatch+0x16c
 ffffd400`f8378aa0 fffff801`800ac2c0 LXCORE!PicoSystemCallDispatch+0x16
 ffffd400`f8378ad0 fffff801`7fb8a392 nt!PsPicoSystemCallDispatch+0x20
 ffffd400`f8378b00 00007f06`a24e69b0 nt!KiSystemServiceUser+0x82 (TrapFrame @ ffffd400`f8378b00)
 00007fff`fcb27498 00000000`004aee97 0x00007f06`a24e69b0
 00007fff`fcb274a0 00000000`00000000 0x4aee97

Just for confirmation, this is indeed a Pico Process and so LXCORE must be the Pico Provider.

0: kd> dt _EPROCESS ffffa90bbf26a500

nt!_EPROCESS
 …
 +0x6d4 Minimal : 0y1
 …
 +0x718 PicoContext : 0xffffc00a`fa03c000 Void
 …
 +0x818 PicoCreated : 0y1

Bash Process

There is actually a win32 bash.exe process on the system but my understanding is that it is only responsible for launching the WSL pico processes.

PROCESS ffffa90bc17b9340
 SessionId: 1 Cid: 0c4c Peb: 4ac55f3000 ParentCid: 0f80
 DirBase: 1ebf2000 ObjectTable: ffffc00afbb466c0 HandleCount: 90.
 Image: bash.exe

0: kd> dt _EPROCESS ffffa90bc17b9340
nt!_EPROCESS
 …
 +0x6d4 Minimal : 0y0

I/O Pattern

In the Procmon output, we see two CREATE operations. The first CREATE opens the file without any permissions and the second opens the same file with a number of permissions. As it turns out, the two CREATE operations are very much related and not in an obvious way.

I used WinDBG to set a conditional breakpoint in my minifilter PreCreate operation:

bp Myfilter!PreCreate ".if (ffffa90bbef49080== $proc) {} .else {gc}"

The file object in the first CREATE operation looks completely normal. This is a relative open to a directory that has already been opened (my Procmon filter excluded this original open).

1: kd> dx -r1 (*((FLTMGR!_FILE_OBJECT *)0xffffa90bc183b530))
(*((FLTMGR!_FILE_OBJECT *)0xffffa90bc183b530)) [Type: _FILE_OBJECT]
 [+0x000] Type : 5 [Type: short]
 [+0x002] Size : 216 [Type: short]
 [+0x008] DeviceObject : 0xffffa90bbe7c8b50 : Device for "\Driver\volmgr" [Type: _DEVICE_OBJECT *]
 [+0x010] Vpb : 0x0 [Type: _VPB *]
 [+0x018] FsContext : 0x0 [Type: void *]
 [+0x020] FsContext2 : 0x0 [Type: void *]
 [+0x028] SectionObjectPointer : 0x0 [Type: _SECTION_OBJECT_POINTERS *]
 [+0x030] PrivateCacheMap : 0x0 [Type: void *]
 [+0x038] FinalStatus : 0 [Type: long]
 [+0x040] RelatedFileObject : 0xffffa90bc19e3df0 [Type: _FILE_OBJECT *]
 [+0x048] LockOperation : 0x0 [Type: unsigned char]
 [+0x049] DeletePending : 0x0 [Type: unsigned char]
 [+0x04a] ReadAccess : 0x0 [Type: unsigned char]
 [+0x04b] WriteAccess : 0x0 [Type: unsigned char]
 [+0x04c] DeleteAccess : 0x0 [Type: unsigned char]
 [+0x04d] SharedRead : 0x0 [Type: unsigned char]
 [+0x04e] SharedWrite : 0x0 [Type: unsigned char]
 [+0x04f] SharedDelete : 0x0 [Type: unsigned char]
 [+0x050] Flags : 0x20002 [Type: unsigned long]
 [+0x058] FileName : "a.txt" [Type: _UNICODE_STRING]
 [+0x068] CurrentByteOffset : {0} [Type: _LARGE_INTEGER]
 [+0x070] Waiters : 0x0 [Type: unsigned long]
 [+0x074] Busy : 0x0 [Type: unsigned long]
 [+0x078] LastLock : 0x0 [Type: void *]
 [+0x080] Lock [Type: _KEVENT]
 [+0x098] Event [Type: _KEVENT]
 [+0x0b0] CompletionContext : 0x0 [Type: _IO_COMPLETION_CONTEXT *]
 [+0x0b8] IrpListLock : 0x0 [Type: unsigned __int64]
 [+0x0c0] IrpList [Type: _LIST_ENTRY]
 [+0x0d0] FileObjectExtension : 0x0 [Type: void *]

The file object in the second CREATE is also a relative open, but relative to the file itself!

0: kd> dx -r1 (*((FLTMGR!_FILE_OBJECT *)0xffffa90bbe60bef0))
(*((FLTMGR!_FILE_OBJECT *)0xffffa90bbe60bef0)) [Type: _FILE_OBJECT]
 [+0x000] Type : 5 [Type: short]
 [+0x002] Size : 216 [Type: short]
 [+0x008] DeviceObject : 0xffffa90bbe7c8b50 : Device for "\Driver\volmgr" [Type: _DEVICE_OBJECT *]
 [+0x010] Vpb : 0x0 [Type: _VPB *]
 [+0x018] FsContext : 0x0 [Type: void *]
 [+0x020] FsContext2 : 0x0 [Type: void *]
 [+0x028] SectionObjectPointer : 0x0 [Type: _SECTION_OBJECT_POINTERS *]
 [+0x030] PrivateCacheMap : 0x0 [Type: void *]
 [+0x038] FinalStatus : 0 [Type: long]
 [+0x040] RelatedFileObject : 0xffffa90bc183b530 [Type: _FILE_OBJECT *] <-- Hey look, it's the first file object!
 [+0x048] LockOperation : 0x0 [Type: unsigned char]
 [+0x049] DeletePending : 0x0 [Type: unsigned char]
 [+0x04a] ReadAccess : 0x0 [Type: unsigned char]
 [+0x04b] WriteAccess : 0x0 [Type: unsigned char]
 [+0x04c] DeleteAccess : 0x0 [Type: unsigned char]
 [+0x04d] SharedRead : 0x0 [Type: unsigned char]
 [+0x04e] SharedWrite : 0x0 [Type: unsigned char]
 [+0x04f] SharedDelete : 0x0 [Type: unsigned char]
 [+0x050] Flags : 0x20002 [Type: unsigned long]
 [+0x058] FileName : "" [Type: _UNICODE_STRING]
 [+0x068] CurrentByteOffset : {0} [Type: _LARGE_INTEGER]
 [+0x070] Waiters : 0x0 [Type: unsigned long]
 [+0x074] Busy : 0x0 [Type: unsigned long]
 [+0x078] LastLock : 0x0 [Type: void *]
 [+0x080] Lock [Type: _KEVENT]
 [+0x098] Event [Type: _KEVENT]
 [+0x0b0] CompletionContext : 0x0 [Type: _IO_COMPLETION_CONTEXT *]
 [+0x0b8] IrpListLock : 0x0 [Type: unsigned __int64]
 [+0x0c0] IrpList [Type: _LIST_ENTRY]
 [+0x0d0] FileObjectExtension : 0x0 [Type: void *]

It is perfectly legal to open a file relative to itself, but it’s also easy to mistake this pattern for a volume open. Alex Carp has a great article that helped me understand the pattern a little better. An excerpt:

reopen (FileObject->FileName is empty (Length == 0 and Buffer == NULL) but RelatedFileObject is not null). This is used when the caller wants to open a new handle to an existing FILE_OBJECT. This is not the same as opening a new handle to the existing FILE_OBJECT (duplicating the handle) because the end result of this is to open a new FILE_OBJECT for the same underlying stream, and the two FILE_OBJECTs are not linked in any other way (for example, a filter might want to open its own handle to a user file without bothering to figure out if the user has enough access (if the minifilter might want to write to the file and the original handle didn’t allow it then duplicating the handle is more complicated) or without interfering with the additional information stored in the FILE_OBJECT (like the current pointer position) and so on. According to the FASTFAT source code this should work for volume opens as well. This actually happens occasionally so filters should be prepared to deal with it.

Conclusion

I’m happy that things look fairly normal. The WSL read pattern is a little strange, and may require some tweaking in a minifilter that tries to detect volume opens.

Posted in Debugging | Leave a comment

Why does Notepad++ hang when I open it?

I am generally happy with Notepad++ but every now and then I find that it takes a long time to open. Yesterday, I started seeing a 20-second hang and after opening Notepad++ a few times, I became frustrated enough to troubleshoot the problem. The good news is that it took about 10 minutes to figure out the problem, which is a good ROI — once I open Notepad++ about 30 more times, I will have made back the time spent fixing it.

You can read the basics of using Windows Performance Recorder (WPR) and Windows Performance Analyzer (WPA) for wait analysis here: Why does Explorer hang when I right-click on a folder? If you’ve never done this before, it may take a couple of hours to get everything set up but it’s a one-time task.

Let’s pick up after I’ve already collected the trace and opened it in WPA. My first step is to look at the System Activity\UI Delays graph, which shows that Notepad++ is failing to check for Window messages for 21.9 seconds. That matches the delay I’m seeing in opening it.

notepad++_hang_ui_delay

Next, I load up the Computation\CPU Usage (Precise) graph, since it has information on every context switch.  From the UI Delays graph, I know that the hanging UI thread ID is 94092 so I set up my table as below.

notepad++_hang_cpu_precise_startIn the table, we see that the total wait time for the thread is about what we expect – 21 seconds. The longest wait is about 1 second. Next, I drill down into the thread, each time following the stack with the longest total wait time, eventually stopping when I find something “interesting”:

notepad++_hang_cpu_precise_borneo3

Above, we see XMLTools-borneo3.dll is calling the WinSock select function. MSDN says that “The select function determines the status of one or more sockets, waiting if necessary, to perform synchronous I/O.” It does appear that the function is waiting, since we see the call to WaitForSingleObject.

The other strange thing is that borneo3 is the name of my home machine. A quick look at the System Activity\Images table shows me that XMLTools-borneo3.dll is being loaded from the Notepad++ plugins directory.

At this point, I connect the dots. I use OneDrive to sync my tools across different machines. Evidently, it had a conflict when syncing and backed up the XMLTools.dll to a new name.

Once I deleted the XMLTools-borneo3.dll, the hang during open went away. Of course, it’s not clear exactly what the problem is or why it doesn’t happen with the original XMLTools.dll but as the gambler says, you gotta know when to fold them. Back to work!

Posted in Hang, Performance, Troubleshooting | Leave a comment

Symbols Do Not Load in Windows Performance Analyzer (WPA)

The Windows Performance Toolkit (WPT) has been updated with the release of Windows 10 and I’ve just upgraded. There seem to be some minor improvements, but I noticed pretty quickly that I was unable to load symbols in Windows Performance Analyzer (WPA). This is not a new problem, but I had forgotten how to fix it. Here’s an article for future me and anyone else having this problem.

BTW, Bruce Dawson is the internet’s performance guru and has written a similar article that you may want to skip to. In this article, we conclude where his article begins.

How are Symbols Loaded?

The first symptom of this problem is that when I load symbols, WPA very quickly goes through all binaries and claims that symbol load is complete. I already have some cached symbols so I see symbols for some binaries, and ? marks for others.

A quick look at the WPA Diagnostic Console shows us the crux of the problem:

DBGHELP: SymSrv load failure: symsrv.dll

We know that applications access symbol functionality via the DbgHelp API and WPA is no exception. If we take a look using Process Monitor, we see that WPA is loading dbghelp.dll from the system directory. In turn, dbghelp.dll appears to load symsrv.dll from the same directory, which fails.

wpa_symbol_load_procmon

Fix it!

At this point, we have enough information to solve the problem. You should already have Debugging Tools for Windows installed. If not, the aforementioned article by Bruce Dawson might be a quicker fix.

The solution is simple:

  1. Figure out if you are using the x86 or x64 version of the Windows Performance Toolkit.

    This is easy on x86 builds of Windows. On x64 builds, you can check the Task Manager for the *32 tag. If it’s not there, then you’re running the x64 version.

    wpa_symbol_load_task_manager

    Note that WPT always installs to Program Files (x86) regardless of architecture.

  2. Copy the dbghelp.dll and symsrv.dll files from the correct debugger directory to the Windows Performance Toolkit directory.  On my system, the relevant directories are:
    C:\Program Files (x86)\Windows Kits\10\Debuggers\x64 and 
    C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit
  3. Restart Windows Performance Analyzer so that the correct version of dbghelp.dll is picked up.
Posted in Performance | Tagged , , , | Leave a comment

Recovering a Deleted Draft in Gmail

So, here’s a funny thing about Gmail. If you delete a regular email, it gets moved to the Trash folder. However, if you discard a draft, it just vanishes in a puff of smoke. My wife discovered this frustrating behaviour tonight after spending an hour writing a message to her sister via gmail.com. Something went wrong when she tried to insert a picture and she hit the Discard Draft button in haste.

Given that the draft is not moved into the Trash folder when this happens, there are only a few options for recovery:

1. The “Undo” text appears on-screen, but only until you click on another folder / message. It’s easy to panic and click on something else once you realize that you’ve just discarded your draft.

2. If you have another device hooked up to the same account, you can quickly put it into airplane mode and perhaps recover the draft.

3. In some cases, the back button *may* work.

None of these worked in our case and so I thought of a hail mary option:

4. Scan the browser process memory for the discarded draft. This can only work if you have not yet closed the Gmail browser tab, and if the process memory has not yet been overwritten or reclaimed. Also, there need to be some reasonably obscure words in the text you are looking for (‘bazaar’ was the winner in my wife’s draft).

There are lots of tools to read process memory; I used HxD to search the process memory for keywords from the draft.

hxd0

Actually, HxD is a little clunky since it doesn’t show you the process ID of the process you are viewing. Incidentally, you can get the process ID for a Chrome tab via the Chrome Task manager (right-click on the tab bar).

hxd1

After about 9 attempts, I found the correct chrome.exe process and found the text of the discarded draft. Of course, by this time, my wife had already introduced another mitigation:

5. Rewrite the email.

hxd2

Posted in Data Recovery | 104 Comments

Vmware-hostd Listening on HTTPS Port 443

Recently, I needed to stand up a web server on my development machine to do some testing. Unfortunately, when I tried to bind to the default HTTPS port (443), I found out that some other process on my machine was already using it.

I used SysInternals TCPView tool to figure out who was using the port:

vmware-hostd1

Okay, now what? After a little googling, it turns out that VMWare Workstation has a feature called “Shared VMS” that I have no need for.

I disabled this feature under VMWare Workstation -> Edit -> Preferences and vmware-hostd.exe immediately stopped listening on port 443. Onwards with my testing!

vmware-hostd2

Posted in Uncategorized | Leave a comment

Debugging Windows Service Startup Using Procdump

tl;dr: Sleeping and attaching a debugger? Meh. Writing copious log files? Meh. In the case of a crashing service, it’s much easier to collect the crashdump and analyze.

If you’ve spent much time developing Windows Service, you’ve probably run into the case where your service mysteriously crashes while it is starting. In cases like these, Windows isn’t always particularly helpful. For example, here’s what Windows has to say about my Crashy Service:

service_crash_eventlog

“The Crashy Service service terminated unexpectedly.” Thanks Windows!

Since the crash is happening during service startup, it’s a little tricky to attach a debugger (or downright impossible in customer environments). Instead, as an exercise, let’s use the Sysinternals procdump tool to collect the crash dump and Visual Studio 2013 to figure out what went wrong.

Here are the steps:

1. We set procdump as the system postmortem debugger using the -i flag. This allows procdump to write out the state of the process before it crashes.

service_crash_procdump

2. Now we reproduce the service startup crash. Before the service crashes, procdump will write the crashdump to the directory where we ran the -i command in the previous step.

3. (Optional) Revert the system postmortem debugger to the previous value using the procdump -u command.

service_crash_procdump_uninstall

4. Open the .dmp crashdump file in Visual Studio 2013. When I do this, I see an initial screen that tells me a bit more about the exception that led to the crash. I then have the option of digging into the crashdump via the “Debug with Native Only” action.

service_crash_vs00

service_crash_vs0

5. When I click on Debug with Native Only, I get access to the call stack as well as the exact line of code that caused the crash (give or take — nobody’s perfect!) In this case it appears that I’m trying to lower case a null string, which throws an access denied exception.

service_crash_vs1

service_crash_vs2

More info: I’m analyzing the binary on the same machine where I compiled it. In other situations, you will have to tell VS where your symbols are. Also, in general I would probably use WinDbg to analyze the crashdump as I have way more experience with it. !analyze -v is your friend.

Posted in Uncategorized | Leave a comment

Performance Analysis Toolbox: Timeouts

Music is the space between the notes
– Claude Debussy

I recently spent some time looking at a slow shutdown issue. When product X was installed, the system was taking 34 seconds to shut down, vs 22 seconds when the product was not installed. Our automation team was able to reproduce this issue using the Full Boot Assessment in the Windows ADK. This gave us a very good place to start. In addition to a regular .etl trace file, the ADK also provides some extra information to help organize the millions of system events into a coherent narrative.

Image

After comparing this trace to a normal shutdown, it was clear the the IO Shutdown System section was much longer than normal — 10 seconds vs. close to 0 seconds in the normal case.

So, I got out my performance analysis toolbox and looked at:

– CPU usage
– Wait chains
– Disk usage

None of these approaches proved fruitful. There was very little CPU activity, nobody seemed to be waiting on anyone else, and disk usage was minimal. How could performance be impacted when nobody was doing anything? I reached out to a colleague and mentioned the major facts: slow shutdown, 10 second delay during the IO Shutdown phase, not much activity. His psychic response? “10 seconds? If it is exactly 10 seconds it sounds like a timeout”. I looked at a couple more traces and sure enough, the delay was always 10 seconds plus a tenth of a second or so.

We figured out the problem in the end but the big takeaway for me is a new tool for my performance analysis toolbox — timeout analysis, where the space between the notes is more important than the notes themselves.

Posted in Performance, Uncategorized | Tagged , , | Leave a comment