Knowledge base survey

To provide a basis for designing the description language, the Microsoft Developer Network (MSDN) knowledge base was used. The knowledge base is available online, and part of it is released to subscribers on the MSDN CDs.

A single MSDN release (January 1999) was taken and its knowledge base articles were classified. This proved to be far from trivial: there is no external classification of articles, so the only thing to rely on was the text inside each article. The text used to classify articles is shown in the following image.


Win32 and MFC articles were selected using the "applies to" section, and the type of each article was taken from the first characters of the article title, up to the colon (:). This left us with 935 articles in the Win32 collection and 1045 articles in the MFC collection.

There are six categories that MSDN articles can be filed under: BUG, FIX, PRB, INFO, HOW-TO and DOC. The first category, BUG, is exactly as expected: accidental defects that Microsoft knows about and is investigating for a possible future fix. When the defect is fixed, the article metamorphoses into a FIX by a slight change to its title (the ID number stays the same). FIX articles generally remain in the knowledge base as reminders that a BUG has been fixed in a later release and that the user should upgrade. In a slightly different vein, problem (PRB) articles cover typical programmer coding errors, or product-design issues that appear to be bugs but are in fact “by design”. HOW-TO articles demonstrate a programming principle or technique using sample code. The final two categories, information (INFO) and documentation (DOC), contain descriptions, information and hints. Whilst not unlike HOW-TOs, INFO and DOC articles rarely contain source code because the areas they address are either very specialized (e.g. the meaning of a registry key) or concern the development environment itself (e.g. changing the text size of tab labels). If a distinction can be made between INFO and DOC, it is that DOC articles generally refer to printed or online documentation whereas INFO articles record implied knowledge. In practice, however, these categories are used interchangeably.

For each collection, we attempted to categorize each article using the initial characters of the title string (see above). Although this produced reasonable results, it is by no means a perfect categorization: some articles simply don’t have the type prefix in the title (we categorize these as OTHER) and some are simply misfiled. In addition, there is little consistent information to distinguish runtime defects from non-runtime defects. Some articles use the “BUG: Cxxxx” or “BUG: LNKxxxx” syntax in their titles to signify compiler or linker defects, but it is not used consistently and does not cover other non-runtime defects (e.g. IDE, resources).
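The title-prefix rule described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the survey: the function name `classify_title`, the acceptance of "HOWTO" as a spelling variant of "HOW-TO", and the four-digit form assumed for the Cxxxx and LNKxxxx codes are all assumptions.

```python
import re

# Type prefixes named in the survey. Accepting "HOWTO" without the hyphen
# is an assumption about how titles may be spelled.
KNOWN_TYPES = {"BUG", "FIX", "PRB", "INFO", "HOW-TO", "DOC"}
ALIASES = {"HOWTO": "HOW-TO"}

# Compiler (Cxxxx) or linker (LNKxxxx) code directly after the prefix.
# The four-digit length is an assumption.
TOOL_CODE = re.compile(r"^(C|LNK)\d{4}\b")


def classify_title(title):
    """Return (article_type, tool) for a knowledge base article title.

    article_type is one of KNOWN_TYPES, or "OTHER" when the title has no
    recognized prefix before the first colon. tool is "compiler", "linker"
    or None.
    """
    prefix, colon, rest = title.partition(":")
    if not colon:
        return "OTHER", None  # no colon at all, so no type prefix

    article_type = prefix.strip().upper()
    article_type = ALIASES.get(article_type, article_type)
    if article_type not in KNOWN_TYPES:
        return "OTHER", None  # text before the colon is not a known type

    match = TOOL_CODE.match(rest.strip())
    if match is None:
        return article_type, None
    return article_type, "compiler" if match.group(1) == "C" else "linker"
```

A rule like this can only flag the compiler and linker cases that follow the title convention; misfiled articles and other non-runtime defects (IDE, resources) would still need manual review.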

Between the two collections, most categories are comparable in their relative percentages of articles. The exceptions are “Fixes” and “Information”, which differ significantly. These differences could be attributed to the maturity of the technologies: the longer a technology has been in use, the more scrutiny it has received (Win32 is a much older technology than MFC). We would therefore expect to see fewer new fixes in the Win32 collection than in that of a newer technology (2% compared with 22% in MFC). Conversely, as programmers learn how to use a technology, the exchange of information (how to perform certain tasks, potential pitfalls, etc.) should increase (29% in Win32 compared with 7% in MFC).
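The comparison of relative percentages can be illustrated with a short sketch. The helper names `category_shares` and `share_gaps` are hypothetical, and the only figures used in the usage example are the four percentages quoted above; the per-category article counts are not reproduced here.

```python
from collections import Counter


def category_shares(labels):
    """Map each category label to its percentage share of the collection."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


def share_gaps(shares_a, shares_b):
    """Percentage-point gap (a minus b) per category, largest gap first."""
    categories = set(shares_a) | set(shares_b)
    gaps = {c: shares_a.get(c, 0.0) - shares_b.get(c, 0.0) for c in categories}
    return sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True)


# Usage with the percentages quoted in the text (Win32 first, then MFC).
win32 = {"FIX": 2.0, "INFO": 29.0}
mfc = {"FIX": 22.0, "INFO": 7.0}
for category, gap in share_gaps(win32, mfc):
    print(f"{category}: {gap:+.1f} percentage points (Win32 minus MFC)")
```

Working in percentage points rather than raw counts keeps the two collections comparable even though they differ in size.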

To explain the differences, below is a classic graph showing that, over time, the number of defects in a given software technology levels out. Whereas the defect rate in hardware can increase as it gets older (due to wear-out), the defect rate of software should continually fall as defects are found and fixed. Barring change episodes, for example maintenance or new features, this decline should continue until obsolescence.

To seek evidence for this hypothesis, we analyzed the available back catalogue of MSDN libraries: 8 releases over 2 years.

One of the most surprising results of the survey was that the size of the knowledge base was being actively managed. Rather than simply being added cumulatively, articles were frequently re-classified or dropped altogether. The MFC collection remained reasonably static at around 1100 articles and the Win32 collection at around 900 articles.

 

From this data we can see three underlying trends. The first, and perhaps most obvious, is that the number of unclassified articles within both MFC and Win32 has been falling. This is even more apparent in the stacked bar graphs below, where unclassified articles are frequently removed and new, better categorized articles are added. The second trend concerns the number of bugs or problems being recorded, and ties in with the general observation that the number of defects in software levels out over time. The number of defects within Win32 remains reasonably constant, a characteristic of mature technologies, with virtually no fixes being added. The defect rate within MFC, on the other hand, is still rising (albeit quite slowly), signifying that it is still an immature technology. Associated fixes are being carried out on MFC, but their frequency is starting to diminish. Finally, corresponding with our observation on the flow of information for older technologies, the number of articles containing information for Win32 has been rising consistently (the only category within Win32 that has changed significantly). A similar trend appears to be starting within MFC, but there information seems to be passed on through practical examples (HOWTO articles) rather than as information or documentation.

This could turn out to be interesting further work, but the available issues of the MSDN library were not enough to reach significant conclusions: there are hints, which I've tried to draw attention to above, but nothing conclusive. This, combined with the problems of parsing out the correct article types, left me feeling that, with the small number of releases I had available, a good study was just out of reach.

I'd really like to continue with this part of the work, so if any readers have old MSDN library (or TechNet) CDs and don't mind lending them to me (I'd gladly return them if you want them back), I might be able to gather a better sample population.

 

 
