
Introduction
In some cases you need to know the best codepage (encoding) for either transferring
text over the internet or storing it in a text file. One could argue that Unicode always does the trick, but I needed the most efficient (byte-saving) way
to transfer data.
Detecting the codepage of a given text is a very tricky task. Luckily, Microsoft
provides the MLang API, in particular the IMultiLanguage3 interface, which handles outbound
encoding detection.
Analogously, the IMultiLanguage2 interface has a function to detect the encoding of an incoming byte array.
This is very handy for codepage detection of text stored in files or sent over the internet.
The EncodingTools class offers some easy to use functions to determine
the best encoding for different scenarios.
Background
The problem
I started this along with another component that constructs MIME-conformant emails.
The body of the email is passed as a String. The user had to provide the charset to use for the transfer encoding by hand. This is fine as long
as you know the target character set or always assume Unicode. But it is definitely
not a good solution if you have an end-user GUI application (most users do not
even know what an "encoding" is).
I wondered whether it is possible to detect the best encoding from the given text.
The dirty hack attempt
My first attempt was a simple brute-force attack:
- Build a list of suitable encodings (only ISO codepages and Unicode)
- Iterate over all considered encodings
- Encode the text using this encoding
- Decode it back to Unicode
- Compare the result with the original text to detect errors
- If there are no errors, remember the encoding that produced the fewest bytes
This is not only ugly, it does not even work properly. All single-byte encodings that can represent the text
produce a result of the same size: exactly one byte per character.
The codepage is only used to map the single bytes to the correct characters for display.
So this method can only distinguish between ASCII (7-bit), single-byte (8-bit) and the different Unicode flavors (UTF-7, UTF-8, UTF-16 etc.).
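The brute-force idea can be sketched roughly as follows. This is only an illustration, not the code of the original attempt: the names `BruteForceSketch` and `FindSmallest` are hypothetical, and the candidate list in the usage note is limited to encodings that .NET provides without extra registration.

```csharp
using System;
using System.Text;

public static class BruteForceSketch
{
    // Returns the first candidate that round-trips the text without loss
    // and produces the fewest bytes, or null if no candidate can encode it.
    public static Encoding FindSmallest(string text, string[] candidates)
    {
        Encoding best = null;
        int bestSize = int.MaxValue;
        foreach (string name in candidates)
        {
            // Exception fallbacks make unencodable characters throw
            // instead of being silently replaced by '?'.
            Encoding enc = Encoding.GetEncoding(name,
                EncoderFallback.ExceptionFallback,
                DecoderFallback.ExceptionFallback);
            try
            {
                byte[] bytes = enc.GetBytes(text);
                if (enc.GetString(bytes) == text && bytes.Length < bestSize)
                {
                    best = enc;
                    bestSize = bytes.Length;
                }
            }
            catch (EncoderFallbackException)
            {
                // This encoding cannot represent the text; skip it.
            }
        }
        return best;
    }
}
```

Called with the candidates "us-ascii", "iso-8859-1", "utf-8" and "utf-16", the sketch picks a single-byte encoding whenever one fits. But if several single-byte codepages are in the list, they all produce the same byte count, so the winner depends only on the order of the list. That is exactly the weakness described above.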
Finding something better
Then I remembered the IMultiLanguage2.DetectInputCodepage method that was introduced with Internet Explorer 5.0.
This method detects the encoding used in a text (Internet Explorer uses it for automatic codepage detection if the header is missing from a page).
Even though this was not suitable for my problem, I wondered whether there had been some development since version 5.0. A wrapper
function for DetectInputCodepage is provided in the EncodingTools class.
Since Internet Explorer 5.5 there is a new interface exported from the MLang DLL: IMultiLanguage3.
This is what MSDN says about this interface:
"This interface extends IMultiLanguage2 by adding outbound text detection functionality to it."
Wow! This sounded more than promising! The interface has only two methods:
- DetectOutboundCodePage (for strings)
- DetectOutboundCodePageInIStream (for streams)
I chose to use the first one.
Using MLang
MLang.dll is located in the Windows\system32 directory. Along with some exported functions,
it provides some COM classes but does not contain a type library. So the easy way
(Add Reference... in Visual Studio) did not work.
MLang.idl is part of the Platform SDK and can be found in its include directory.
To create an assembly from the IDL file, use the following commands from the Visual Studio Command Prompt:
c:\temp\>midl MLang.idl
C:\temp>midl MLang.idl > null
Microsoft (R) 32b/64b MIDL Compiler Version 6.00.0366
Copyright (c) Microsoft Corporation 1991-2002. All rights reserved.
MLang.idl
unknwn.idl
wtypes.idl
basetsd.h
guiddef.h
oaidl.idl
objidl.idl
oaidl.acf
C:\temp>tlbimp mlang.tlb /silent
The result of those two commands is a brand new assembly named MultiLanguage.dll.
Using Lutz Roeder's Reflector, I had a look at the signature:
[MethodImpl(MethodImplOptions.InternalCall, MethodCodeType=MethodCodeType.Runtime)]
void DetectOutboundCodePage([In] uint dwFlags,
[In, MarshalAs(UnmanagedType.LPWStr)] string lpWideCharStr,
[In] uint cchWideChar,
[In] ref uint puiPreferredCodePages,
[In] uint nPreferredCodePages,
[In] ref uint puiDetectedCodePages,
[In, Out] ref uint pnDetectedCodePages,
[In] ref ushort lpSpecialChar);
I was not so happy with the ref uint for the puiPreferredCodePages and puiDetectedCodePages parameters, because both are really arrays. A typed enum for dwFlags was also missing.
So I first exported the generated assembly to C# source code and then changed it a little:
[Flags]
public enum MLCPF
{
// Not currently supported.
MLDETECTF_MAILNEWS = 0x0001,
// Not currently supported.
MLDETECTF_BROWSER = 0x0002,
// Detection result must be valid for conversion and text rendering.
MLDETECTF_VALID = 0x0004,
// Detection result must be valid for conversion.
MLDETECTF_VALID_NLS = 0x0008,
// Preserve preferred code page order.
// This is meaningful only if you have set the puiPreferredCodePages parameter
// in IMultiLanguage3::DetectOutboundCodePage
// or IMultiLanguage3::DetectOutboundCodePageInIStream.
MLDETECTF_PRESERVE_ORDER = 0x0010,
// Only return one of the preferred code pages as the detection result.
// This is meaningful only if you have set the puiPreferredCodePages parameter
// in IMultiLanguage3::DetectOutboundCodePage
// or IMultiLanguage3::DetectOutboundCodePageInIStream.
MLDETECTF_PREFERRED_ONLY = 0x0020,
// Filter out graphical symbols and punctuation.
MLDETECTF_FILTER_SPECIALCHAR = 0x0040,
// Return only Unicode codepages if the euro character is detected.
MLDETECTF_EURO_UTF8 = 0x0080
}
[MethodImpl(MethodImplOptions.InternalCall, MethodCodeType=MethodCodeType.Runtime)]
void DetectOutboundCodePage([In] MLCPF dwFlags,
[In, MarshalAs(UnmanagedType.LPWStr)] string lpWideCharStr,
[In] uint cchWideChar,
[In] IntPtr puiPreferredCodePages,
[In] uint nPreferredCodePages,
[In] IntPtr puiDetectedCodePages,
[In, Out] ref uint pnDetectedCodePages,
[In] ref ushort lpSpecialChar);
Then I added the source files to my project (no more MultiLanguage.dll assembly
needed).
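With IntPtr parameters, the caller has to move the arrays to unmanaged memory and back. The following minimal sketch shows only that marshalling step; `UnmanagedArraySketch` and `RoundTrip` are hypothetical names, no MLang call is involved, and int[] is used because Marshal.Copy offers no uint[] overload.

```csharp
using System;
using System.Runtime.InteropServices;

public static class UnmanagedArraySketch
{
    // Copies the values into a CoTaskMem buffer and reads them back,
    // the way an IntPtr input array would be prepared and an
    // IntPtr output array would be evaluated after the call.
    public static int[] RoundTrip(int[] values)
    {
        IntPtr buffer = Marshal.AllocCoTaskMem(sizeof(int) * values.Length);
        try
        {
            // managed -> unmanaged (what happens before the COM call)
            Marshal.Copy(values, 0, buffer, values.Length);

            // unmanaged -> managed (what happens after the COM call)
            int[] result = new int[values.Length];
            Marshal.Copy(buffer, result, 0, result.Length);
            return result;
        }
        finally
        {
            // always release the unmanaged buffer, even on failure
            Marshal.FreeCoTaskMem(buffer);
        }
    }
}
```

The try/finally block matters here: a failing COM call throws a COMException, and without the finally the unmanaged buffer would leak.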
Using IMultiLanguage3::DetectOutboundCodePage
Getting an instance of the COM class implementing IMultiLanguage3 is straightforward:
// get the IMultiLanguage3 interface
MultiLanguage.IMultiLanguage3 multilang3 =
new MultiLanguage.CMultiLanguageClass();
if (multilang3 == null)
throw new System.Runtime.InteropServices.COMException("Failed to get IMultilang3");
The next step is to fill in the parameters.
The first parameter, dwFlags, is a combination of the MLCPF flags.
I chose to always set MLDETECTF_VALID_NLS because the result will be used for conversion.
MLDETECTF_PRESERVE_ORDER and MLDETECTF_PREFERRED_ONLY are set depending on the parameters passed to my detection method.
The next two parameters (lpWideCharStr and cchWideChar) are simply the string passed for detection and its length.
With the next two parameters (puiPreferredCodePages and nPreferredCodePages) the detection can be limited to a subset of all codepages.
This is very useful
if you only want to get back codepages from a certain set.
The last three parameters contain the result of the detection after the method has completed successfully.
So the actual call looks like this:
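int[] preferedEncodings; // codepages passed as parameter to the function (int[], as Marshal.Copy has no uint[] overload)
int[] resultCodePages = new int[preferedEncodings.Length]; // result array
uint detectedCodepages = (uint)resultCodePages.Length; // in: capacity, out: number of results
ushort specialChar = (ushort)'?';
// copy the arrays to unmanaged memory
IntPtr pPrefEncs = Marshal.AllocCoTaskMem(sizeof(uint) * preferedEncodings.Length);
IntPtr pDetectedEncs = Marshal.AllocCoTaskMem(sizeof(uint) * resultCodePages.Length);
try
{
    Marshal.Copy(preferedEncodings, 0, pPrefEncs, preferedEncodings.Length);
    // ... call the function
    multilang3.DetectOutboundCodePage(options, input, (uint)input.Length,
        pPrefEncs, (uint)preferedEncodings.Length,
        pDetectedEncs, ref detectedCodepages, ref specialChar);
    // evaluate the result
    if (detectedCodepages > 0)
    {
        Marshal.Copy(pDetectedEncs, resultCodePages, 0, (int)detectedCodepages);
        for (int i = 0; i < detectedCodepages; i++)
            result.Add(Encoding.GetEncoding(resultCodePages[i])); // add the result
    }
}
finally
{
    Marshal.FreeCoTaskMem(pPrefEncs);
    Marshal.FreeCoTaskMem(pDetectedEncs);
}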
Finally the COM object should be freed.
Marshal.FinalReleaseComObject(multilang3);
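The flag selection described above for dwFlags can be sketched as a small helper. This is an assumption about how such a helper might look, not the code from EncodingTools: `DetectFlags` is a hypothetical stand-in that mirrors three of the MLCPF values so the snippet compiles on its own, and `Build` and its parameter names are made up for illustration.

```csharp
using System;

// Hypothetical stand-in for three of the MLCPF values shown earlier.
[Flags]
public enum DetectFlags
{
    ValidNls = 0x0008,       // MLDETECTF_VALID_NLS
    PreserveOrder = 0x0010,  // MLDETECTF_PRESERVE_ORDER
    PreferredOnly = 0x0020   // MLDETECTF_PREFERRED_ONLY
}

public static class FlagSketch
{
    // Always requests a result that is valid for conversion and adds
    // the optional flags depending on what the caller asked for.
    public static DetectFlags Build(bool preserveOrder, bool preferredOnly)
    {
        DetectFlags flags = DetectFlags.ValidNls;
        if (preserveOrder) flags |= DetectFlags.PreserveOrder;
        if (preferredOnly) flags |= DetectFlags.PreferredOnly;
        return flags;
    }
}
```

Both optional flags are only meaningful when a list of preferred codepages is actually passed, as the comments in the enum point out.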
Using IMultiLanguage2::DetectInputCodepage
After being able to choose the best encoding to send a text over the internet
or save it to a stream, the next task was to detect the best encoding for incoming
text if the sender (or storer) did not choose the best encoding.
DetectInputCodepage has (at least) two practical uses. By default, Windows stores text files in the current default (UI) encoding.
On my system, for example, this is "Windows-1252". A user from Russia will write their text using "Windows-1251".
Both codepages are single-byte and do not have any preamble. So a text file will not
contain any information about the codepage used.
So if you open a text file that was created with a codepage different from the current UI codepage, a StreamReader will read the text as if it were stored in the current UI codepage.
(The encoding detection of the StreamReader is mostly a preamble check, so it will fail for almost any non-Unicode file and for Unicode files without a BOM.)
Most characters outside of the common ASCII charset will be displayed incorrectly.
This is where DetectInputCodepage comes in handy. Its accuracy is not 100%, but it is definitely better than that of the StreamReader.
In the demo application you can double-click on an encoding to test which method
gives the better result (see "Testing the DetectInputCodepage performance" below).
The other practical use is to detect the encoding of emails from badly implemented
MIME mailers. Some weird mailers send emails in 8-bit encoding without specifying
any character set in the header. In this case DetectInputCodepage can
help a lot.
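To make the limits of a pure preamble check concrete, here is a minimal sketch of such a check. It is not the StreamReader implementation, only an illustration of the principle; `PreambleSketch` and `FromPreamble` are hypothetical names.

```csharp
using System;
using System.Text;

public static class PreambleSketch
{
    // Returns the encoding indicated by a byte order mark,
    // or null if the data does not start with one.
    public static Encoding FromPreamble(byte[] data)
    {
        // UTF-32 LE must be tested before UTF-16 LE: both start with FF FE.
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Encoding.UTF32;
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return Encoding.UTF8;
        if (StartsWith(data, 0xFF, 0xFE)) return Encoding.Unicode;
        if (StartsWith(data, 0xFE, 0xFF)) return Encoding.BigEndianUnicode;
        return null; // no preamble: single-byte codepages always end up here
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }
}
```

Text stored in a single-byte codepage never carries a preamble, so a check like this cannot say anything about it. That gap is what DetectInputCodepage tries to fill by looking at the content itself.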
As with the DetectOutboundCodePage method, I changed the method signature a little and added the MLDETECTCP enumeration. The resulting code looks like this:
public enum MLDETECTCP {
// Default setting will be used.
MLDETECTCP_NONE = 0,
// Input stream consists of 7-bit data.
MLDETECTCP_7BIT = 1,
// Input stream consists of 8-bit data.
MLDETECTCP_8BIT = 2,
// Input stream consists of double-byte data.
MLDETECTCP_DBCS = 4,
// Input stream is an HTML page.
MLDETECTCP_HTML = 8,
//Not currently supported.
MLDETECTCP_NUMBER = 16
}
[MethodImpl(MethodImplOptions.InternalCall, MethodCodeType=MethodCodeType.Runtime)]
void DetectInputCodepage([In] MLDETECTCP flags, [In] uint dwPrefWinCodePage,
[In] ref byte pSrcStr, [In, Out] ref int pcSrcSize,
[In, Out] ref DetectEncodingInfo lpEncoding,
[In, Out] ref int pnScores);
The usage of the function is almost identical to the DetectOutboundCodePage described earlier.
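int maxEncodings; // parameter specifying how many encodings to return
// result array, sized for the requested number of encodings
MultiLanguage.DetectEncodingInfo[] detectedEncdings = new MultiLanguage.DetectEncodingInfo[maxEncodings];
int srcLen = input.Length; // length of the input
int scores = detectedEncdings.Length; // in: capacity of the result array, out: number of detected encodings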
// setup options (none)
MultiLanguage.MLDETECTCP options = MultiLanguage.MLDETECTCP.MLDETECTCP_NONE;
// finally... call to DetectInputCodepage
multilang2.DetectInputCodepage(options,0, ref input[0], ref srcLen,
ref detectedEncdings[0], ref scores);
// get result
if (scores > 0)
{
for (int i = 0; i < scores; i++)
{
// add the result
result.Add(Encoding.GetEncoding((int)detectedEncdings[i].nCodePage));
}
}
My first tests were not that promising. A COMException with E_FAIL was thrown every time I tried to detect a codepage.
DetectInputCodepage fails on short input that is not prefixed with a BOM (Byte
Order Mark, the encoding preamble). There are two kinds of failure. If the input data
is very short (less than 60 bytes), there is a good chance that the wrong codepage
will be detected. Below 200 bytes, there is a good chance that DetectInputCodepage will return E_FAIL because
it could not finally decide which codepage to use. For the latter problem I implemented
a nasty workaround: I simply repeat the input data until it fills 256 bytes. This seems
to return reasonable results even for short strings.
// expand the string to be at least 256 bytes
if (input.Length > 0 && input.Length < 256) // skip empty input to avoid dividing by zero
{
byte[] newInput = new byte[256];
int steps = 256 / input.Length;
for (int i = 0; i < steps; i++)
Array.Copy(input, 0, newInput, input.Length * i, input.Length);
int rest = 256 % input.Length;
if (rest > 0)
Array.Copy(input, 0, newInput, steps * input.Length, rest);
input = newInput;
}
Wrapping it all up
I decided to create a static class to provide access to the DetectOutboundCodePage and DetectInputCodepage methods.
It has some public methods that offer different levels of abstraction.
These are the six high-level methods that should cover most of the usage scenarios:
- GetMostEfficientEncoding
- GetMostEfficientEncodingForStream
- DetectInputCodepage
- ReadTextFile
- OpenTextFile
- OpenTextStream
It also has three public static arrays of predefined codepage sets:
- PreferedEncodings
- PreferedEncodingsForStream
- AllEncodings
These arrays do not list the codepages in their natural sort order, but in the order
that returns the best results.
Testing the DetectInputCodepage performance
The screenshot below shows a comparison of the StreamReader encoding detection and the EncodingTools detection. The sample texts come from Unicode.org.
In this test, all the samples were detected correctly.
Using the EncodingTools class
The following code snippets show how to use the
EncodingTools class.
Outgoing Encoding
Detect best encoding for a Stream
// save the given text using the optimal encoding
private void SaveToStream(string text, string path)
{
// this is all... detect the encoding
Encoding enc = EncodingTools.GetMostEfficientEncodingForStream(text);
// then save
using (StreamWriter sw = new StreamWriter(path, false, enc))
sw.Write(text);
}
Detect best encoding for an email body
// save the given text using the optimal encoding
private void SaveToAsEmail(string text, string path)
{
// this is all... detect the encoding
Encoding enc = EncodingTools.GetMostEfficientEncoding(text);
// then save
using (StreamWriter sw = new StreamWriter(path, false, Encoding.ASCII))
{
sw.WriteLine("Subject: test");
sw.WriteLine("Transfer-Encoding: 7bit");
sw.WriteLine("Content-Type: text/plain;\r\n\tcharset=\"{0}\"", enc.BodyName);
sw.WriteLine("Content-Transfer-Encoding: base64"); // should be QP
sw.WriteLine();
sw.Write(Convert.ToBase64String(enc.GetBytes(text),Base64FormattingOptions.InsertLineBreaks));
}
}
Incoming Encoding
Open a Text File
private void OpenTextFileTest()
{
// read the complete file into a string
string content = EncodingTools.ReadTextFile(@"C:\test\txt");
// create a StreamReader with the guessed best encoding
using (StreamReader sr = EncodingTools.OpenTextFile(@"C:\test\txt"))
{
string fileContent = sr.ReadToEnd();
}
}
Reading from a Stream
private void ReadStreamTest()
{
// create a streamReader from a stream
using (MemoryStream ms = new MemoryStream(
Encoding.GetEncoding("windows-1252").GetBytes("Some umlauts: öäüß")))
{
using (StreamReader sr = EncodingTools.OpenTextStream(ms))
{
string fileContent = sr.ReadToEnd();
}
}
}
References
- MLang documentation on MSDN
History
- 17/01/2007: initial release