arsd.characterencodings

This module is meant to help get data from the wild into UTF-8 strings so you can work with them easily inside D.

The main function is convertToUtf8(), which takes a byte array of your raw data (a byte array because it isn't really a D string until it is UTF-8) and a runtime string giving its current encoding.

The current encoding argument is meant to come from the data's metadata, and is flexible on exact format - it is case insensitive and takes several variations on the names.

This way, you should be able to send it the encoding string directly from an XML document, an HTTP header, or whatever you have, and it ought to just work.
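
As a rough illustration of where such an encoding string might come from, the sketch below pulls the charset parameter out of an HTTP Content-Type header value using only Phobos. The helper name charsetFromContentType is hypothetical and not part of this module; the returned label is the kind of string you would then hand to convertToUtf8 as its second argument.

```d
import std.string : indexOf, strip;
import std.typecons : No;

/// Hypothetical helper (not part of arsd.characterencodings):
/// returns the charset label from a Content-Type header value,
/// or null if the header carries no charset parameter.
string charsetFromContentType(string headerValue)
{
    enum key = "charset=";

    // Parameter names are case-insensitive, so search without case sensitivity.
    auto idx = headerValue.indexOf(key, No.caseSensitive);
    if (idx < 0)
        return null;

    auto rest = headerValue[idx + key.length .. $];

    // Cut off any parameters that follow the charset.
    auto end = rest.indexOf(';');
    if (end >= 0)
        rest = rest[0 .. end];

    rest = rest.strip();

    // The value may be quoted: charset="utf-8"
    if (rest.length >= 2 && rest[0] == '"' && rest[$ - 1] == '"')
        rest = rest[1 .. $ - 1];

    return rest;
}
```

Because the encoding argument is case insensitive and accepts several name variations, the label can be passed along as found, without further cleanup.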

Members

Functions

convertToUtf8
string convertToUtf8(immutable(ubyte)[] data, string dataCharacterEncoding)

Takes data from a given character encoding and returns it as UTF-8.

convertToUtf8Lossy
string convertToUtf8Lossy(immutable(ubyte)[] data, string dataCharacterEncoding)

Like convertToUtf8, but if the encoding is unknown, it strips all bytes above 127 and returns what is left instead of throwing.
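
To make the fallback concrete, here is a minimal sketch of what stripping everything above 127 means for the output. It is an illustration only, not the module's own code, and stripNonAscii is a hypothetical name.

```d
/// Hypothetical illustration of the lossy fallback: keep only
/// bytes in the ASCII range (0..127) and drop everything else.
string stripNonAscii(const(ubyte)[] data)
{
    string result;
    foreach (b; data)
    {
        // Bytes 0..127 are valid single-byte UTF-8 on their own.
        if (b <= 127)
            result ~= cast(char) b;
    }
    return result;
}
```

The result is always valid UTF-8, but any non-ASCII characters in the input are silently lost, so this only makes sense when readable ASCII is better than an exception.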

decodeImpl
string decodeImpl(ubyte[] data, dchar[] chars160to255, dchar[] chars128to159, dchar[] chars0to127)
Undocumented in source.

tryToDetermineEncoding
string tryToDetermineEncoding(ubyte[] rawdata)

Tries to determine the current encoding based on the content. Only really helps with the UTF variants. Returns null if it can't be reasonably sure.
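
The heuristics this function uses are not described here. As a point of comparison, one common content-based signal for the UTF variants is the byte order mark; the sketch below checks for one and, like the function above, returns null when it cannot tell. The names sniffBom and hasPrefix are hypothetical, and this is not claimed to be what tryToDetermineEncoding does internally.

```d
/// True if data begins with the given byte sequence.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Hypothetical BOM sniffer: returns an encoding label, or null
/// when no byte order mark is present.
string sniffBom(const(ubyte)[] raw)
{
    if (hasPrefix(raw, [0xEF, 0xBB, 0xBF]))
        return "UTF-8";

    // UTF-32 must be tested before UTF-16: the UTF-32 LE mark
    // starts with the same two bytes as the UTF-16 LE mark.
    if (hasPrefix(raw, [0xFF, 0xFE, 0x00, 0x00]))
        return "UTF-32LE";
    if (hasPrefix(raw, [0x00, 0x00, 0xFE, 0xFF]))
        return "UTF-32BE";

    if (hasPrefix(raw, [0xFF, 0xFE]))
        return "UTF-16LE";
    if (hasPrefix(raw, [0xFE, 0xFF]))
        return "UTF-16BE";

    return null; // not reasonably sure
}
```

A null result is the cue to fall back to metadata, such as an HTTP header or XML declaration, or to a default chosen by the caller.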

Variables

ISO_8859_1
dchar[] ISO_8859_1;
Undocumented in source.
ISO_8859_10
dchar[] ISO_8859_10;
Undocumented in source.
ISO_8859_11
dchar[] ISO_8859_11;
Undocumented in source.
ISO_8859_13
dchar[] ISO_8859_13;
Undocumented in source.
ISO_8859_14
dchar[] ISO_8859_14;
Undocumented in source.
ISO_8859_15
dchar[] ISO_8859_15;
Undocumented in source.
ISO_8859_16
dchar[] ISO_8859_16;
Undocumented in source.
ISO_8859_2
dchar[] ISO_8859_2;
Undocumented in source.
ISO_8859_3
dchar[] ISO_8859_3;
Undocumented in source.
ISO_8859_4
dchar[] ISO_8859_4;
Undocumented in source.
ISO_8859_5
dchar[] ISO_8859_5;
Undocumented in source.
ISO_8859_6
dchar[] ISO_8859_6;
Undocumented in source.
ISO_8859_7
dchar[] ISO_8859_7;
Undocumented in source.
ISO_8859_8
dchar[] ISO_8859_8;
Undocumented in source.
ISO_8859_9
dchar[] ISO_8859_9;
Undocumented in source.
KOI8_R
dchar[] KOI8_R;
Undocumented in source.
KOI8_R_Lower
dchar[] KOI8_R_Lower;
Undocumented in source.
Windows_1251
dchar[] Windows_1251;
Undocumented in source.
Windows_1251_Lower
dchar[] Windows_1251_Lower;
Undocumented in source.
Windows_1252
dchar[] Windows_1252;
Undocumented in source.
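
These tables are undocumented, so their exact layout is not stated here. The sketch below shows the general idea behind table-driven decoding of a single-byte encoding, assuming a table that holds one dchar per byte value in a contiguous range. The table cp1252High and the function decodeSingleByte are hypothetical stand-ins written for this example, not the module's own data or code.

```d
/// Hypothetical table for bytes 0x80..0x9F in Windows-1252.
/// U+FFFD marks byte values the encoding leaves undefined.
immutable dchar[32] cp1252High = [
    '\u20AC', '\uFFFD', '\u201A', '\u0192', '\u201E', '\u2026', '\u2020', '\u2021',
    '\u02C6', '\u2030', '\u0160', '\u2039', '\u0152', '\uFFFD', '\u017D', '\uFFFD',
    '\uFFFD', '\u2018', '\u2019', '\u201C', '\u201D', '\u2022', '\u2013', '\u2014',
    '\u02DC', '\u2122', '\u0161', '\u203A', '\u0153', '\uFFFD', '\u017E', '\u0178',
];

/// Decodes Windows-1252 bytes to a UTF-8 string by table lookup.
string decodeSingleByte(const(ubyte)[] data)
{
    string result;
    foreach (b; data)
    {
        dchar c;
        if (b >= 0x80 && b <= 0x9F)
            c = cp1252High[b - 0x80]; // range covered by the table
        else
            c = b; // 0x00..0x7F and 0xA0..0xFF match their Unicode code points

        result ~= c; // appending a dchar to a string encodes it as UTF-8
    }
    return result;
}
```

The same shape works for any single-byte encoding: only the table contents and the byte range they cover change.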

Examples

import std.file;
import arsd.characterencodings;

auto data = cast(immutable(ubyte)[]) std.file.read("my-windows-file.txt");
string utf8String = convertToUtf8(data, "windows-1252");
// utf8String can now be used

The encodings currently implemented for decoding are: UTF-8 (a no-op; it simply casts the array to string), UTF-16, UTF-32, Windows-1252, and ISO 8859 parts 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, and 16.

It treats ISO 8859-1, Latin-1, and Windows-1252 the same way, since those labels are pretty much de facto the same thing in documents found in the wild.
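
One way to picture this kind of label handling is a small normalization step that ignores case and punctuation and then folds known aliases together. The sketch below is an assumption about how such matching could be done, not the module's actual logic; normalizeLabel and the alias list are hypothetical.

```d
import std.ascii : isAlphaNum, toLower;

/// Hypothetical label normalizer: lowercases, drops punctuation,
/// and folds the Latin-1 family of labels into "windows-1252".
string normalizeLabel(string label)
{
    // "ISO_8859-1" and "iso 8859 1" both become "iso88591".
    string key;
    foreach (char c; label)
    {
        if (isAlphaNum(c))
            key ~= toLower(c);
    }

    switch (key)
    {
        case "latin1", "iso88591", "windows1252", "cp1252":
            return "windows-1252";
        default:
            return key;
    }
}
```

Folding ISO 8859-1 into Windows-1252 is the forgiving choice: the two agree everywhere except bytes 128 to 159, where Windows-1252 assigns printable characters that mislabeled documents often rely on.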

This module currently makes no attempt to look at control characters.

Meta