Wednesday, October 22, 2008

Encodings Can Byte You

It is 2008, right? In this day and age, it's not totally unreasonable to expect NSString to fall back to NSUTF8StringEncoding when it loads a file and can't determine the encoding, right? After all, a pure ASCII file loads correctly as UTF-8, but not vice versa, so UTF-8 seems like the best assumption or, at least, a better assumption than ASCII.

Well, it might be a "reasonable" assumption, but it's an incorrect one. NSString's initWithContentsOfFile: apparently assumes an encoding of NSASCIIStringEncoding when it can't determine the real encoding.

This, I'm embarrassed to say, bit me today. I had the following code that I was using to load a UTF-8 file with non-ASCII characters:

NSString *sourcePath = [[NSBundle mainBundle] pathForResource:@"act texts" ofType:@"txt"];
NSString *sourceData = [[NSString alloc] initWithContentsOfFile:sourcePath];

But the resulting string didn't contain the correct non-ASCII characters that were in the file; every time there was supposed to be a diacritical or other non-ASCII character, the string contained two high-order characters (values above 127), one for each byte of the UTF-8 sequence. That's an indication that you've got UTF-8 data being loaded as a single-byte encoding such as ASCII (UTF-16 looks even weirder when the encoding is missed, making it easier to catch).
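To see that symptom in isolation, here is a minimal sketch that takes the UTF-8 bytes of a string and deliberately decodes them with a single-byte encoding. Two things here are assumptions and not from the code above: the helper name MisreadUTF8AsLatin1 is made up for illustration, and NSISOLatin1StringEncoding is only a stand-in for whatever single-byte encoding the loader actually falls back to.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper for illustration only: takes the UTF-8 bytes of
// `text` and decodes them as ISO Latin-1, one character per byte.
// Latin-1 is an assumed stand-in for the loader's fallback encoding.
static NSString *MisreadUTF8AsLatin1(NSString *text) {
    // UTF8String returns a NUL-terminated buffer of the string's UTF-8 bytes.
    const char *utf8Bytes = [text UTF8String];

    // Every byte maps to exactly one Latin-1 character, so each multi-byte
    // UTF-8 sequence turns into several high-order characters.
    return [NSString stringWithCString:utf8Bytes
                              encoding:NSISOLatin1StringEncoding];
}
```

Passing in the four-character string "café" gives back five characters, "cafÃ©", because the é is stored as the two bytes 0xC3 0xA9 in UTF-8 and each byte is decoded on its own.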

The solution is simple enough. Just explicitly tell it the encoding to use by calling initWithContentsOfFile:encoding:error: instead:

NSString *sourcePath = [[NSBundle mainBundle] pathForResource:@"act texts" ofType:@"txt"];
NSString *sourceData = [[NSString alloc] initWithContentsOfFile:sourcePath encoding:NSUTF8StringEncoding error:nil];
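Passing nil for the error argument means a failed load just gives you a nil string with no explanation. A slightly more defensive sketch, assuming you want the reason logged, might look like the following. The helper name LoadUTF8TextFile is hypothetical, and it uses the stringWithContentsOfFile:encoding:error: class method in place of alloc/init so the caller doesn't own the result.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the code above): loads a text file as
// UTF-8 and logs the reason if reading or decoding fails.
static NSString *LoadUTF8TextFile(NSString *path) {
    NSError *error = nil;
    NSString *contents = [NSString stringWithContentsOfFile:path
                                                   encoding:NSUTF8StringEncoding
                                                      error:&error];
    if (contents == nil) {
        // nil means the file could not be read or could not be decoded
        // as UTF-8; the NSError carries the details.
        NSLog(@"Could not load %@ as UTF-8: %@", path, error);
    }
    return contents;
}
```

With that in place, the loading code above becomes a single call, LoadUTF8TextFile(sourcePath), and a nil result comes with a logged reason.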