Character Encoding in Agents
Well, I escaped Disneyland with my sanity intact. I spent this weekend aimlessly wandering Paris. What an amazing city. I can't believe I've left it this long to visit. I plan on returning with Karen soon, and with more of a plan for how to spend my time here.
Today I am in La Défense, which seems a bit like London's Canary Wharf. I'm spending a couple of days working on site for Saint-Gobain, a long-time client of mine. It seemed rude to come to Paris and not call in.
First task today was to fix the RSS feed I added to a database of theirs. This really had me stumped for a while. I'll share it with you in case it helps you out one day.
The feed is an Agent that churns out XML for documents in the RSS format. This worked fine until special characters were included. Me being English, I foolishly neglected to test it with things like accented characters. Needless to say, them being French, they found the bug straight away. The bug was that IE would not render the XML if special characters were used. It said "An invalid character was found in text content."
The confusing part of it was that the XML for a view containing the same information as the feed would display fine when opened with a ?ReadViewEntries URL command. My task was to find out the difference and, hence, the reason why.
First thing I did was save the source of both pages - the one that worked and the one that didn't. I then used TextPad to find the difference. After messing about for ages I realised there was no difference in the text, so I took a look at the document properties (Alt + Enter in TextPad). Herein lay the problem. The file saved from my Agent output had a codeset of ANSI and a UNIX file-type. The file saved from Domino's XML had UTF-8 and PC respectively. My code was using the wrong character encoding!
Here's the first two lines of code from the Agent:
Print {Content-Type:text/xml}
Print {<?xml version="1.0" encoding="UTF-8" ?>}
The first line tells the browser to expect XML, but not what character set to expect. Without a charset, the browser falls back to its default encoding, which is why it complains as soon as the output contains anything beyond plain ASCII. To fix the problem I made sure the Agent told the browser to expect the same charset as the XML used, by changing the first line like so:
Print {Content-Type:text/xml;charset=utf-8}
Print {<?xml version="1.0" encoding="UTF-8" ?>}
Moral of the story: If you're going to output XML from an Agent, make sure it outputs the text in the same encoding as specified in the XML declaration. And, of course, if your client is French, be sure to test.
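To put the fix in context, here's a minimal sketch of what such a feed Agent might look like. The view name ("Articles") and the Subject item are hypothetical, not the actual Saint-Gobain code - the point is just that the charset in the HTTP header and the encoding in the XML declaration agree:

```lotusscript
Sub Initialize
	Dim session As New NotesSession
	Dim db As NotesDatabase
	Dim view As NotesView
	Dim doc As NotesDocument

	Set db = session.CurrentDatabase
	Set view = db.GetView("Articles") ' hypothetical view name

	' Tell the browser the charset in the header AND the XML
	' declaration - and make sure the two match.
	Print {Content-Type:text/xml;charset=utf-8}
	Print {<?xml version="1.0" encoding="UTF-8" ?>}
	Print {<rss version="2.0"><channel>}

	Set doc = view.GetFirstDocument
	While Not (doc Is Nothing)
		' A real feed would also need to escape &, < and > in item text
		Print {<item><title>} + doc.Subject(0) + {</title></item>}
		Set doc = view.GetNextDocument(doc)
	Wend

	Print {</channel></rss>}
End Sub
```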
Yes, these bugs tend to be pretty nasty.
I had some really bad issues with Notes rich text fields. NotesRichTextRange.FindAndReplace just kept adding some garbage at paragraph ends.
It turned out to be a bug which appears only when you are running on a multi-byte language system. Which happens to be my case. Apparently nobody at IBM noticed for some time, since they mostly run English systems.
So it is not only you who needs more testing in multinational environments...
Just great. Saint-Gobain is a customer of mine as well, but where is my contact located? Indiana. Sheesh, I want the Paris branch.
On the technical issue, it is very easy to get caught with character encoding issues, especially since in some cases it isn't an explicit character set, but an implicit one. Ugh!
Josef - If that is still a bug, our Midas Rich Text LSX will do search and replace with different character sets. Just an FYI.
Thanks for blogging this! The Dutch language uses accented characters as well and I ran into the same problem. Saves me a lot of time, better spent on learning French! :)
Thanks for the tip, also very helpful when using ViewTemplates for an XML view!