Archive for May, 2008
ColdFusion Including Sub-Applications
Posted by Eric in ColdFusion, Programming on May 30th, 2008
Ben Nadel has an interesting question on his blog about including sub-applications from within an existing CF application, and having the relevant sub-level Application.cfc fire off.
This is doable in a fairly simple manner, but it relies on a barely-documented feature of ColdFusion. The fact that the sub-level Application.cfc fires at all is completely undocumented, and may even be unintentional!
Here’s how you can do it, but because we’re wandering into a pretty hazy gray area here, I wouldn’t go using this unless you don’t have much other choice.
Application.cfc:
Then you can do the below, and if the included sub-application has an Application.cfc, that Application.cfc will be invoked.
The caveat though is that CGI scope will still contain the variables from the source page – for example, CGI.SCRIPT_NAME will still be the script name in the URL. As a result, context-sensitive functions like ExpandPath() will operate relative to the root file being called – meaning you might not get the results you’re expecting.
Also, the code above will only work one sub-application level deep; you’d have to tweak it if you wanted a sub-application within a sub-application, but by that point, zounds, what are you doing man?!?
The good news is that even if the sub-application executes a <cfabort>, execution will return to the calling page, so that sub-app can’t abort your own page.
Established Programmers Getting Started in Web Development
Posted by Eric in Programming on May 21st, 2008
An acquaintance of mine has recently graduated from college with a degree in computer science. Where he went, apparently they stressed theory over practice. He’s got some C++, Lisp, Java, and MySQL experience, but has never touched web development (he hasn’t written HTML, or even viewed the source of a web page).
He’s hoping to get a job with a company that has him supporting Java applications with web front ends, so to do that he’ll need to get at least tolerable with HTML. He’s willing to go to training if there’s a course that’ll help, but the obvious concern is that he goes to a 5 day training and spends a significant amount of time covering “Web pages are viewed in web browsers” type material – that is to say, he’s concerned that intro to web development courses are going to be tailored to people looking to set up a family-newsletter or small-business-created-by-the-owner type site.
I’m recommending he start with the W3C tutorials: http://www.w3.org/2002/03/tutorials , which should get him a functional early level of knowledge. I suggested he could start out by trying to make a personal site in HTML without leaning on an HTML editor; hopefully this gets him the rudimentary skills he needs. Other suggestions are welcome!
WebScarab-NG
Posted by Eric in Debugging, Programming on May 9th, 2008
WebScarab-NG is a really amazing tool that Brian introduced me to a few months back. It’s essentially a local proxy which you can use to capture the full details of HTTP requests traveling through it. It listens by default on port 8008 on your local address, and you can configure any software to use that port as a proxy.
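As a rough illustration of pointing software at such a proxy, here is a minimal Python sketch. It assumes the proxy is listening on 127.0.0.1:8008 (the default port mentioned above; the loopback address is my assumption for "your local address"), and the helper name build_proxied_opener is made up for this example. It only builds the opener; no request is sent until you call open() on it.

```python
import urllib.request

# Assumption: the proxy listens on the loopback address, default port 8008.
PROXY_URL = "http://127.0.0.1:8008"


def build_proxied_opener(proxy_url: str = PROXY_URL) -> urllib.request.OpenerDirector:
    """Return an opener that routes plain HTTP requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxied_opener()
# With the proxy running, opener.open("http://example.com/") would pass
# through it, so the full request and response could be captured there.
```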
If you choose, you can even configure it to intercept requests and responses, and it allows you to modify the data on the fly – really useful when you want to test fault circumstances.
How UTF-8 Encoding works
Posted by Eric in Programming on May 3rd, 2008
I spoke yesterday about Unicode, and the difference between the Unicode character set and specific encodings of this character set. This post is a follow-up which describes in detail one particular and popular encoding of that character set – UTF-8.
It’s important to understand that encoding is simply a means of representing a Unicode character in terms of bytes on a disk or in a network stream. In the same sense, the character λ can be encoded in HTML as one of: &lambda; , &#955; , or &#x3BB; . These HTML entities are another form of character encoding.
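To make the HTML comparison concrete, here is a small Python sketch using the standard library's html.unescape. It shows the named, decimal, and hexadecimal entity forms all decoding to the same Unicode character, U+03BB. The variable names are just for illustration.

```python
import html

# Three HTML spellings of the same character: named, decimal, hexadecimal.
entities = ["&lambda;", "&#955;", "&#x3BB;"]

# html.unescape turns each entity back into the character it stands for.
decoded = [html.unescape(entity) for entity in entities]

for entity, char in zip(entities, decoded):
    print(entity, "->", char, hex(ord(char)))
```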
To understand how UTF-8 encoding works, you need to be really certain that a byte and a character are different things. Just as with encoding special characters in HTML, a character in UTF-8 can be made up of one or more bytes. Characters encoded with more than one byte are called multi-byte characters, because they require 16, 24, or 32 bits to represent, corresponding of course to 2, 3, and 4 bytes. (The original UTF-8 design allowed longer sequences, but the current standard, RFC 3629, caps a character at 4 bytes.)
Because each character is not made up of a fixed number of bytes, UTF-8 is called a variable-width encoding. Variable-width encodings are popular because they can reduce the storage size of the most common members of that encoding. For example, this page is encoded in UTF-8, and yet almost every character in its source code will only require a single byte.
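A quick way to see the variable width in practice is to encode a few characters and count the bytes. This is a minimal Python sketch; the sample characters are my own picks, written as escapes so the snippet does not depend on the file's own encoding.

```python
# Sample characters chosen for illustration, one per UTF-8 width.
samples = {
    "A": "A",                 # U+0041 LATIN CAPITAL LETTER A
    "e-acute": "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "euro": "\u20ac",         # U+20AC EURO SIGN
    "emoji": "\U0001F600",    # U+1F600 GRINNING FACE
}

# Each entry is a single character, but the encoded byte count differs.
widths = {name: len(char.encode("utf-8")) for name, char in samples.items()}

for name, width in widths.items():
    print(name, "takes", width, "byte(s) in UTF-8")
```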
The UTF-8 encoding is popular especially because, for speakers of languages written in the Latin alphabet (Europe, USA, et al.), it only requires one byte per character for our most common printed characters. Also, unlike some other character encodings, existing text encoded as ASCII (the old IBM PC days of 7 bits per character – the first 128 characters in an ASCII table) translates into UTF-8 without any need for conversion. UTF-8 also supports a very large range of characters (its character repertoire).
To understand the way UTF-8 works, we have to examine the binary representation of each byte. If the first bit (the high-order bit) is zero, then it’s a single-byte character, and we can directly map its remaining bits to the Unicode characters 0 – 127. If the first bit is a one, then this byte is a member of a multi-byte character (either the first byte or some follow-up byte of it).
For a multi-byte character (any character whose Unicode number is 128 or above), we need to know how many bytes will make up this character. This is stored in the leading bits of the first byte in the character. We can identify how many total bytes will make up this character by counting the number of leading 1’s before we encounter the first 0. Thus, for the first byte in a multi-byte character, 110xxxxx represents a two-byte character, 1110xxxx represents a three-byte character, and so on.
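The leading-ones rule can be sketched in a few lines of Python. The function name sequence_length is hypothetical, and the rejection of anything longer than four bytes reflects the current UTF-8 standard (RFC 3629) rather than anything stated in this post.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes make up the character that starts with lead_byte."""
    if lead_byte & 0b10000000 == 0:
        return 1  # high bit is zero: a single-byte character (0 - 127)

    # Count the leading 1 bits before the first 0.
    count = 0
    mask = 0b10000000
    while lead_byte & mask:
        count += 1
        mask >>= 1

    if count == 1:
        raise ValueError("10xxxxxx is a follow-up byte, not a first byte")
    if count > 4:
        raise ValueError("modern UTF-8 has no sequences longer than 4 bytes")
    return count


print(sequence_length(0b11001110))  # 110xxxxx: two-byte character
print(sequence_length(0b11100010))  # 1110xxxx: three-byte character
```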
For secondary bytes in a multi-byte UTF-8 encoded character, the first two bits are reserved to indicate that this is a follow-up byte. All second and subsequent bytes are of the form 10xxxxxx. It may seem wasteful to sacrifice two bits of each follow-up byte since by the time we reach this byte, we already know that it’s part of a multi-byte character. Indeed, it is wasteful, but it is intentional, and intended to address systems which do not properly handle multi-byte character encodings. For example, this enables us to identify if a string was encoded with an alternate character encoding but is being parsed as UTF-8. It also enables us to identify strings which were incorrectly truncated.
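Here is a small Python sketch of both checks, leaning on the built-in UTF-8 decoder to do the validation. The helper names is_continuation and decodes_as_utf8 are made up for this example, and ISO-8859-1 stands in for "an alternate character encoding".

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the 10xxxxxx follow-up pattern."""
    return byte & 0b11000000 == 0b10000000


def decodes_as_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 according to Python's decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


euro = "\u20ac".encode("utf-8")           # three bytes: E2 82 AC
truncated = euro[:2]                      # last follow-up byte chopped off
latin1 = "\u00e9".encode("iso-8859-1")    # single byte E9, not valid UTF-8

print(decodes_as_utf8(euro))       # complete character
print(decodes_as_utf8(truncated))  # truncated character is detected
print(decodes_as_utf8(latin1))     # wrong encoding is detected
```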
So characters in UTF-8 can be represented in binary as:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
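The four layouts above translate fairly directly into code. What follows is a hand-rolled Python sketch of an encoder, written only to show where the bits go; in real code you would simply call str.encode("utf-8"). The function name encode_utf8 is hypothetical, and the range checks (nothing above U+10FFFF, no surrogates) come from the current standard rather than from this post.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")

    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    return bytes([               # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


# Greek small letter lambda, U+03BB, comes out as two bytes.
print(" ".join(format(b, "08b") for b in encode_utf8(0x3BB)))
```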
If you’re interested in further reading regarding Unicode and character encodings, a good place to start would be Wikipedia:
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Character_encoding
BOM – Is it part of the data?
Posted by Eric in Programming on May 3rd, 2008
This is a post in response to a comment at Ben Nadel’s blog by PaulH which I think is an interesting and important discussion, but sufficiently off-topic to the blog entry at hand that I didn’t want to completely derail the on-topic discussion.
The BOM (Byte Order Mark, U+FEFF) was initially intended to indicate the order of the bytes when dealing with systems which may have little-endian or big-endian CPUs, and which therefore write characters in the byte order best suited to them while potentially interoperating with systems of the opposite endianness. Its use in recent years has mostly been as a hint to byte stream -> character array string parsers as to the encoding of the string. For example, you can identify UTF-8 encoded files with a leading BOM because they will start with 0xEF 0xBB 0xBF (the UTF-8 encoding of U+FEFF). UTF-16BE will start with 0xFE 0xFF, UTF-16LE will start with 0xFF 0xFE, and so on.
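As an illustration of using the BOM as an encoding hint, here is a minimal Python sketch built on the BOM constants in the standard codecs module (UTF-8 is EF BB BF, UTF-16BE is FE FF, UTF-16LE is FF FE). The function name sniff_bom is made up, and the sketch only covers these three encodings; UTF-32 would need extra care because its little-endian BOM begins with the same two bytes as UTF-16LE.

```python
import codecs

# BOM byte sequences paired with the encoding each one hints at.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return the encoding hinted at by a leading BOM, or None if there is none."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None


print(sniff_bom(b"\xef\xbb\xbf<?xml version='1.0'?>"))
print(sniff_bom(b"<?xml version='1.0'?>"))
```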
In a conceptual sense, byte order marks are not considered to be part of the data. In fact, U+FEFF is a Zero Width Non-Breaking Space, and its use as a character within data is considered deprecated because of its use as a BOM. You’re supposed to use U+2060 WORD JOINER instead of U+FEFF ZWNBSP. The fact that U+FEFF is a zero-width character is part of why it was chosen as BOM – if a system reads the encoding correctly but doesn’t know how to deal with BOM, it will be interpreted instead as an invisible character – exactly what we’re seeing with CFHTTP.
The problem in this case is that no character, not even a zero-width character, is permitted before the XML declaration (the leading <?xml ... ?> line) in an XML document. ColdFusion’s cfhttp tag preserves the BOM as part of the data, while its xmlParse() function fails to handle it correctly. This is an inconsistency, and I suspect you may be able to start a holy war over which feature has the bug.
I Googled around for a while to see if I could find a source that said definitively whether a BOM is considered part of the Unicode data, or simply a hint which is intended to be dropped as part of the Unicode decoding (eg, we convert multiple bytes into single characters in the case of characters greater than U+007F; likewise we consume the hint as a means of informing us how the file is encoded, and nothing else). This latter case has always been my understanding of it, and indeed this behavior is reinforced by many systems, including ColdFusion itself in some arenas (eg, reading a file with cffile that contains a leading BOM – the BOM will be discarded). Unfortunately other systems do seem to retain BOM, but it’s impossible to say for those systems whether this was a design decision or a failure to address BOM at all, as the two would look the same to an outside observer.
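The two behaviors are easy to see side by side outside of ColdFusion. As an illustration only (this is Python, not CF, and says nothing about how CF is implemented), Python ships both choices as separate codecs: "utf-8" keeps a leading BOM as the character U+FEFF, while "utf-8-sig" consumes it as a hint and drops it.

```python
# A UTF-8 byte stream that starts with a BOM, followed by an XML document.
raw = b"\xef\xbb\xbf" + b'<?xml version="1.0"?><root/>'

kept = raw.decode("utf-8")         # BOM survives as an invisible U+FEFF
dropped = raw.decode("utf-8-sig")  # BOM is treated as metadata and discarded

print(repr(kept[:6]))     # starts with '\ufeff', so '<' is not the first character
print(repr(dropped[:6]))  # starts directly with '<?xml'
```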
I couldn’t find much in the way of a definitive statement for or against BOM being retained as part of the character array. The closest I could find is something you alluded to – XML 1.0 specifies that BOM is not considered to be part of the data, and should be used only to identify the endianness of the data being passed to it, and otherwise ignored. That wording is a bit ambiguous since in its original context, it’s talking about how to handle BOM at the start of a byte stream.
As to the applicability of this part of the XML standard to ColdFusion’s XML parser – ColdFusion’s XML parser isn’t dealing with a byte stream, it’s dealing with a character array. By the point we call xmlParse(), we have gotten past our need for the BOM (if we’ve already parsed the byte stream as characters, BOM can no longer affect what we do with those characters), so the XML standard on dealing with BOM no longer applies.
So this all comes down to: Is BOM part of the data or just metadata? Conceptually it’s metadata; the question is whether it should be preserved in the character array. I land firmly in the camp of it being purely metadata, and of it being desirable to discard it as part of character parsing. ColdFusion treats it as metadata in some instances and as part of the data in other instances, and this inconsistency is where we see the original error.
Some software even provides an option as to whether or not BOM should be preserved as part of the data: http://www.webhostingsearch.com/blogs/richard/bom-byte-order-mark-in-biztalk-output/
In any event, CF should be consistent with how it handles BOM, and if it considers it as part of the data, then its string functions should consistently handle it as such (ie, xmlParse() should ignore it). If it considers it to be metadata, then it should always be discarded when parsing the byte stream into a character array. The fact that in some cases it discards BOM when parsing characters says to me that the design decision by Macrodobeia was to discard it when parsing characters, since someone had to have written code to this effect, but that it wasn’t applied consistently throughout.
Unicode: The absolute minimum every developer should know.
Posted by Eric in Programming on May 2nd, 2008
Over the past few years, I’ve been engaged in a few projects which have required international support. Through trial by fire, I’ve learned a fair amount about Unicode and character encoding. I now consider this to be essential knowledge for all programmers – web programmers especially. This becomes particularly important when you start dealing with strict formats such as XML, which may care greatly what the individual characters are, and whose schema may be broken by invalidly encoding some characters.
There’s a great article here with the absurdly long yet very much accurate title of The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky of Joel on Software fame.
It’s important to realize that Unicode is just a really big alphabet. It’s a collection of letters, glyphs, and symbols from a lot of different languages, some non-languages (eg, there’s a smiley with an eyepatch), and even some fake languages (I believe there’s some Klingon in there). It specifies nothing about what bytes you use to represent a given Unicode character, it just says that character 12345 should be such and such a symbol.
Next comes encoding. Encoding defines how we map a given Unicode character into bytes. There are many encoding schemes out there, and some of them favor one language or another (ISO-8859-1, which is common in Microsoft applications, favors European languages).
Not all encodings are able to represent the entire Unicode character set. The set of characters which can be represented by a given encoding is known as that encoding’s character repertoire. The encoding specifies both how to represent characters in terms of bytes on a disk, and also which subset of the Unicode character set this encoding represents. The same bytes interpreted as a different character encoding will yield different glyphs.
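Both points can be shown with a short Python sketch. The sample characters (an accented e and the euro sign) and the choice of ISO-8859-1 as the second encoding are my own picks for illustration.

```python
# The same two bytes, read under two different encodings.
raw = "\u00e9".encode("utf-8")        # e-acute stored as the bytes C3 A9

as_utf8 = raw.decode("utf-8")         # one character: U+00E9
as_latin1 = raw.decode("iso-8859-1")  # two characters: U+00C3 then U+00A9

print(repr(as_utf8), repr(as_latin1))

# Repertoire: ISO-8859-1 has no euro sign, so it cannot encode U+20AC.
try:
    "\u20ac".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

print("euro sign representable in ISO-8859-1:", euro_in_latin1)
```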
Some encodings specify that each character is made up of some pre-determined, fixed number of bytes, while others specify that different characters are made up of different numbers of bytes – usually the most common characters are a single byte, and less common characters are multiple bytes.
One of the most common encodings is UTF-8. This is an incredibly popular character encoding, and for good reason – it’s very flexible, and uses only a single byte for the most common characters, making it also one of the most compact. I’ll talk about it some more tomorrow.
Adobe opens the file formats for SWF and FLA
Posted by Eric in Programming on May 1st, 2008
Adobe is opening up the file formats for SWF and FLA, which is major news! SWF is the format run by Flash Player, and FLA is the source format which is used to create SWFs. With this documentation, anyone will be able to create their own FLA and SWF creation software. Previously these formats had been substantially reverse engineered, but there were still bits and pieces which had not been figured out.

