Posts Tagged threads

CFThread and dividing up work

CFThread is a wonderful addition to ColdFusion 8. It lets you perform parallel actions within your code. However, parallel programming is a complex beast under the best of circumstances.

One of the early things to realize about CF8’s threading support is that it makes a deep copy of the local variables (à la Duplicate()) when you start the thread. This gives you a lot of thread safety: you can access and even change variables defined outside the thread space, because what you really have is a copy of each variable local to your thread. You don’t have to worry about other threads changing a value while you’re using it, and you don’t have to worry about goofing it up for the parent page (the page thread) either.

Dividing Up the Work
A common use for threading is to divide up a lot of work so that it can be done in parallel. For example, let’s say you have a script which aggregates RSS feeds from 1,000 external sites. All that HTTP work is mostly idle time spent waiting for responses, which makes it a good candidate for parallelizing.

With 1,000 requests to make, it’s not a good idea to just create 1,000 threads – you’ll tie up a lot of TCP connections on your server, plus you’ll use up a lot of memory (remember, each thread is going to operate in its own memory space). So let’s decide somewhat arbitrarily that we’re only going to run 10 requests in parallel.

// How many feeds will each thread handle?
eachThreadDoesCount = ceiling(ArrayLen(feedURLs) / parallelThreads);

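// Inside each thread, threadID (the loop index) picks out that thread's slice of feeds:
//   first feed index: (threadID - 1) * eachThreadDoesCount + 1
//   last feed index:  min(threadID * eachThreadDoesCount, ArrayLen(feedURLs))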

Looks pretty easy, eh? There’s a critical error there, though, which is not obvious even from Adobe’s documentation: the use of threadID inside the thread body (bolded in the original listing). This will end up with the earliest URLs getting fetched numerous times, and the later URLs not getting fetched at all.

The reason for this has to do in some way with how ColdFusion initiates its threads under the hood. To me, it looks like when the <cfthread> tag is encountered, it actually spawns an initial thread to do the local variable copying. All threads which are started before this initial thread finishes the copying will get an identical set of local variables, and this includes the threadID variable used in the loop. I’m not certain that’s actually what’s going on under the hood, but the behavior is consistent with it. In any event, you cannot rely on variables changed by the parent page (the “page thread” in Adobe’s documentation) appearing in your cfthreads, even if that change was made before your specific thread was launched, but after the initial thread was launched.

It seems bizarre, but let me give you a piece of sample code which demonstrates it. This code is non-deterministic for me – that is to say, sometimes I see a “correct” result, but then I immediately refresh it and get a different result.
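The original sample listing did not survive here, so what follows is only a minimal sketch of that kind of test, not the author's exact code. It assumes 20 workers named worker1 to worker20 (hypothetical names) that each read the page's loopID without any scope prefix and record what they saw. Whether the mismatch shows up at all will depend on the ColdFusion version and on timing.

```cfml
<cfset workerNames = "">

<cfloop from="1" to="20" index="loopID">
    <cfset workerNames = listAppend(workerNames, "worker#loopID#")>
    <cfthread action="run" name="worker#loopID#">
        <!--- Unsafe on purpose: reads the page's loop index from inside the thread. --->
        <cfset thread.loopID = loopID>
    </cfthread>
</cfloop>

<!--- Wait for every worker before reporting what each one saw. --->
<cfthread action="join" name="#workerNames#" />

<cfoutput>
<cfloop from="1" to="20" index="n">
    Worker #n# has a loopID of #cfthread["worker" & n].loopID#<br>
</cfloop>
</cfoutput>
```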


Here you should see each worker thread having a unique loopID. Instead, for example, my code shows Worker 1 has a loopID of 1, Worker 2 = 2, Worker 3 = 3, Worker 4 = 4, but Workers 5 through 20 have a loopID of 5. If I refresh, it’s different. I’ve seen all 20 workers starting with a loopID of 1, and I’ve seen the first few having a loopID of 1, then the next few having a loopID of 5, then the next few having a loopID of 8, etc.

So how do you safely determine which worker you are so you know which of the set of work you should be doing? The answer, as Ray posted in a comment (and contrary to my more elaborate work-around involving locking), is to use the attributes scope:


Basically the idea is you can pass additional custom attributes to <cfthread>, and these show up as values in a structure named “attributes” available only within the scope of your thread.
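As a rough sketch of how that could look for the feed example (the original listing is missing, so this is an illustration rather than Ray's or the author's code): the attribute names workerID, perWorker and urls, the thread names feedWorker1, feedWorker2 and so on, and the example.com URLs are all made up for this example. Adobe's documentation describes custom attributes as being deep-copied into the thread when it is created, which is what makes each worker's values stable.

```cfml
<cfset feedURLs = listToArray("http://example.com/a.xml,http://example.com/b.xml,http://example.com/c.xml")>
<cfset parallelThreads = 2>
<cfset eachThreadDoesCount = ceiling(arrayLen(feedURLs) / parallelThreads)>
<cfset workerNames = "">

<cfloop from="1" to="#parallelThreads#" index="threadID">
    <cfset workerNames = listAppend(workerNames, "feedWorker#threadID#")>
    <!--- Everything the worker needs is handed over as a custom attribute. --->
    <cfthread action="run" name="feedWorker#threadID#"
              workerID="#threadID#"
              perWorker="#eachThreadDoesCount#"
              urls="#feedURLs#">
        <!--- Only the attributes scope is read in here, never the page's threadID. --->
        <cfset thread.firstIndex = (attributes.workerID - 1) * attributes.perWorker + 1>
        <cfset thread.lastIndex = min(attributes.workerID * attributes.perWorker, arrayLen(attributes.urls))>
        <cfloop from="#thread.firstIndex#" to="#thread.lastIndex#" index="i">
            <cfhttp url="#attributes.urls[i]#" method="get" timeout="30" result="feedResponse">
            <!--- Parse feedResponse.fileContent here. --->
        </cfloop>
    </cfthread>
</cfloop>

<!--- Block the page until all workers have finished. --->
<cfthread action="join" name="#workerNames#" />
```

Passing the whole array to every worker means each thread gets its own copy of it. For a very long URL list it may be cheaper to pass only the slice each worker needs.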


Real Time Command Execution Feedback

Did you ever write a utility ColdFusion script which uses <cfexecute> to run a command and send its output back to the browser? It makes for convenient, monitorable remote execution of certain repetitive tasks. My most common use for this sort of thing is an rsync process which can be invoked from anywhere in the world. Most recently I’ve been working with Selenium-RC to set up regression test scenarios which business users and business analysts can initiate without having Selenium IDE installed or knowing how to use it.

I’ve always found it frustrating, though, that with a long-running and potentially error-prone task you don’t know whether it succeeded or failed until the entire command has finished. Even more frustrating is not knowing whether it has hung for some reason, or just has a lot more work to do today than normal.

This little snippet uses the Java runtime to capture the program’s output and pipe it back to the browser in real time. The main caveat is that it can’t be used inside a forced-buffer area (like <cfsavecontent>), but otherwise it should work just fine. That means you can’t really use it inside most modern CF frameworks, which depend heavily on <cfsavecontent> and the like.

Standard input (stdin) is shut down right at the start of execution; if you wanted to interact with the program in some way (such as to script some responses to prompts), you could undo that and write to it. Standard output (stdout) and standard error (stderr) are sent to the browser and flushed in nearly real time (stderr outputs in red to boot). I use a non-busy sleep via a Java thread to check in on the running program once a second for new output. Return value is a structure containing the elements exitValue, stdOut, and stdErr, so you can do further processing with it after the fact.

Anyway, enough blather, here is the code. This is not hyper-efficient (too many string concatenations and HTMLEditFormats), so I don’t recommend you use it in any high volume situations, especially if there’s a lot of output expected from your command, but it’s been sufficient for my needs.
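The original listing is missing from this copy of the post. The following is a minimal sketch of the approach described above, not the author's code: the function names runWithFeedback and drainReader are hypothetical, and it relies only on java.lang.Runtime, java.lang.Process, java.io.InputStreamReader and java.lang.Thread. It assumes the command can be passed to Runtime.exec() as a single string.

```cfml
<!--- Hypothetical helper: returns whatever characters can be read right now without blocking. --->
<cffunction name="drainReader" returntype="string" output="false">
    <cfargument name="reader" type="any" required="true">
    <cfset var text = "">
    <cfset var code = 0>
    <cfloop condition="arguments.reader.ready()">
        <cfset code = arguments.reader.read()>
        <cfif code LT 0><cfbreak></cfif>
        <cfset text = text & chr(code)>
    </cfloop>
    <cfreturn text>
</cffunction>

<!--- Hypothetical name: runs a command and streams stdout/stderr to the browser. --->
<cffunction name="runWithFeedback" returntype="struct" output="true">
    <cfargument name="command" type="string" required="true">

    <cfset var result = structNew()>
    <cfset var process = createObject("java", "java.lang.Runtime").getRuntime().exec(arguments.command)>
    <cfset var outReader = createObject("java", "java.io.InputStreamReader").init(process.getInputStream())>
    <cfset var errReader = createObject("java", "java.io.InputStreamReader").init(process.getErrorStream())>
    <cfset var javaThread = createObject("java", "java.lang.Thread")>
    <cfset var chunk = "">
    <cfset var running = true>

    <cfset result.exitValue = "">
    <cfset result.stdOut = "">
    <cfset result.stdErr = "">

    <!--- No interactive input in this sketch, so close stdin straight away. --->
    <cfset process.getOutputStream().close()>

    <cfloop condition="running">
        <cftry>
            <!--- exitValue() throws while the process is still running. --->
            <cfset result.exitValue = process.exitValue()>
            <cfset running = false>
            <cfcatch type="any">
                <!--- Still running: sleep for a second instead of busy-waiting. --->
                <cfset javaThread.sleep(javaCast("long", 1000))>
            </cfcatch>
        </cftry>

        <!--- Send any new stdout text to the browser. --->
        <cfset chunk = drainReader(outReader)>
        <cfif len(chunk)>
            <cfset result.stdOut = result.stdOut & chunk>
            <cfoutput>#htmlEditFormat(chunk)#</cfoutput>
        </cfif>

        <!--- Send any new stderr text to the browser, in red. --->
        <cfset chunk = drainReader(errReader)>
        <cfif len(chunk)>
            <cfset result.stdErr = result.stdErr & chunk>
            <cfoutput><span style="color: red;">#htmlEditFormat(chunk)#</span></cfoutput>
        </cfif>

        <cfflush>
    </cfloop>

    <cfreturn result>
</cffunction>
```

The streams are drained once more after the exit value is read, so output written just before the process ends should still be picked up.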

Example usage:
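The original usage listing is also missing. Assuming the function is available under the hypothetical name runWithFeedback, a call might look roughly like this. The rsync command and its paths are placeholders.

```cfml
<!--- Wrapping the call in <pre> keeps the command's line breaks visible in the browser. --->
<pre>
<cfset outcome = runWithFeedback("rsync -av /path/to/source/ /path/to/destination/")>
</pre>

<cfoutput>
<p>Exit value: #outcome.exitValue#</p>
<p>#len(outcome.stdErr)# characters were written to stderr.</p>
</cfoutput>
```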
