2011-02-19

Parallel collections in Scala



The other day I got inspired to check out the new parallel collection feature of Scala 2.9.
Since there isn’t a release yet, I braced myself for an interesting time. However, I quickly found it much easier than expected.


Download and unpack the latest build of Scala 2.9 from scala-lang: http://www.scala-lang.org/node/212/distributions. Since I didn’t want to mess with my current environment, I decided to do the tests in the REPL. To start a new Scala console, just go into the newly unpacked distribution and start it from there.


Then for something to do:
(1 to 1000000).toArray.map(_*2)
(1 to 1000000).toArray.par.map(_*2)
You turn on the parallel feature by slapping a .par onto the collection (or .toParIterator/.toParSeq …, depending on collection type). Use Array in preference to List, since an indexed collection is better suited for parallelization. Of course, it is up to you to guarantee that what is done to the elements in the collection really can be done safely in parallel, but the mechanics really are this simple!
However, when I ran it, the parallel version turned out to take more time than the normal one?!
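To compare the two variants with something better than a gut feeling, a small timing helper is handy. The sketch below is my own illustration, not code from the gist: bestOfMillis is a hypothetical name, and it simply runs a block a few times and reports the best wall-clock time, which smooths out some JIT warm-up noise. It is only a rough measurement, not a proper benchmark.

```scala
// Hypothetical helper (not part of the Scala library): runs `body` a few
// times and returns the best wall-clock time in milliseconds.
def bestOfMillis[A](runs: Int)(body: => A): Double = {
  require(runs > 0, "runs must be positive")
  val timings = for (_ <- 1 to runs) yield {
    val start = System.nanoTime()
    body // result is discarded; only the elapsed time matters
    (System.nanoTime() - start) / 1e6
  }
  timings.min
}

val arr = (1 to 1000000).toArray
val sequentialMs = bestOfMillis(5)(arr.map(_ * 2))
println("sequential: " + sequentialMs + " ms")

// In a Scala 2.9 REPL the parallel variant is measured the same way:
//   bestOfMillis(5)(arr.par.map(_ * 2))
```

Taking the minimum over several runs is a deliberate choice here: the first run often includes class loading and JIT compilation, which would otherwise dominate such a short computation.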


I tried again:
case class Call(i: Int) { 
  def call { Thread.sleep(100) } 
}

val arr = (1 to 100).toArray.map(new Call(_))
arr.map(_.call)

val parr = arr.par
parr.map(_.call)
Trust restored, twice the speed for the parallel version.


So what was the problem? Why isn’t the parallel version always faster? The parallel collections implementation in Scala uses the fork/join thread pool from Doug Lea’s concurrency enhancements to Java, and this isn’t introduced until Java 7. If the fork/join thread pool isn’t available, Scala falls back to a normal thread pool, and if you try to do processor-intensive things with plain threads only, the results tend to be “not so good”.


So, the next step was to make use of all cores.
After some searching I found the project openjdk-osx-build on Google Code, which continuously builds binary releases of Java 7 for OS X, and I downloaded and installed the latest build. Then it was only a matter of changing the environment in “Java Preferences” to select Java 7 first and starting a new REPL console.


The Scala REPL nicely reports Scala 2.9 and Java 7, and now both earlier examples are faster in parallel mode: the “memory intensive” example is almost twice as fast and the “thread sleep” example up to 10 times as fast!
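If you want to double-check what the REPL is actually running on, the versions and the core count can also be read from inside the session. This is a small sketch using standard JVM and Scala library calls; the variable names are my own. The core count is worth a look because it puts a rough ceiling on the speedup you can expect for CPU-bound work.

```scala
// Standard JVM / Scala library calls; nothing here is specific to
// parallel collections.
val javaVersion  = System.getProperty("java.version")
val scalaVersion = scala.util.Properties.versionString
val cores        = Runtime.getRuntime.availableProcessors

println("Java " + javaVersion + ", Scala " + scalaVersion + ", " + cores + " cores")
```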


So now it is time to buckle up, start writing immutable objects and reap the benefits!
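As a reminder of why shared mutable state and .par don’t mix, here is a minimal sketch of my own (not from the gist). It assumes a Scala version where .par is built in (2.9 through 2.12); on later versions parallel collections live in a separate module and need an extra import.

```scala
// Assumes Scala 2.9-2.12, where .par is available on standard collections.
val xs = (1 to 100000).toArray

// Unsafe: several threads update the same var without synchronization,
// so updates can be lost and the total may come out too low.
var unsafeSum = 0L
xs.par.foreach(x => unsafeSum += x)

// Safe: no shared mutable state; the library combines the partial results.
val safeSum = xs.par.map(_.toLong).sum

println("unsafe: " + unsafeSum + ", safe: " + safeSum)
```

The safe version always gives the same answer, while the unsafe one may differ from run to run depending on how the threads interleave.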


Runnable code at: 
https://gist.github.com/829556
