I’ve written some software to help push large volumes of data from one platform to HDFS.
For this example, let’s say I’m trying to push about 10B rows or around 5TB of data.
Here are the two more annoying problems that have arose:
1) ConcurrentModificationException
So my process is multi-threaded. Each thread needs to write to a HDFS file. Often threads will start running at nearly the same time.
I kept getting “ConcurrentModificationException” when I tried to execute FileSystem.get() specifically where Configuration was trying to load resources via some Iterator.
It was annoying but at least solve-able. Had to go static volatile variable with explicit synchronization on the export class’s initialization check/launch piece.
I was happy to see it go.
2) NotYetReplicatedException
I have not been able to get this one to go.
These exceptions are intermittent. Sometimes they are merely warnings and go away after couple appearances. Other times they are run out of retries and cause havoc.
I suspect the particular hosts where this software is running may be pushing excessive data into this particular cluster’s HDFS.
Evidently there is some async behavior involved in the verification of writing a block and requesting the next block, it’s not exactly clear to me the true nature of the problem.
In this case, I’m feeling the problem isn’t the client code but Hadoop itself. It seems unfortunate as this cluster is good sized at one thousand nodes.
To be fair, there is a lot of HDFS writing going on on the client hosts I’m working on.
However… I find this a pretty weak situation. I struggle to fathom how such a large cluster is choking on something like writing data.
Hopefully I get to the bottom of this.
Update 2012-05-15
Appears found solution or workaround for block not yet replicated fatal exceptions:
<property>
<name>dfs.client.block.write.locateFollowingBlock.retries</name>
<value>10</value>
</property>
I can’t say whether this is generic Apache configuration or something site specific.
I can say it looks like it did the trick
(knock on wood)
Update 2012-06-26
The Hadoop FileSystem object is static and indeed multiple threads have to synchronize on shared objects.
There is a list holding outbound packets. The size of the packet and list is hard coded, at least in my versions (0.22 I think).
In that sense, writing multiple HDFS text files from individual threads won’t get you very far.
Sequence files seem to have less contention.