Tag Archive: hdfs


Hadoop Secure Impersonation

As I mentioned in a previous post, I’ve been working on some import/export functionality.

One side of the fence is a database and another is HDFS.

I have an authenticated user on the database who is authorized to access some data and wants it on HDFS owned by him in his home directory. How do I propagate the authentication/authorization?

Turns out that Hadoop does support a secure impersonation feature. In some sense it’s kind of close to how this database supports a proxy impersonation.

In essence, we will configure the Hadoop cluster to recognize the UNIX process account running the middle-ware component as a super-user. We will further specify the IP’s that the proxy requests can originate from. Lastly we specify to what groups of users the proxy super user can impersonate. Of course the middle-ware components will need to invoke the security proxy.

So that’s how it’s supposed to work. In progress for setting up environment for testing. Hope things go smoothly but already know once the Hadoop cluster goes to Kerberos this will break until the middle-ware process account goes to Active-Directory which will be an issue.

Fighting multi-thread HDFS writes

I’ve written some software to help push large volumes of data from one platform to HDFS.

For this example, let’s say I’m trying to push about 10B rows or around 5TB of data.

Here are the two more annoying problems that have arose:

1) ConcurrentModificationException

So my process is multi-threaded. Each thread needs to write to a HDFS file. Often threads will start running at nearly the same time.

I kept getting “ConcurrentModificationException” when I tried to execute FileSystem.get() specifically where Configuration was trying to load resources via some Iterator.

It was annoying but at least solve-able. Had to go static volatile variable with explicit synchronization on the export class’s initialization check/launch piece.

I was happy to see it go.

2) NotYetReplicatedException

I have not been able to get this one to go.

These exceptions are intermittent. Sometimes they are merely warnings and go away after couple appearances. Other times they are run out of retries and cause havoc.

I suspect the particular hosts where this software is running may be pushing excessive data into this particular cluster’s HDFS.

Evidently there is some async behavior involved in the verification of writing a block and requesting the next block, it’s not exactly clear to me the true nature of the problem.

In this case, I’m feeling the problem isn’t the client code but Hadoop itself. It seems unfortunate as this cluster is good sized at one thousand nodes.

To be fair, there is a lot of HDFS writing going on on the client hosts I’m working on.

However… I find this a pretty weak situation. I struggle to fathom how such a large cluster is choking on something like writing data.

Hopefully I get to the bottom of this.

Update 2012-05-15

Appears found solution or workaround for block not yet replicated fatal exceptions:

<property>
<name>dfs.client.block.write.locateFollowingBlock.retries</name>
<value>10</value>
</property>

I can’t say whether this is generic Apache configuration or something site specific.

I can say it looks like it did the trick :)

(knock on wood)

Update 2012-06-26

The Hadoop FileSystem object is static and indeed multiple threads have to synchronize on shared objects.

There is a list holding outbound packets. The size of the packet and list is hard coded, at least in my versions (0.22 I think).

In that sense, writing multiple HDFS text files from individual threads won’t get you very far.

Sequence files seem to have less contention.

Follow

Get every new post delivered to your Inbox.

Join 25 other followers