Archive for April, 2012


Hadoop Secure Impersonation

As I mentioned in a previous post, I’ve been working on some import/export functionality.

One side of the fence is a database and another is HDFS.

I have an authenticated user on the database who is authorized to access some data and wants it on HDFS owned by him in his home directory. How do I propagate the authentication/authorization?

Turns out that Hadoop does support a secure impersonation feature. In some sense it’s kind of close to how this database supports a proxy impersonation.

In essence, we will configure the Hadoop cluster to recognize the UNIX process account running the middle-ware component as a super-user. We will further specify the IP’s that the proxy requests can originate from. Lastly we specify to what groups of users the proxy super user can impersonate. Of course the middle-ware components will need to invoke the security proxy.

So that’s how it’s supposed to work. In progress for setting up environment for testing. Hope things go smoothly but already know once the Hadoop cluster goes to Kerberos this will break until the middle-ware process account goes to Active-Directory which will be an issue.

Exports at 28GB/s

Been working on some database import/export infrastructure.

It can get pretty exciting seeing a see each host in a 24 node cluster max out it’s 10Gb/s NIC. Moving data at 28GB/s is pretty exciting.

The speed does come with a drawback, suppose your 24 node bridge cluster talks to 256 database nodes. Now each of those database nodes has a 1Gb/s NIC. Unfortunately one is running at 100Mb/s.

So the crux of parallel processing comes true, you are only as fast as your slowest component.

Back to the export, after the joy of watching the data stream like there is no tomorrow you are left watching one single stream take very close to 35 minutes while everything else finished within 4 minutes.

Try waiting on a bench for 4 minutes then try again for 35 minutes, feels a little different.

Can’t wait for 100GB/s.

Google’s Guava

In the process of working on some miscellaneous Java projects I noticed this guava jar.

Today I did a google search on it and low and behold it’s a Google Java library of assorted functionality.

I have passing familiarity with Apache’s common libraries but I had never heard of Guava. I wonder if this is because my ignorance or because it’s not a hugely known product.

I took a peek and what I saw warranted future exploration. As I began the exploration I was initially excited and thought the library would be very useful. Further into the API I began to lose the enthusiasm and began to doubt the library’s usefulness to me.

One glaring issue for me was handling of big and little endian in terms of byte arrays and conversion to primitives. I need support for both but what I found in the library was hit and miss depending on the particular class.

In the end it seems the collections are good. Also the base stuff has some interesting things that could be useful like more string functionality. Overall it’s interesting but not that useful to me.

Fighting multi-thread HDFS writes

I’ve written some software to help push large volumes of data from one platform to HDFS.

For this example, let’s say I’m trying to push about 10B rows or around 5TB of data.

Here are the two more annoying problems that have arose:

1) ConcurrentModificationException

So my process is multi-threaded. Each thread needs to write to a HDFS file. Often threads will start running at nearly the same time.

I kept getting “ConcurrentModificationException” when I tried to execute FileSystem.get() specifically where Configuration was trying to load resources via some Iterator.

It was annoying but at least solve-able. Had to go static volatile variable with explicit synchronization on the export class’s initialization check/launch piece.

I was happy to see it go.

2) NotYetReplicatedException

I have not been able to get this one to go.

These exceptions are intermittent. Sometimes they are merely warnings and go away after couple appearances. Other times they are run out of retries and cause havoc.

I suspect the particular hosts where this software is running may be pushing excessive data into this particular cluster’s HDFS.

Evidently there is some async behavior involved in the verification of writing a block and requesting the next block, it’s not exactly clear to me the true nature of the problem.

In this case, I’m feeling the problem isn’t the client code but Hadoop itself. It seems unfortunate as this cluster is good sized at one thousand nodes.

To be fair, there is a lot of HDFS writing going on on the client hosts I’m working on.

However… I find this a pretty weak situation. I struggle to fathom how such a large cluster is choking on something like writing data.

Hopefully I get to the bottom of this.

Update 2012-05-15

Appears found solution or workaround for block not yet replicated fatal exceptions:

<property>
<name>dfs.client.block.write.locateFollowingBlock.retries</name>
<value>10</value>
</property>

I can’t say whether this is generic Apache configuration or something site specific.

I can say it looks like it did the trick :)

(knock on wood)

Update 2012-06-26

The Hadoop FileSystem object is static and indeed multiple threads have to synchronize on shared objects.

There is a list holding outbound packets. The size of the packet and list is hard coded, at least in my versions (0.22 I think).

In that sense, writing multiple HDFS text files from individual threads won’t get you very far.

Sequence files seem to have less contention.

Follow

Get every new post delivered to your Inbox.

Join 25 other followers