Tag Archive: hadoop


If you use BinaryComparable.getBytes(), you need to truncate the returned array to the length reported by BinaryComparable.getLength().

Otherwise you will find junk from previous rows in your byte array.

Might be worth documenting that in the API, just saying.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/BinaryComparable.html#getBytes()
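To make the getBytes()/getLength() note above concrete, here is a minimal sketch that runs without Hadoop on the classpath. ReusedBuffer is a hypothetical stand-in I made up to mimic a value whose backing array is reused between rows; it is not a Hadoop class. With a real BinaryComparable the same trimming call would be Arrays.copyOf(value.getBytes(), value.getLength()).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TrimExample {
    // Copy only the valid prefix of the backing array.
    static byte[] validBytes(ReusedBuffer value) {
        return Arrays.copyOf(value.getBytes(), value.getLength());
    }

    public static void main(String[] args) {
        ReusedBuffer row = new ReusedBuffer();
        row.set("a long first row");
        row.set("short");
        // The raw array still carries the tail of the previous row.
        System.out.println(new String(row.getBytes(), StandardCharsets.UTF_8));
        // The trimmed copy holds only the current row.
        System.out.println(new String(validBytes(row), StandardCharsets.UTF_8));
    }
}

// Hypothetical stand-in (an assumption, not Hadoop code): the backing
// array is reused across rows and only ever grows.
final class ReusedBuffer {
    private byte[] bytes = new byte[0];
    private int length = 0;

    void set(String s) {
        byte[] src = s.getBytes(StandardCharsets.UTF_8);
        if (src.length > bytes.length) {
            bytes = new byte[src.length]; // grow, never shrink
        }
        System.arraycopy(src, 0, bytes, 0, src.length);
        length = src.length;
    }

    byte[] getBytes() { return bytes; }  // whole backing array
    int getLength()   { return length; } // number of valid bytes
}
```

The copy costs an allocation per row, so in a hot loop it may be cheaper to pass the array along with an explicit offset and length instead.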

I am getting excited about seeing large data systems employ InfiniBand. Granted, high speed interconnects have always played an important role in high performance computing. What excites me are the implications of remote memory access and the diversification of capability it provides.

For example, suppose I have a large database system that is pretty well balanced in CPU and IO. New use cases arrive that require a very large CPU draw with low IO impact. One option might be to expand the cluster, but doing so would oversubscribe my IO capability at significant cost (IO is usually the most expensive part). Alternatively, with a system configured to use InfiniBand, the software could employ remote memory access to effectively “beam” a parcel of data from my existing cluster to a new high-CPU cluster, which can burn through the necessary computation and “beam” back the results. That enables a cost-effective, high-performance co-processor capability.

What I find compelling is how InfiniBand abstracts this capability into a simple implementation, and the speed at which InfiniBand can perform this work.

I wouldn’t say InfiniBand is new, but I do think more enterprise-ish parallel computing systems are embracing it (Teradata, Exadata). I wondered what Hadoop might get out of InfiniBand, and interestingly enough some folks have already taken a look at this (link). What I gleaned is that Hadoop’s software doesn’t embrace the necessary protocols (NIO) to take advantage of InfiniBand from a socket perspective, and it is probably a long way from embracing the remote memory access model.

Hadoop Secure Impersonation

As I mentioned in a previous post, I’ve been working on some import/export functionality.

One side of the fence is a database and another is HDFS.

I have an authenticated user on the database who is authorized to access some data and wants it on HDFS owned by him in his home directory. How do I propagate the authentication/authorization?

Turns out that Hadoop does support a secure impersonation feature. It is fairly close to how this database supports proxy impersonation.

In essence, we will configure the Hadoop cluster to recognize the UNIX process account running the middleware component as a super-user. We will further specify the IP addresses that the proxy requests can originate from. Lastly, we specify which groups of users the proxy super-user can impersonate. Of course, the middleware components will need to invoke the security proxy.
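As a rough sketch of the cluster-side settings just described, the Hadoop proxy-user properties in core-site.xml look something like the fragment below. The account name mwsvc, the IP addresses, and the group names are placeholders I made up for illustration; substitute your own.

```xml
<!-- core-site.xml; "mwsvc" is a hypothetical middleware process account -->
<property>
  <name>hadoop.proxyuser.mwsvc.hosts</name>
  <!-- placeholder IPs the proxy requests may originate from -->
  <value>10.0.0.11,10.0.0.12</value>
</property>
<property>
  <name>hadoop.proxyuser.mwsvc.groups</name>
  <!-- placeholder groups whose members may be impersonated -->
  <value>analysts,etl</value>
</property>
```

On the client side, the usual pattern is to wrap the HDFS calls in UserGroupInformation.createProxyUser(...) followed by doAs(...), though I have not shown that here since it needs a live cluster to run.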

So that’s how it’s supposed to work. I’m in the process of setting up an environment for testing. I hope things go smoothly, but I already know that once the Hadoop cluster moves to Kerberos this will break until the middleware process account moves to Active Directory, which will be an issue.

Fighting multi-thread HDFS writes

I’ve written some software to help push large volumes of data from one platform to HDFS.

For this example, let’s say I’m trying to push about 10B rows or around 5TB of data.

Here are the two most annoying problems that have arisen:

1) ConcurrentModificationException

So my process is multi-threaded. Each thread needs to write to an HDFS file. Often threads will start running at nearly the same time.

I kept getting “ConcurrentModificationException” when I tried to execute FileSystem.get(), specifically where Configuration was trying to load resources via some Iterator.

It was annoying but at least solvable. I had to use a static volatile variable with explicit synchronization around the export class’s initialization check/launch piece.
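The static volatile plus explicit synchronization approach can be sketched roughly as follows. This is not my actual export class: SharedClient and expensiveInit() are hypothetical names, and expensiveInit() merely stands in for the non-thread-safe setup (building a Configuration and calling FileSystem.get()). It runs without Hadoop so the locking pattern can be tried in isolation.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedClient {
    // Counts how many times the expensive setup actually ran.
    static final AtomicInteger initCount = new AtomicInteger();

    // volatile so a fully initialized handle is visible to every thread.
    private static volatile Object handle;
    private static final Object LOCK = new Object();

    // Hypothetical stand-in for the setup that is not safe to run
    // concurrently (e.g. new Configuration() plus FileSystem.get(conf)).
    private static Object expensiveInit() {
        initCount.incrementAndGet();
        return new Object();
    }

    static Object get() {
        Object h = handle;               // cheap check without the lock
        if (h == null) {
            synchronized (LOCK) {
                h = handle;              // re-check while holding the lock
                if (h == null) {
                    h = expensiveInit(); // only one thread gets here
                    handle = h;
                }
            }
        }
        return h;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] writers = new Thread[8];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new Thread(SharedClient::get);
            writers[i].start();
        }
        for (Thread t : writers) {
            t.join();
        }
        System.out.println("init ran " + initCount.get() + " time(s)");
    }
}
```

Threads that arrive after initialization skip the lock entirely, so the synchronization cost is only paid during startup, when the writer threads all launch at nearly the same time.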

I was happy to see it go.

2) NotYetReplicatedException

I have not been able to get this one to go away.

These exceptions are intermittent. Sometimes they are merely warnings and go away after a couple of appearances. Other times they run out of retries and cause havoc.

I suspect the particular hosts where this software is running may be pushing excessive data into this particular cluster’s HDFS.

Evidently there is some async behavior between verifying that a block has been written and requesting the next block, but the true nature of the problem is not exactly clear to me.

In this case, I’m feeling the problem isn’t the client code but Hadoop itself. It seems unfortunate as this cluster is good sized at one thousand nodes.

To be fair, there is a lot of HDFS writing going on on the client hosts I’m working on.

However… I find this a pretty weak situation. I struggle to fathom how such a large cluster is choking on something like writing data.

Hopefully I get to the bottom of this.

Update 2012-05-15

It appears I have found a solution, or at least a workaround, for the fatal block-not-yet-replicated exceptions:

<property>
<name>dfs.client.block.write.locateFollowingBlock.retries</name>
<value>10</value>
</property>

I can’t say whether this is generic Apache configuration or something site specific.

I can say it looks like it did the trick :)

(knock on wood)

Update 2012-06-26

The Hadoop FileSystem object is static, and indeed multiple threads have to synchronize on shared objects.

There is a list holding outbound packets. The sizes of the packet and the list are hard-coded, at least in my version (0.22, I think).

In that sense, writing multiple HDFS text files from individual threads won’t get you very far.

Sequence files seem to have less contention.

Building libhdfs on Solaris

I spent the last couple of days getting a libhdfs-dependent project working on Solaris. Essentially this package, which is part of Hadoop core, allows you to interact with HDFS from within your C program using JNI.

These instructions are for building 64bit libhdfs libraries on Solaris AMD64.

Step 1

Get a copy of Hadoop on your Solaris system.

Odds are you will be behind a firewall. I suggest you procure a Linux system outside of the firewall, install the Squid proxy on it, and then ssh from the Linux system to your Solaris system, forwarding port 3128. Set up the proxy variables on the Solaris system. You will need external HTTP access to satisfy Hadoop’s dependencies from Ivy.

export http_proxy=127.0.0.1:3128
export ANT_OPTS="-Dhttp.proxyHost=127.0.0.1 -Dhttp.proxyPort=3128"

You will also need some other GNU packages to move forward. I used m4-1.4.15, automake-1.11, and autoconf-2.68. For each of these I downloaded the source, unpacked it, ran ./configure --prefix=$HOME, and then make install, with ~/bin in my path. Lastly, you need a 64-bit JDK (include, jre/lib/amd64/server) with JAVA_HOME set accordingly. Oh yeah, you need ant too.

Step 2

Here we will start a build that will eventually fail.

Unpack your distribution and move into its base directory.

ant -Divy.checksums="" compile-c++-libhdfs -Dlibhdfs=1 -Dcompile.c++=1 -Dneed.libhdfs.configure=1

Expect to see some Ivy output then a configure start and lastly a build fail.

Step 3

Tweak the configuration to build.

cd src/c++/libhdfs
export JVM_ARCH=64
CC=cc ./configure

Next you need to modify the Makefile.

Change1: Find this "-Wl,-x" and replace with "-Wl -x"
Change2: delete all the -L entries for JVM lib directories except the one that has amd64/server in it

Lastly you are ready to build

make

Step 4

You will find the libraries here:

cd src/c++/libhdfs/.libs

Step 5

To use the library you built you will need:

  • The headers in src/c++/libhdfs,
  • The library you built -L<directory> and -lhdfs
  • The JVM library -L<JAVA_HOME>/jre/lib/amd64/server and -ljvm

Strata Recap

Well today was the last day of Strata 2012 in Santa Clara, CA.

It was my first Strata conference and I have to say I wasn’t that impressed. As far as conference presentations go, there are always some winners and some losers. It felt like there were too many losers at Strata. Even some of the keynotes were embarrassingly weak.

So what was good?

  1. Avinash Kaushik quoting Donald Rumsfeld (known knowns, known unknowns, and unknown unknowns). A very energetic, engaging, and entertaining presenter. His material was interesting too. His exploration of improving the typical sort to focus not on the usual min/max but on the estimated max value was insightful, as were his examples illustrating techniques to make reporting more actionable, with suggested correlations and variable sensitivities for identifying unknowns.
  2. Mark Madsen relating BigData to the introduction of Data Warehousing. Not so much in a condescending manner but more about applying lessons previously learned. One main takeaway: make your BigData platform reusable. He alluded to high-value projects that eventually fall through the cracks. We need to stay hungry.
  3. Arun Murthy describing what’s new in Hadoop 0.23 and what it takes. I have to say the novelty of hearing from the Chairman/Release Manager was cooler than the 0.23 features. With releases like that you can see the hype curve drop a little steeper. When asked about workload management for multi-tenant systems, his answer was that Hadoop already does that, and apparently the answer is simply bigger clusters with more powerful components.
  4. Mike Olson from Cloudera had a short keynote that was going interesting places. Platform agnostic, focused on industry areas for BigData: drugs, guns, and oil. Would have been nice to hear him take the story further.

Other notable entries: Alasdair Allan (illustrated all the data our mobile devices leak, regardless of policy or for lack of it, and the implications for us)

p.s. I had dinner with some Microsoft folks and they discussed their new appliance: MPP SQL Server on HP hardware. I think it scales to four cabinets @ 600TB. For small Microsoft shops it could make sense, and it gives Microsoft something to sell there.
