Archive for March, 2012


Building libhdfs on Solaris

I spent the last couple of days getting a libhdfs-dependent project working on Solaris. This package, which is part of Hadoop core, lets you interact with HDFS from within a C program through JNI.

These instructions are for building 64-bit libhdfs libraries on Solaris AMD64.

Step 1

Get a copy of Hadoop on your Solaris system.

Odds are you will be behind a firewall. I suggest you procure a Linux system outside the firewall, install the Squid proxy on it, and then ssh from the Linux system to your Solaris system, forwarding port 3128 (a remote forward, ssh -R, so that 127.0.0.1:3128 on the Solaris side reaches Squid). Then set up the proxy variables on the Solaris system. You will need external HTTP access so that Ivy can fetch Hadoop's dependencies.

export http_proxy=127.0.0.1:3128
export ANT_OPTS="-Dhttp.proxyHost=127.0.0.1 -Dhttp.proxyPort=3128"

You will also need some other GNU packages to move forward. I used m4-1.4.15, automake-1.11, and autoconf-2.68. For each of these I downloaded the source, unpacked it, ran ./configure --prefix=$HOME, and then make install, with ~/bin in my path. Lastly, you need a 64-bit JDK (include, jre/lib/amd64/server) with JAVA_HOME set accordingly. Oh yeah, you need ant too.

Step 2

Here we will start a build that will eventually fail.

Unpack your distribution and move into its base directory.

ant -Divy.checksums="" compile-c++-libhdfs -Dlibhdfs=1 -Dcompile.c++=1 -Dneed.libhdfs.configure=1

Expect to see some Ivy output, then configure starting, and finally a build failure.

Step 3

Tweak the configuration to build.

cd src/c++/libhdfs
export JVM_ARCH=64
CC=cc ./configure

Next you need to modify the Makefile.

Change 1: Find "-Wl,-x" and replace it with "-Wl -x".
Change 2: Delete all the -L entries for JVM lib directories except the one that has amd64/server in it.
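
If you would rather not hand-edit the Makefile, the two changes can be sketched as a small Python helper. This is only a sketch under assumptions: the function name patch_makefile_text is made up for illustration, and it guesses that a "JVM lib directory" entry is any -L token whose path contains /jre/lib/. Check that against the Makefile that configure actually generated before trusting it.

```python
import re


def patch_makefile_text(text):
    """Apply the two manual Makefile changes to the Makefile's text."""
    # Change 1: replace the flag "-Wl,-x" with "-Wl -x".
    text = text.replace("-Wl,-x", "-Wl -x")

    # Change 2: drop -L entries for JVM lib directories, keeping the
    # amd64/server one.
    # ASSUMPTION: a JVM lib directory entry is a -L token whose path
    # contains "/jre/lib/"; adjust this test to fit your Makefile.
    def keep_or_drop(match):
        token = match.group(0)
        is_jvm_dir = "/jre/lib/" in token
        if is_jvm_dir and "amd64/server" not in token:
            return ""  # remove the token and its trailing spaces
        return token

    # Match whole -L tokens (not preceded by a non-space character),
    # plus trailing spaces or tabs, but never the newline.
    return re.sub(r"(?<!\S)-L\S+[ \t]*", keep_or_drop, text)
```

To use it, read the Makefile into a string, pass it through patch_makefile_text, and write the result back (keeping a copy of the original in case the assumption above does not hold).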

Finally, you are ready to build:

make

Step 4

You will find the libraries here (relative to the distribution's base directory):

cd src/c++/libhdfs/.libs

Step 5

To use the library you built you will need:

  • The headers in src/c++/libhdfs
  • The library you built: -L<directory> and -lhdfs
  • The JVM library: -L<JAVA_HOME>/jre/lib/amd64/server and -ljvm

Strata Recap

Well, today was the last day of Strata 2012 in Santa Clara, CA.

It was my first Strata conference and I have to say I wasn’t that impressed. As far as conference presentations go, there are always some winners and some losers. It felt like there were too many losers at Strata. Even some of the keynotes were embarrassingly weak.

So what was good?

  1. Avinash Kaushik quoting Donald Rumsfeld (known knowns, known unknowns, and unknown unknowns). A very energetic, engaging, and entertaining presenter. His material was interesting too. His exploration of improving the typical sort, focusing not on the usual min/max but on the estimated max value, was insightful, as were his examples illustrating techniques to make reporting more actionable, with suggested correlations and variable sensitivities for identifying unknowns.
  2. Mark Madsen relating BigData to the introduction of Data Warehousing. Not so much in a condescending manner, but more about applying lessons previously learned. One main takeaway: make your BigData platform reusable. He alluded to high-value projects that eventually fall through the cracks. We need to stay hungry.
  3. Arun Murthy describing what’s new in Hadoop 0.23 and what it takes. I have to say the novelty of hearing from the Chairman/Release Manager was cooler than the 0.23 features. With releases like that you can see the hype curve drop a little steeper. When asked about workload management for multi-tenant systems, his answer was that Hadoop already does that, and apparently the answer is simply bigger clusters with more powerful components.
  4. Mike Olson from Cloudera had a short keynote that was going interesting places. Platform agnostic, focused on industry areas for BigData: drugs, guns, and oil. It would have been nice to hear him take the story further.

Other notable entries: Alasdair Allan (illustrated all the data our mobile devices leak, regardless of or for lack of policy, and its implications for us).

P.S. I had dinner with some Microsoft folks and they discussed their new appliance, MPP SQL Server on HP hardware. I think it scales to four cabinets @ 600TB. For small Microsoft shops it could make sense, and it gives Microsoft something to sell there.

Bare bones text functions

What would be your list of essential text functions?

Well, here’s what I’m thinking:

  1. Regular expression replace (nice for cleaning)
  2. Regular expression match (nice for filtering out garbage)
  3. Regular expression split (very useful)
  4. Word difference (i.e. Levenshtein)
  5. N-gram (word / character; regex split would give the same thing, but this might make it simpler)
  6. Sentences (similar story as n-gram)
  7. Various language flavors of the Snowball stemmer (not perfect but simple)
  8. WordNet parser (the confidence level that the words in a phrase are particular language component types)
  9. HTML parser (there seems to be heavy fanboy mania for Beautiful Soup, but I don’t know how it really compares to tidy-ish tools)
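
To make the first five items concrete, here is a minimal Python sketch using only the standard library. The function names (clean, is_garbage, split_words, ngrams, levenshtein) are made up for illustration, and the patterns are just one plausible reading of "cleaning" and "garbage"; a real project would tune them.

```python
import re


def clean(text):
    # Regex replace: collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def is_garbage(token):
    # Regex match: treat a token with no letters or digits as garbage.
    return re.fullmatch(r"[^A-Za-z0-9]+", token) is not None


def split_words(text):
    # Regex split: break on runs of non-word characters, drop empties.
    return [w for w in re.split(r"\W+", text) if w]


def ngrams(items, n):
    # Works on a list of words (word n-grams) or a string (character n-grams).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

For example, ngrams(split_words("make your platform reusable"), 2) gives word bigrams, while ngrams("hdfs", 2) gives character bigrams.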

Beyond the basics, these are supplementary functions I’ve seen pop up in different projects over the last couple of months:

  1. Shingle and hash for potential duplication identification
  2. Conditional random fields for attribute extraction
  3. Miscellaneous scoring on: target word ratio to corpus, target word position relative to begin/end, target word count, etc.
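
As a rough illustration of the first item, here is one way shingle-and-hash duplicate detection could look in Python. It is a sketch under assumptions, not how any of those projects did it: the shingle size k=3, the SHA-1 hash truncated to 32 bits, and the function names are all illustrative choices, and Jaccard similarity is just one common way to compare the resulting sets.

```python
import hashlib
import re


def shingles(text, k=3):
    # Overlapping runs of k lowercased words.
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hashed_shingles(text, k=3):
    # Hash each shingle to a 32-bit integer so the sets stay compact.
    # ASSUMPTION: truncated SHA-1 is an arbitrary choice for this sketch.
    return {
        int(hashlib.sha1(s.encode("utf-8")).hexdigest()[:8], 16)
        for s in shingles(text, k)
    }


def jaccard(a, b):
    # Set overlap: 1.0 means identical shingle sets, 0.0 means disjoint.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two documents whose hashed shingle sets have a Jaccard score near 1.0 are candidates for a closer duplicate check; where to put the threshold depends on the data.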