I need something like Flume but not on the infrastructure side, more on the receiving, accumulation, and preparation side.

Say I have web logs. I can get them near Hadoop easily, and I can get them into Hadoop pretty easily. And then it's like: what's next?

  • Sessionize
  • Bot Classification
  • Dimension blow-out
  • Core value extraction
  • Package
  • Publish
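To make the first of those steps concrete, here's a rough sketch of what sessionize might look like as the reduce side of a hand-coded job. This is just an illustration under assumed conventions: events arrive keyed by user and sorted by timestamp (as they would at a reducer), and the 30-minute session timeout is a hypothetical choice, not anything Flume or Hadoop prescribes.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed session timeout, pick your own

def sessionize(events):
    """Group (user, timestamp) log events into sessions.

    Events are assumed pre-sorted by user, then timestamp, as they
    would arrive at a reducer keyed on user. A new session starts
    whenever the gap since the previous event exceeds SESSION_GAP.
    Returns a list of (user, session_id, [timestamps]) tuples.
    """
    sessions = []
    prev_user, prev_ts, session_id = None, None, 0
    for user, ts in events:
        if user != prev_user:
            # first event for this user: start session 0
            session_id = 0
            sessions.append((user, session_id, [ts]))
        elif ts - prev_ts > SESSION_GAP:
            # gap too long: close the old session, open a new one
            session_id += 1
            sessions.append((user, session_id, [ts]))
        else:
            # continuation of the current session
            sessions[-1][2].append(ts)
        prev_user, prev_ts = user, ts
    return sessions
```

The map side would only need to emit a (user, timestamp) pair per log line; the shuffle's sort-by-key does the grouping for you, which is why sessionization is one of the more natural fits for MapReduce.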
Presently we do most of this in SQL. It works. But how could we leverage Hadoop for this?
Are there existing packages that have some of this functionality? Or would it all be hand-coded MapReduce jobs?
Anyway, I'm not seeing Flume fill this role, so I called it out, but no fault intended.