Coding a custom InputFormat to read a sequence file into Hive
Okay, so I have a sequence file that I want to be read into a table in
Hive. From my understanding a sequence file is composed of keys and
values. Both my keys and values are Java objects. The key object has a few
things stored in it, such as a name, time, etc. Each value object contains
a large 1D array of data.
I know I need to create a custom InputFormat, SerDe, and OutputFormat.
However, I am having a lot of trouble figuring out how to get started on these
pieces of code. My focus currently is the InputFormat. I have a small
template for the InputFormat and I have no idea how to fill in the
getSplits() and getRecordReader() methods.
My goal is to break the sequence file down into individual data points.
Each value object holds multiple data points, so each value object would
have to be split into multiple records, one per data point.
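To make that goal concrete, here is a rough sketch of the flattening step in
plain Java, with no Hadoop or Hive classes involved. Everything in it is
an assumption for illustration: SampleKey, SampleValue, DataPoint and
DataPointReader are hypothetical names, and two in-memory lists stand in
for the records that would really come out of the sequence file. The
next() method loosely mirrors the shape of next(key, value) on the old
org.apache.hadoop.mapred RecordReader interface: it fills in a reusable
row object and returns false once the input is exhausted.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the key object (name, time, etc.).
final class SampleKey {
    final String name;
    final long time;
    SampleKey(String name, long time) { this.name = name; this.time = time; }
}

// Hypothetical stand-in for the value object holding a large 1D array.
final class SampleValue {
    final double[] data;
    SampleValue(double[] data) { this.data = data; }
}

// One output row: a single data point plus the key fields it came from.
final class DataPoint {
    String name;
    long time;
    int index;     // position of the point inside the original array
    double value;
}

// Flattens (key, value) records into one DataPoint per array element.
final class DataPointReader {
    private final Iterator<SampleKey> keys;
    private final Iterator<SampleValue> values;
    private SampleKey currentKey;
    private double[] currentData = new double[0];
    private int nextIndex = 0;

    DataPointReader(List<SampleKey> keys, List<SampleValue> values) {
        this.keys = keys.iterator();
        this.values = values.iterator();
    }

    // Fills `row` with the next data point; returns false when no input is left.
    boolean next(DataPoint row) {
        // Advance to the next record once the current array is used up.
        // The loop also skips records whose array is empty.
        while (nextIndex >= currentData.length) {
            if (!keys.hasNext() || !values.hasNext()) {
                return false;
            }
            currentKey = keys.next();
            currentData = values.next().data;
            nextIndex = 0;
        }
        row.name = currentKey.name;
        row.time = currentKey.time;
        row.index = nextIndex;
        row.value = currentData[nextIndex];
        nextIndex++;
        return true;
    }

    // Convenience helper: drains the reader into "name,time,index,value" strings.
    static List<String> readAll(DataPointReader reader) {
        List<String> rows = new ArrayList<>();
        DataPoint row = new DataPoint();
        while (reader.next(row)) {
            rows.add(row.name + "," + row.time + "," + row.index + "," + row.value);
        }
        return rows;
    }
}
```

In a real InputFormat, the inner (key, value) pairs would presumably come
from a sequence file reader restricted to one split rather than from
lists, and the row would be handed to the SerDe as a Writable. The
bookkeeping of "current array plus next index" should carry over
unchanged.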
I'm hoping to get a good understanding of what is going on. So, any help
in explaining what these classes do exactly would be great. I think I am
having a lot of trouble understanding what the inputs are into these
methods and how I can access the data to be split properly.
Much appreciated! Jeff