Hadoop Reduce Output File Never Created for Large Data
Asked by mj_ on 2013-05-15T03:51:31

I'm writing an application in Java on Hadoop 1.1.1 (Ubuntu) that compares strings in order to find the longest common substrings. I've got both the map and reduce phases running successfully for small data sets, but whenever I increase the size of the input, my reduce output never appears in my target output directory. It doesn't complain at all, which makes this all the weirder. I'm running everything in Eclipse and I have 1 mapper and 1 reducer.
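
For reference, the driver follows the standard Hadoop 1.x pattern, roughly along these lines (the class names and the mapper/reducer references below are illustrative placeholders, not my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubstringDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "longest common substring");
        job.setJarByClass(SubstringDriver.class);

        job.setMapperClass(SubstringMapper.class);    // placeholder for my mapper
        job.setReducerClass(SubstringReducer.class);  // placeholder for my reducer
        job.setNumReduceTasks(1);                     // single reducer

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion(true) blocks until the job finishes and returns
        // whether it actually succeeded.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}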

My reducer finds the longest common substring in a collection of strings and then emits the substring as the key and the index of each string that contains it as the value. Here's a short example.

Input Data

0: ALPHAA
1: ALPHAB
2: ALZHA

Output Emitted

Key: ALPHA  Value: 0
Key: ALPHA  Value: 1
Key: AL  Value: 0
Key: AL  Value: 1
Key: AL  Value: 2

The first two input strings both share "ALPHA" as a common substring, while all three share "AL". I end up indexing the substrings and writing them to a database when the process is complete.
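
The emit pattern in the reducer looks roughly like the sketch below (the key/value layout and the naive substring search are simplified placeholders, not my actual implementation):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SubstringReducer extends Reducer<Text, Text, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // In this sketch each incoming value carries "index:string";
        // the real key/value layout in my job may differ.
        List<Integer> indexes = new ArrayList<Integer>();
        List<String> strings = new ArrayList<String>();
        for (Text value : values) {
            String[] parts = value.toString().split(":", 2);
            indexes.add(Integer.valueOf(parts[0]));
            strings.add(parts[1]);
        }

        String common = findLongestCommonSubstring(strings);
        if (common.isEmpty()) {
            return;
        }

        // Emit the shared substring as the key and the index of every
        // string that contains it as the value, e.g. (ALPHA, 0), (ALPHA, 1).
        for (int i = 0; i < strings.size(); i++) {
            if (strings.get(i).contains(common)) {
                context.write(new Text(common), new IntWritable(indexes.get(i)));
            }
        }
    }

    // Naive illustration only: longest substring of the first collected
    // string that also appears in at least one of the other strings.
    private static String findLongestCommonSubstring(List<String> strings) {
        String first = strings.get(0);
        String best = "";
        for (int i = 0; i < first.length(); i++) {
            for (int j = first.length(); j - i > best.length(); j--) {
                String candidate = first.substring(i, j);
                for (String other : strings.subList(1, strings.size())) {
                    if (other.contains(candidate)) {
                        best = candidate;
                        break;
                    }
                }
            }
        }
        return best;
    }
}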

An additional observation: I can see that intermediate files are created in my output directory; it's just that the reduced data is never written to an output file.

I've pasted the Hadoop output log below. It claims a number of output records from the reducer, but they seem to disappear. Any suggestions are appreciated.

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Use GenericOptionsParser for parsing the arguments. Applications should implement Tool     for the same.
No job jar file set.  User classes may not be found. See JobConf(Class) or     JobConf#setJar(String).
Total input paths to process : 1
Running job: job_local_0001
setsid exited with exit code 0
 Using ResourceCalculatorPlugin :     org.apache.hadoop.util.LinuxResourceCalculatorPlugin@411fd5a3
Snappy native library not loaded
io.sort.mb = 100
data buffer = 79691776/99614720
record buffer = 262144/327680
 map 0% reduce 0%
Spilling map output: record full = true
bufstart = 0; bufend = 22852573; bufvoid = 99614720
kvstart = 0; kvend = 262144; length = 327680
Finished spill 0
Starting flush of map output
Finished spill 1
Merging 2 sorted segments
Down to the last merge-pass, with 2 segments left of total size: 28981648 bytes

Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

Task attempt_local_0001_m_000000_0 done.
 Using ResourceCalculatorPlugin :     org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3aff2f16

Merging 1 sorted segments
Down to the last merge-pass, with 1 segments left of total size: 28981646 bytes

 map 100% reduce 0%
reduce > reduce
 map 100% reduce 66%
reduce > reduce
 map 100% reduce 67%
reduce > reduce
reduce > reduce
 map 100% reduce 68%
reduce > reduce
reduce > reduce
reduce > reduce
 map 100% reduce 69%
reduce > reduce
reduce > reduce
 map 100% reduce 70%
reduce > reduce
job_local_0001
Job complete: job_local_0001
Counters: 22
  File Output Format Counters 
    Bytes Written=14754916
  FileSystemCounters
    FILE_BYTES_READ=61475617
    HDFS_BYTES_READ=97361881
    FILE_BYTES_WRITTEN=116018418
    HDFS_BYTES_WRITTEN=116746326
  File Input Format Counters 
    Bytes Read=46366176
  Map-Reduce Framework
    Reduce input groups=27774
    Map output materialized bytes=28981650
    Combine output records=0
    Map input records=4629524
    Reduce shuffle bytes=0
    Physical memory (bytes) snapshot=0
    Reduce output records=832559
    Spilled Records=651304
    Map output bytes=28289481
    CPU time spent (ms)=0
    Total committed heap usage (bytes)=2578972672
    Virtual memory (bytes) snapshot=0
    Combine input records=0
    Map output records=325652
    SPLIT_RAW_BYTES=136
    Reduce input records=27774
reduce > reduce
reduce > reduce

Content author: mj_. Reproduced under the CC BY-SA 4.0 license with a link to the original source.
Original question: https://stackoverflow.com/questions/16551710/hadoop-reduce-output-file-never-created-for-large-data
