Mapreduce Map Class Setup Method Read a File

Hadoop - MapReduce

MapReduce is a framework for writing applications that process huge amounts of data in parallel, on large clusters of commodity hardware, in a reliable manner.

What is MapReduce?

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map task.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.

The Algorithm

Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
- Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
- Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
The framework manages all the details of data-passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
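
The three stages described above can be imitated in plain Java on a single machine. The following is only a minimal sketch: it does not use Hadoop, the class name StagesSketch and its methods are hypothetical, and counting words is an illustrative task chosen here, not one taken from this tutorial.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine imitation of the map, shuffle and reduce stages.
public class StagesSketch {

    // Map stage: one input line in, zero or more (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle stage: group all values by key; the TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: collapse all values of one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three stages in order over the given input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {a=2, b=2, c=1}
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network; the sketch only shows how data flows from one stage to the next.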

MapReduce Algorithm

Inputs and Outputs (Java Perspective)

The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes must be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Input and Output types of a MapReduce job − (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).

	Input	Output
Map	<k1, v1>	list (<k2, v2>)
Reduce	<k2, list(v2)>	list (<k3, v3>)
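
The serialization and sorting requirements on key classes can be sketched as follows. This is an assumption-laden illustration: YearKey is a hypothetical class, and it does not actually implement Hadoop's interfaces so that it compiles without Hadoop on the classpath. Its write and readFields methods only mirror the shape of the two methods declared by Writable, and compareTo plays the role that WritableComparable adds.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key type. With Hadoop available it would declare
// "implements WritableComparable<YearKey>"; here it only implements Comparable.
public class YearKey implements Comparable<YearKey> {
    private int year;

    // No-argument constructor so an empty instance can be filled by readFields.
    public YearKey() { }

    public YearKey(int year) {
        this.year = year;
    }

    public int getYear() {
        return year;
    }

    // Same shape as Writable.write: serialize the fields to a binary stream.
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
    }

    // Same shape as Writable.readFields: restore the fields in the same order.
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
    }

    // Ordering used when keys are sorted.
    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(year, other.year);
    }

    // Helper for this sketch only: serialize a key and read it back.
    static YearKey roundTrip(YearKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            YearKey copy = new YearKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new YearKey(1981)).getYear());              // prints 1981
        System.out.println(new YearKey(1979).compareTo(new YearKey(1985)) < 0);  // prints true
    }
}
```

In practice the built-in types such as Text, IntWritable, and LongWritable already satisfy these requirements, so a custom key is only needed for composite or otherwise unusual keys.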

Terminology

PayLoad − Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where data is present in advance, before any processing takes place.
MasterNode − Node where JobTracker runs and which accepts job requests from clients.
SlaveNode − Node where the Map and Reduce programs run.
JobTracker − Schedules jobs and tracks the jobs assigned to the Task Tracker.
Task Tracker − Tracks the task and reports status to JobTracker.
Job − An execution of a Mapper and Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.

Example Scenario

Given below is the data regarding the electrical consumption of an organization. It contains the monthly electrical consumption and the annual average for various years.

	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Avg
1979	23	23	2	43	24	25	26	26	26	26	25	26	25
1980	26	27	28	28	28	30	31	31	31	30	30	30	29
1981	31	32	32	32	33	34	35	36	36	34	34	34	34
1984	39	38	39	39	39	41	42	43	40	39	38	38	40
1985	38	39	39	39	39	41	41	41	00	40	39	39	45

If the above data is given as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on. This is easy for programmers when the number of records is small. They will simply write the logic to produce the required output and pass the data to the application.
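
For a handful of records, that logic can be written as an ordinary sequential program. The sketch below is one possible version under stated assumptions: the class SmallDatasetSketch and its method names are hypothetical, and the figures are the annual averages from the Avg column of the table above, typed in by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sequential solution for a small, in-memory dataset.
public class SmallDatasetSketch {

    // Annual averages copied from the Avg column of the sample table.
    static Map<Integer, Integer> sampleAverages() {
        Map<Integer, Integer> avgByYear = new LinkedHashMap<>();
        avgByYear.put(1979, 25);
        avgByYear.put(1980, 29);
        avgByYear.put(1981, 34);
        avgByYear.put(1984, 40);
        avgByYear.put(1985, 45);
        return avgByYear;
    }

    // Returns the year with the largest annual average (first one wins on ties).
    static int yearOfMax(Map<Integer, Integer> avgByYear) {
        if (avgByYear.isEmpty()) {
            throw new IllegalArgumentException("no records");
        }
        Integer bestYear = null;
        int best = Integer.MIN_VALUE;
        for (Map.Entry<Integer, Integer> e : avgByYear.entrySet()) {
            if (bestYear == null || e.getValue() > best) {
                bestYear = e.getKey();
                best = e.getValue();
            }
        }
        return bestYear;
    }

    // Returns the year with the smallest annual average (first one wins on ties).
    static int yearOfMin(Map<Integer, Integer> avgByYear) {
        if (avgByYear.isEmpty()) {
            throw new IllegalArgumentException("no records");
        }
        Integer bestYear = null;
        int best = Integer.MAX_VALUE;
        for (Map.Entry<Integer, Integer> e : avgByYear.entrySet()) {
            if (bestYear == null || e.getValue() < best) {
                bestYear = e.getKey();
                best = e.getValue();
            }
        }
        return bestYear;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> data = sampleAverages();
        System.out.println("max: " + yearOfMax(data)); // prints max: 1985
        System.out.println("min: " + yearOfMin(data)); // prints min: 1979
    }
}
```

This approach keeps all records in the memory of one machine, which is exactly what stops working once the dataset grows.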

But think of the data representing the electrical consumption of all the large-scale industries of a particular country, since its formation.

When we write applications to process such bulk data,

- They will take a lot of time to execute.
- There will be heavy network traffic when we move data from the source to the network server, and so on.

To solve these problems, we have the MapReduce framework.

Input Data

The above data is saved as sample.txt and given as input. The input file looks as shown below.

1979   23   23   2   43   24   25   26   26   26   26   25   26  25
1980   26   27   28  28   28   30   31   31   31   30   30   30  29
1981   31   32   32  32   33   34   35   36   36   34   34   34  34
1984   39   38   39  39   39   41   42   43   40   39   38   38  40
1985   38   39   39  39   39   41   41   41   00   40   39   39  45

Example Program

Given below is the program to process the sample data using the MapReduce framework.

package hadoop;

import java.util.*;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits {

   // Mapper class
   public static class E_EMapper extends MapReduceBase implements
         Mapper<LongWritable,   /* Input key type */
                Text,           /* Input value type */
                Text,           /* Output key type */
                IntWritable>    /* Output value type */ {

      // Map function: emits (year, annual average) for each input line
      public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
         String line = value.toString();
         String lasttoken = null;
         // Default delimiters split on spaces as well as tabs,
         // so both layouts of sample.txt are accepted.
         StringTokenizer s = new StringTokenizer(line);
         String year = s.nextToken();

         // The last token on the line is the annual average
         while (s.hasMoreTokens()) {
            lasttoken = s.nextToken();
         }
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }

   // Reducer class
   public static class E_EReduce extends MapReduceBase implements
         Reducer<Text, IntWritable, Text, IntWritable> {

      // Reduce function: keeps only the averages greater than 30
      public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
         int maxavg = 30;
         int val = Integer.MIN_VALUE;

         while (values.hasNext()) {
            if ((val = values.next().get()) > maxavg) {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }

   // Main function
   public static void main(String args[]) throws Exception {
      JobConf conf = new JobConf(ProcessUnits.class);

      conf.setJobName("max_eletricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      // args[0] is the HDFS input directory, args[1] the output directory
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
   }
}

Save the above program as ProcessUnits.java. The compilation and execution of the program is explained below.
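
Before compiling against Hadoop, the map and reduce logic of ProcessUnits can be checked in plain Java. The following is a rough sketch only: the class ProcessUnitsSketch and its members are hypothetical, it runs on one machine without any Hadoop classes, and it assumes each year appears on exactly one input line so that grouping by key is trivial.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical Hadoop-free imitation of the ProcessUnits map and reduce logic.
public class ProcessUnitsSketch {

    // Mirrors the maxavg threshold used in E_EReduce.
    static final int THRESHOLD = 30;

    // The five records of sample.txt, whitespace-separated.
    static final List<String> SAMPLE = Arrays.asList(
        "1979 23 23 2 43 24 25 26 26 26 26 25 26 25",
        "1980 26 27 28 28 28 30 31 31 31 30 30 30 29",
        "1981 31 32 32 32 33 34 35 36 36 34 34 34 34",
        "1984 39 38 39 39 39 41 42 43 40 39 38 38 40",
        "1985 38 39 39 39 39 41 41 41 00 40 39 39 45");

    // Mirrors E_EMapper.map: the first token is the year,
    // the last token is the annual average.
    static String[] mapLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        return new String[] { tokens[0], tokens[tokens.length - 1] };
    }

    // Mirrors E_EReduce.reduce: emit "year<TAB>average" only when
    // the average is greater than the threshold.
    static List<String> run(List<String> lines) {
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            String[] pair = mapLine(line);
            int avg = Integer.parseInt(pair[1]);
            if (avg > THRESHOLD) {
                output.add(pair[0] + "\t" + avg);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Prints three lines: 1981 34, 1984 40, 1985 45 (tab-separated)
        for (String record : run(SAMPLE)) {
            System.out.println(record);
        }
    }
}
```

The years 1979 and 1980 are dropped because their averages (25 and 29) do not exceed the threshold of 30.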

Compilation and Execution of Process Units Program

Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1

The following command creates a directory to store the compiled Java classes.

$ mkdir units

Step 2

Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. Visit mvnrepository.com to download the jar. Let us assume the download folder is /home/hadoop/.

Step 3

The following commands are used for compiling the ProcessUnits.java program and creating a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .

Step 4

The following command is used to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 5

The following command is used to copy the input file named sample.txt into the input directory of HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir

Step 6

The following command is used to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 7

The following command is used to run the Eleunit_max application by taking the input files from the input directory.

$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir

Wait for a while until the job is executed. After execution, as shown below, the output will contain the number of input splits, the number of Map tasks, the number of reducer tasks, etc.

INFO mapreduce.Job: Job job_1414748220717_0002 completed successfully
14/10/31 06:02:52 INFO mapreduce.Job: Counters: 49
   File System Counters
      FILE: Number of bytes read = 61
      FILE: Number of bytes written = 279400
      FILE: Number of read operations = 0
      FILE: Number of large read operations = 0
      FILE: Number of write operations = 0
      HDFS: Number of bytes read = 546
      HDFS: Number of bytes written = 40
      HDFS: Number of read operations = 9
      HDFS: Number of large read operations = 0
      HDFS: Number of write operations = 2
   Job Counters
      Launched map tasks = 2
      Launched reduce tasks = 1
      Data-local map tasks = 2
      Total time spent by all maps in occupied slots (ms) = 146137
      Total time spent by all reduces in occupied slots (ms) = 441
      Total time spent by all map tasks (ms) = 14613
      Total time spent by all reduce tasks (ms) = 44120
      Total vcore-seconds taken by all map tasks = 146137
      Total vcore-seconds taken by all reduce tasks = 44120
      Total megabyte-seconds taken by all map tasks = 149644288
      Total megabyte-seconds taken by all reduce tasks = 45178880
   Map-Reduce Framework
      Map input records = 5
      Map output records = 5
      Map output bytes = 45
      Map output materialized bytes = 67
      Input split bytes = 208
      Combine input records = 5
      Combine output records = 5
      Reduce input groups = 5
      Reduce shuffle bytes = 6
      Reduce input records = 5
      Reduce output records = 5
      Spilled Records = 10
      Shuffled Maps = 2
      Failed Shuffles = 0
      Merged Map outputs = 2
      GC time elapsed (ms) = 948
      CPU time spent (ms) = 5160
      Physical memory (bytes) snapshot = 47749120
      Virtual memory (bytes) snapshot = 2899349504
      Total committed heap usage (bytes) = 277684224
   File Output Format Counters
      Bytes Written = 40

Step 8

The following command is used to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/

Step nine

The following command is used to run across the output in Role-00000 file. This file is generated by HDFS.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000

Below is the output generated by the MapReduce program.

1981    34
1984    40
1985    45

Step 10

The following command is used to copy the output folder from HDFS to the local file system for analysis.

$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop

Important Commands

All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running the Hadoop script without any arguments prints the description for all commands.

Usage − hadoop [--config confdir] COMMAND

The following table lists the options available and their description.

Sr.No.	Option & Description
1	namenode -format Formats the DFS filesystem.
2	secondarynamenode Runs the DFS secondary namenode.
3	namenode Runs the DFS namenode.
4	datanode Runs a DFS datanode.
5	dfsadmin Runs a DFS admin client.
6	mradmin Runs a Map-Reduce admin client.
7	fsck Runs a DFS filesystem checking utility.
8	fs Runs a generic filesystem user client.
9	balancer Runs a cluster balancing utility.
10	oiv Applies the offline fsimage viewer to an fsimage.
11	fetchdt Fetches a delegation token from the NameNode.
12	jobtracker Runs the MapReduce job Tracker node.
13	pipes Runs a Pipes job.
14	tasktracker Runs a MapReduce task Tracker node.
15	historyserver Runs job history servers as a standalone daemon.
16	job Manipulates the MapReduce jobs.
17	queue Gets information regarding JobQueues.
18	version Prints the version.
19	jar <jar> Runs a jar file.
20	distcp <srcurl> <desturl> Copies files or directories recursively.
21	distcp2 <srcurl> <desturl> DistCp version 2.
22	archive -archiveName Name -p <parent path> <src> <dest> Creates a hadoop archive.
23	classpath Prints the class path needed to get the Hadoop jar and the required libraries.
24	daemonlog Get/Set the log level for each daemon.

How to Interact with MapReduce Jobs

Usage − hadoop job [GENERIC_OPTIONS]

The following are the Generic Options available in a Hadoop job.

Sr.No.	GENERIC_OPTION & Description
1	-submit <job-file> Submits the job.
2	-status <job-id> Prints the map and reduce completion percentage and all job counters.
3	-counter <job-id> <group-name> <countername> Prints the counter value.
4	-kill <job-id> Kills the job.
5	-events <job-id> <fromevent-#> <#-of-events> Prints the events' details received by jobtracker for the given range.
6	-history [all] <jobOutputDir> Prints job details, failed and killed tip details. More details about the job, such as successful tasks and task attempts made for each task, can be viewed by specifying the [all] option.
7	-list [all] With all, displays all jobs. Plain -list displays only jobs which are yet to complete.
8	-kill-task <task-id> Kills the task. Killed tasks are NOT counted against failed attempts.
9	-fail-task <task-id> Fails the task. Failed tasks are counted against failed attempts.
10	-set-priority <job-id> <priority> Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.

To see the status of a job

$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004

To see the history of a job output-dir

$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME>
e.g.
$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output

To kill the job

$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004

Source: https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm