Data Warehouse Training, Day 9

1. YARN Deployment

Refer to the official documentation.

Edit the corresponding configuration files:

[root@hadoop001 hadoop]$ su - hadoop
[hadoop@hadoop001 hadoop]$ cd app/hadoop/etc/hadoop/
[hadoop@hadoop001 hadoop]$ mv mapred-site.xml.template mapred-site.xml
# There is no mapred-site.xml by default; rename the .template file to create it
[hadoop@hadoop001 hadoop]$ vi mapred-site.xml
# Do not delete the first line of this file; it is the XML declaration that marks it as an XML file

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

[hadoop@hadoop001 hadoop]$ vi yarn-site.xml

<configuration>

    <!-- Site specific YARN configuration properties -->    # this comment line can be deleted
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>hadoop001:18808</value>    # change the host to your own machine name and pick a different port (valid range 1-65535)
    </property>
</configuration>

The default YARN web UI port is 8088.
If port 8088 is exposed to the public internet it is a common attack target: compromised clusters get crypto-mining programs planted on them.
On an internal network 8088 is nothing to fear; if it must be reachable from outside, change the port and add security/authentication.
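
A quick way to confirm which port the web UI actually listens on, plus a hedged sketch of limiting access to the internal subnet (firewalld commands for a CentOS 7 host; the subnet is only an example, adjust to your environment):

[hadoop@hadoop001 ~]$ ss -lntp | grep 18808     # after starting YARN, confirm the ResourceManager web UI port
[root@hadoop001 ~]# firewall-cmd --permanent \
--add-rich-rule='rule family="ipv4" source address="192.168.111.0/24" port protocol="tcp" port="18808" accept'
[root@hadoop001 ~]# firewall-cmd --reload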

Start the ResourceManager daemon and NodeManager daemon (i.e. start the YARN processes):

[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop/etc/hadoop
[hadoop@hadoop001 hadoop]$ cd ../../
[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop
[hadoop@hadoop001 hadoop]$ sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/yarn-hadoop-resourcemanager-hadoop001.out
hadoop001: starting nodemanager, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/yarn-hadoop-nodemanager-hadoop001.out
[hadoop@hadoop001 hadoop]$

Open the web UI:
http://192.168.111.131:18808/cluster # ip:port/cluster

2. Word Count

First configure the HDFS environment variables in /etc/profile:

export HADOOP_HOME=/home/hadoop/app/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Otherwise you can only run commands like bin/hdfs dfs -ls / from inside the ~/app/hadoop directory.
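
After appending those two lines (as root), reload the profile and check that the commands resolve from any directory, for example:

[hadoop@hadoop001 ~]$ source /etc/profile
[hadoop@hadoop001 ~]$ which hdfs            # should now resolve to /home/hadoop/app/hadoop/bin/hdfs
[hadoop@hadoop001 ~]$ hdfs dfs -ls /        # works from any directory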

[hadoop@hadoop001 data]$ hdfs dfs -mkdir  /wordcount2
# Create the input directory yourself; the output directory is generated by the job and must NOT already exist, otherwise the job fails
[hadoop@hadoop001 data]$ hdfs dfs -mkdir /wordcount2/input/
# Upload everything in the current local directory
[hadoop@hadoop001 word]$ hdfs dfs -put * /wordcount2/input   # -put copies from the local Linux filesystem into HDFS
20/05/11 16:14:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
[hadoop@hadoop001 word]$ hdfs dfs -ls /wordcount2/input
20/05/11 16:14:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 hadoop supergroup 29 2020-05-11 16:14 /wordcount2/input/1.log
-rw-r--r-- 1 hadoop supergroup 47 2020-05-11 16:14 /wordcount2/input/2.log

[hadoop@hadoop001 hadoop]$ hadoop jar \
./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount /wordcount2/input/ /wordcount2/output/

20/05/11 16:19:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
20/05/11 16:19:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/05/11 16:19:29 INFO input.FileInputFormat: Total input paths to process : 2

20/05/11 16:19:29 INFO mapreduce.JobSubmitter: number of splits:2
# Question: what determines the number of MapReduce splits (= the number of map tasks)?
# 1) the size of the input files
# 2) the number of input files
# 3) the split size (splitsize)
# (see the split-size sketch after the job counters below)

20/05/11 16:19:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1589182412203_0001
20/05/11 16:19:30 INFO impl.YarnClientImpl: Submitted application application_1589182412203_0001
20/05/11 16:19:30 INFO mapreduce.Job: The url to track the job:
http://hadoop001:18808/proxy/application_1589182412203_0001/
20/05/11 16:19:30 INFO mapreduce.Job: Running job: job_1589182412203_0001
20/05/11 16:19:45 INFO mapreduce.Job: Job job_1589182412203_0001 running in uber mode : false
20/05/11 16:19:45 INFO mapreduce.Job: map 0% reduce 0%
20/05/11 16:19:55 INFO mapreduce.Job: map 50% reduce 0%
20/05/11 16:19:57 INFO mapreduce.Job: map 100% reduce 0%
20/05/11 16:20:02 INFO mapreduce.Job: map 100% reduce 100%
20/05/11 16:20:02 INFO mapreduce.Job: Job job_1589182412203_0001 completed successfully
20/05/11 16:20:02 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=122
FILE: Number of bytes written=429243
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=294
HDFS: Number of bytes written=43
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=17984
Total time spent by all reduces in occupied slots (ms)=4063
Total time spent by all map tasks (ms)=17984
Total time spent by all reduce tasks (ms)=4063
Total vcore-milliseconds taken by all map tasks=17984
Total vcore-milliseconds taken by all reduce tasks=4063
Total megabyte-milliseconds taken by all map tasks=18415616
Total megabyte-milliseconds taken by all reduce tasks=4160512
Map-Reduce Framework
Map input records=25
Map output records=19
Map output bytes=139
Map output materialized bytes=128
Input split bytes=218
Combine input records=19
Combine output records=12
Reduce input groups=8
Reduce shuffle bytes=128
Reduce input records=12
Reduce output records=8
Spilled Records=24
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=286
CPU time spent (ms)=1590
Physical memory (bytes) snapshot=480874496
Virtual memory (bytes) snapshot=8181796864
Total committed heap usage (bytes)=307437568
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=76
File Output Format Counters
Bytes Written=43
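
Following up on the split question above: in the default FileInputFormat the split size is roughly max(minSize, min(maxSize, blockSize)), so each of the two small .log files became its own split, which is why the job reported number of splits:2. A hedged sketch of influencing the split size from the command line (standard Hadoop 2.x property names; the value and output path are only examples, and with files this small the count still stays at one split per file unless a combining input format is used):

[hadoop@hadoop001 hadoop]$ hadoop jar \
./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
/wordcount2/input/ /wordcount2/output_maxsplit/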

[hadoop@hadoop001 hadoop]$ hdfs dfs -ls /wordcount2/output/
20/05/11 16:35:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2020-05-11 16:20 /wordcount2/output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 43 2020-05-11 16:20 /wordcount2/output/part-r-00000

# When the job finishes, the number of output files = the number of reduce tasks (here 1), which is controlled by a job parameter; using too many reduce tasks means a heavier shuffle and makes the number of output files explode
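
A hedged sketch of changing the reduce count for this example job (mapreduce.job.reduces is the standard Hadoop 2.x property; the value of 2 and the output path are illustrative):

[hadoop@hadoop001 hadoop]$ hadoop jar \
./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount -D mapreduce.job.reduces=2 \
/wordcount2/input/ /wordcount2/output_2reducers/
# the output directory then contains part-r-00000 and part-r-00001, one file per reducer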

[hadoop@hadoop001 hadoop]$ hdfs dfs -get /wordcount2/output/*
[hadoop@hadoop001 hadoop]$ cat part-r-00000
a 1
b 7
bb 1
c 1
qi 1
ruoze 2
shi 3
wang 3
[hadoop@hadoop001 hadoop]$

3. The jps Command

The jps command in detail

3.1 Location

[hadoop@hadoop001 hadoop]$ which jps
/usr/java/jdk1.8.0_181/bin/jps

3.2 Usage

[hadoop@hadoop001 hadoop]$ jps
28256 ResourceManager
15617 FsShell
60210 FsShell
60386 Jps
27363 NameNode
59604 FsShell
27591 DataNode
28364 NodeManager
27967 SecondaryNameNode
[hadoop@hadoop001 hadoop]$

3.3 Where the per-process identification files are stored

/tmp/hsperfdata_hadoop

[hadoop@hadoop001 tmp]$ cd hsperfdata_hadoop
[hadoop@hadoop001 hsperfdata_hadoop]$ ll
total 256
-rw-------. 1 hadoop hadoop 32768 May 11 16:13 15617
-rw-------. 1 hadoop hadoop 32768 May 11 17:11 27363
-rw-------. 1 hadoop hadoop 32768 May 11 17:11 27591
-rw-------. 1 hadoop hadoop 32768 May 11 17:11 27967
-rw-------. 1 hadoop hadoop 32768 May 11 17:11 28256
-rw-------. 1 hadoop hadoop 32768 May 11 17:11 28364
-rw-------. 1 hadoop hadoop 32768 May 11 16:02 59604
-rw-------. 1 hadoop hadoop 32768 May 11 16:09 60210
[hadoop@hadoop001 hsperfdata_hadoop]$ mv 27591 27591.bak
[hadoop@hadoop001 hsperfdata_hadoop]$ ps -ef |grep DataNode
hadoop 27591 1 3 15:32 ? 00:03:15 /usr/java/jdk1.8.0_181/bin/java -Dproc_datanode \
-Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs \
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2 \
-Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,console \
-Djava.library.path=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/lib/native \
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true \
-Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true \
-Dhadoop.log.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs \
-Dhadoop.log.file=hadoop-hadoop-datanode-hadoop001.log \
-Dhadoop.home.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2 \
-Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,RFA \
-Djava.library.path=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/lib/native \
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server \
-Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS \
-Dhadoop.security.logger=ERROR,RFAS \
-Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
hadoop 60662 59557 0 17:13 pts/0 00:00:00 grep --color=auto DataNode
[hadoop@hadoop001 hsperfdata_hadoop]$ jps
28256 ResourceManager
15617 FsShell
60210 FsShell
27363 NameNode
59604 FsShell
60663 Jps
28364 NodeManager
27967 SecondaryNameNode
[hadoop@hadoop001 hsperfdata_hadoop]$   # when jps cannot see a process, check with ps -ef
[hadoop@hadoop001 hsperfdata_hadoop]$ mv 27591.bak 27591
mv: cannot stat ‘27591.bak’: No such file or directory
[hadoop@hadoop001 hsperfdata_hadoop]$ ll
total 224
-rw-------. 1 hadoop hadoop 32768 May 11 16:13 15617
-rw-------. 1 hadoop hadoop 32768 May 11 17:16 27363
-rw-------. 1 hadoop hadoop 32768 May 11 17:16 27967
-rw-------. 1 hadoop hadoop 32768 May 11 17:16 28256
-rw-------. 1 hadoop hadoop 32768 May 11 17:16 28364
-rw-------. 1 hadoop hadoop 32768 May 11 16:02 59604
-rw-------. 1 hadoop hadoop 32768 May 11 16:09 60210
# Renaming the file to .bak with mv and then trying to rename it back left the file gone; moving it to another directory and back again has no such effect
[hadoop@hadoop001 hsperfdata_hadoop]$

3.4 When jps shows "-- process information unavailable" for a process

This by itself does not tell you whether the process exists or not; be careful, especially if you rely on jps for health checks.
ps -ef | grep xxx is the real way to check whether a process is actually running!!!
If the process was killed as root, or no longer exists, jps can still list the pid with no process information.

[hadoop@hadoop001 hsperfdata_hadoop]$ jps
28256 ResourceManager
15617 FsShell
60210 FsShell
27363 NameNode
59604 FsShell
27591 -- process information unavailable
28364 NodeManager
60878 Jps
27967 SecondaryNameNode
[hadoop@hadoop001 hsperfdata_hadoop]$ ps -ef |grep 27591
hadoop 27591 1 3 15:32 ? 00:03:15 /usr/java/jdk1.8.0_181/bin/java \
-Dproc_datanode -Xmx1000m -Djava.net.preferIPv4Stack=true \
-Dhadoop.log.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs \
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2 \
-Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,console \
-Djava.library.path=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/lib/native \
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true \
-Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true \
-Dhadoop.log.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs \
-Dhadoop.log.file=hadoop-hadoop-datanode-hadoop001.log \
-Dhadoop.home.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2 \
-Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,RFA \
-Djava.library.path=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/lib/native \
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server \
-Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS \
-Dhadoop.security.logger=ERROR,RFAS \
-Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
hadoop 60889 59557 0 17:20 pts/0 00:00:00 grep --color=auto 27591

3.5 Deleting the hsperfdata file does NOT affect restarting the process!!

jps and its hsperfdata files have no effect on starting or stopping a process, but do check that stale files are cleaned up, otherwise the jps output can be misleading.

4. PID Files

4.1 Storage location: /tmp

[hadoop@hadoop001 tmp]$ ll
total 7904
-rw-rw-r--. 1 hadoop hadoop 6 May 11 15:32 hadoop-hadoop-datanode.pid
-rw-rw-r--. 1 hadoop hadoop 6 May 11 15:32 hadoop-hadoop-namenode.pid
-rw-rw-r--. 1 hadoop hadoop 6 May 11 15:32 hadoop-hadoop-secondarynamenode.pid

4.2 It records the daemon's pid (written once when the daemon starts)

[hadoop@hadoop001 tmp]$ cat hadoop-hadoop-datanode.pid 
27591
[hadoop@hadoop001 tmp]$

The stop scripts find the process via the pid recorded in this file and kill it.
kill -9 $(pgrep -f xxx) kills every process whose command line matches xxx.

4.3 Deleting this file DOES affect stopping/restarting the process!!

When a daemon starts, its pid number is written into the pid file.
When the daemon is stopped, the pid number is read back from the file and the process is killed (kill -9 pid).
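
A simplified sketch of what the stop path in sbin/hadoop-daemon.sh does with that pid file (paraphrased, not the exact script; the path is the default /tmp location shown above):

PID_FILE=/tmp/hadoop-hadoop-datanode.pid
if [ -f "$PID_FILE" ]; then
  TARGET_PID=$(cat "$PID_FILE")
  kill "$TARGET_PID"            # the real script sends TERM first and only falls back to kill -9 after a timeout
else
  echo "no datanode to stop"    # this is why a deleted pid file breaks stop/restart
fi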

4.4 In production, can pid files really be safely left in /tmp?

No: Linux cleans /tmp automatically by default (roughly a 30-day retention, depending on the distribution), so the pid files can silently disappear.
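
To check the actual cleanup policy on your own machine (CentOS 7 drives this with systemd-tmpfiles, CentOS 6 with a tmpwatch cron job; the exact retention age varies by distribution):

[root@hadoop001 ~]# cat /usr/lib/tmpfiles.d/tmp.conf     # CentOS 7: see the age column on the /tmp line
[root@hadoop001 ~]# cat /etc/cron.daily/tmpwatch         # CentOS 6: tmpwatch arguments for /tmp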

4.5 How to change the location

To change the pid directory, edit the app/hadoop/etc/hadoop/hadoop-env.sh script
and point it at /home/hadoop/tmp/:

export HADOOP_PID_DIR=/home/hadoop/tmp/

After changing it, check whether the old pid files still exist for the daemons that are already running, otherwise problems can occur when stopping them.
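
A hedged sketch of applying the change (the directory must exist and be writable by the hadoop user; the YARN daemons have a similar setting, YARN_PID_DIR in yarn-env.sh):

[hadoop@hadoop001 ~]$ mkdir -p /home/hadoop/tmp
[hadoop@hadoop001 ~]$ vi ~/app/hadoop/etc/hadoop/hadoop-env.sh      # add: export HADOOP_PID_DIR=/home/hadoop/tmp/
[hadoop@hadoop001 hadoop]$ sbin/stop-dfs.sh && sbin/start-dfs.sh    # restart so the pid files land in the new directory
[hadoop@hadoop001 ~]$ ll /home/hadoop/tmp/*.pid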

Summary:
in production, do not leave pid files in /tmp;
remember that they affect stopping and starting the daemons.

5. Blocks: Introduction and Production Experience

HDFS is good for storing large files; flooding it with small files hurts it.
It is suited to large files rather than small ones, though that does not mean it cannot store small files.

For example:
a 260m video file uploaded to HDFS is split into 3 blocks (dfs.blocksize 134217728 = 128M),
namely 128m, 128m and 4m.

Pseudo-distributed (single node): 1 node, replica count dfs.replication = 1 (the replica count is the number of copies and cannot exceed the number of servers).
In production an HDFS cluster always has at least 3 DataNodes, with dfs.replication = 3.

        DN1    DN2    DN5    DN10    (each block gets 3 replicas spread across the DataNodes)
128m    b1     b1     b1
128m    b2     b2     b2
4m      b3     b3     b3

Setting a replica count is what gives files stored on the HDFS platform fault tolerance.
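
A hedged way to verify the block size and replication in effect, and to see how a particular file was split into blocks (the file path simply reuses the earlier word count input as an example):

[hadoop@hadoop001 hadoop]$ hdfs getconf -confKey dfs.blocksize       # 134217728 = 128M by default
[hadoop@hadoop001 hadoop]$ hdfs getconf -confKey dfs.replication     # 1 on this pseudo-distributed node
[hadoop@hadoop001 hadoop]$ hdfs fsck /wordcount2/input/1.log -files -blocks -locations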

Why did dfs.blocksize go from 64M to 128M? To reduce the number of blocks and lighten the load on the NameNode.
In production you can raise the blocksize further to cut the block count, provided your files are actually big enough to fill blocks of that size.

With 3 replicas: actual storage used = file size * 3 = 260m * 3 = 780m of storage space.

Maintaining metadata for 5 blocks is more work than maintaining it for 3 blocks,
and all of that maintenance load falls on the NameNode.

That is why the NameNode does not like small files.

100 million 10KB files → 300 million blocks (one block per file, times 3 replicas).
Merge them into 10 million larger files of ~100m each → 30 million blocks.

How to avoid small files:
1. Merge the data before it lands on HDFS.
2. For data already on HDFS, periodically merge the cold files (rarely read or written), during the nightly business low period, and only touch older data, e.g. directories from previous days or the last 30 days (a sketch follows the schedule below):
/2020/10/10
/2020/10/11
/2020/10/12   the current day's directory is still in use; merging it could cause problems

For example:
on the 20th, merge the files from the 14th;
on the 21st, merge the files from the 15th.

Merging always lags the current day by a fixed number of days, rolling forward one day at a time.
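
A minimal sketch of compacting one cold day's directory on that schedule (the paths, the 6-day lag and the getmerge approach are only illustrative; a production pipeline would normally rewrite the data with a MapReduce/Spark job so the result stays splittable):

[hadoop@hadoop001 ~]$ DAY=$(date -d "6 days ago" +%Y/%m/%d)               # e.g. 2020/10/14 when run on 2020/10/20
[hadoop@hadoop001 ~]$ hdfs dfs -getmerge /logs/${DAY} /tmp/merged.log     # pull the day's small files into one local file
[hadoop@hadoop001 ~]$ hdfs dfs -mkdir -p /logs_merged/${DAY}
[hadoop@hadoop001 ~]$ hdfs dfs -put /tmp/merged.log /logs_merged/${DAY}/  # write it back as a single larger file
# only after verifying the merged copy:  hdfs dfs -rm -r /logs/${DAY}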

Rules of thumb:

Files under about 10m all count as small files.

With a 128m block size, merge small files into files of about 120m.

If the code targets exactly 128m, the merged file can overshoot (for example 129m) and become two blocks, because some files cannot be split.