Data Warehouse Training, Day 8

1. Hadoop Introduction

Broad sense: the ecosystem built around the Apache Hadoop software, which also includes Hive, Sqoop, HBase, Kafka, Spark, Flink
Narrow sense: the Apache Hadoop software itself

Apache Hadoop releases: 1.x, 2.x, 3.x
The mainstream line is 2.x
HDFS: storage - data storage - the Hadoop Distributed File System
MapReduce: computation (jobs) - computation and analysis, built around the two functions map and reduce
YARN: resource and job scheduling - manages compute and storage resources

On a big data platform, storage comes first; storage and computation complement each other

Official sites:
hadoop.apache.org
hive.apache.org

URL pattern:
<component>.apache.org

Apache versions:
2.x
3.x

In production, companies still mostly run CDH 5.x (from Cloudera), e.g.:
hadoop-2.6.0-cdh5.16.2.tar.gz

2. Hadoop HDFS Installation

2.1 Create the user and directories

[root@hadoop001 ~]# useradd hadoop
[root@hadoop001 ~]# su - hadoop
[hadoop@hadoop001 ~]$ mkdir tmp sourcecode software shell log lib data app
tmp: temporary files, pid files, etc.
app: installed software
sourcecode: source code for compiling
software: uploaded installation packages
shell: shell scripts
log: log files
lib: third-party jars
data: data files
[hadoop@hadoop001 ~]$ cd software/
[hadoop@hadoop001 software]$ ll
total 605476
-rw-r--r--. 1 root root 434354462 May 7 09:54 hadoop-2.6.0-cdh5.16.2.tar.gz
-rw-r--r--. 1 hadoop hadoop 185646832 May 7 11:04 jdk-8u181-linux-x64.tar.gz

2.2 Install and deploy the JDK

For the steps, see:
https://wangqi1994.github.io/2020/04/26/JDK部署/

Production notes:
If you use CDH, the JDK version matters in production; an unsupported JDK can trigger bugs.
Production generally uses JDK 8; download the officially recommended build, 8u181.
The Apache docs say JDK 7 or later is enough,
but JDK 8 is recommended; follow Cloudera's requirements to avoid pitfalls.

Apache site: Java version requirements
Cloudera site: Java version requirements

CDS: Cloudera's Spark distribution
CDK: Cloudera's Kafka distribution

[hadoop@hadoop001 software]$ which java
/usr/java/jdk1.8.0_45/bin/java
# Switch back to the root user to avoid permission problems
[root@hadoop001 ~]# cd /home/hadoop/software/
[root@hadoop001 software]# tar -xzvf jdk-8u181-linux-x64.tar.gz -C /usr/java
# Check the ownership and fix it
[root@hadoop001 java]# ll
total 8
drwxr-xr-x. 7 10 143 4096 Jul 7 2018 jdk1.8.0_181
drwxr-xr-x. 8 root root 4096 Apr 11 2015 jdk1.8.0_45
[root@hadoop001 java]# chown -R root:root jdk1.8.0_181
[root@hadoop001 java]# ll
total 8
drwxr-xr-x. 7 root root 4096 Jul 7 2018 jdk1.8.0_181
drwxr-xr-x. 8 root root 4096 Apr 11 2015 jdk1.8.0_45
# Configure the environment variables
[root@hadoop001 java]# vi /etc/profile
# ruozedata environment
export JAVA_HOME=/usr/java/jdk1.8.0_181
export PATH=$JAVA_HOME/bin:$PATH
# Apply the environment variables
[root@hadoop001 java]# source /etc/profile
# Test
[root@hadoop001 java]# which java
/usr/java/jdk1.8.0_181/bin/java
[root@hadoop001 java]# java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

2.3 Extract Hadoop and create a symbolic link

[hadoop@hadoop001 software]$ tar -xzvf hadoop-2.6.0-cdh5.16.2.tar.gz -C ../app/
[hadoop@hadoop001 software]$ cd ../app
[hadoop@hadoop001 app]$ ll
total 4
drwxr-xr-x. 14 hadoop hadoop 4096 Jun 3 2019 hadoop-2.6.0-cdh5.16.2
[hadoop@hadoop001 app]$ ln -s hadoop-2.6.0-cdh5.16.2 hadoop # symbolic link
[hadoop@hadoop001 app]$ ll
total 4
lrwxrwxrwx. 1 hadoop hadoop 22 May 7 11:20 hadoop -> hadoop-2.6.0-cdh5.16.2
drwxr-xr-x. 14 hadoop hadoop 4096 Jun 3 2019 hadoop-2.6.0-cdh5.16.2

Symbolic links: introduction and use cases
ln -s <target path> <link path>
1. Version switching - works like a shortcut
To switch versions, delete the existing hadoop link and create a new symbolic link.
/home/hadoop/app/hadoop
/home/hadoop/app/hadoop-2.6.0-cdh5.16.2

Without a symbolic link, upgrading or replacing the version (say 2.x to 3.x) means carefully checking and editing every script that references the versioned path.
With the link set up in advance, scripts only ever reference hadoop and never care which version sits behind it; see the sketch below.
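A minimal sketch of such a switch, assuming a hypothetical 3.x release has already been extracted into ~/app (the hadoop-3.x.y directory name is made up for illustration):

# stop the services first, then repoint the link
[hadoop@hadoop001 ~]$ cd /home/hadoop/app
[hadoop@hadoop001 app]$ rm hadoop                  # removes only the link, not the release directory
[hadoop@hadoop001 app]$ ln -s hadoop-3.x.y hadoop  # scripts keep using /home/hadoop/app/hadoop unchanged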

2. Small disk to big disk - reduces the risk of a disk filling up
Suppose the / (root) partition is small, say 20G, and the /app/log/hadoop-hdfs directory has grown to 18G;
the disk is at risk of filling up.
Mount an additional disk, e.g. /data01,
move the directory onto the mounted disk, then link it back with a symbolic link:

mv /app/log/hadoop-hdfs /data01/  ==>/data01/hadoop-hdfs
ln -s /data01/hadoop-hdfs /app/log/hadoop-hdfs

2.4 Explicitly set the environment variables

Even though JAVA_HOME is already in /etc/profile, set it explicitly in hadoop-env.sh as well: daemons launched over SSH do not necessarily inherit the login environment.

[hadoop@hadoop001 app]$ cd hadoop/etc/hadoop
[hadoop@hadoop001 hadoop]$ vi hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.8.0_181

2.5 Hadoop modes and the official documentation

Hadoop has three modes:
Local (Standalone) Mode
Pseudo-Distributed Mode - this installation follows the official docs for this mode
Fully-Distributed Mode - a real cluster; covered later

Official documentation

2.6 Configure passwordless SSH to hadoop001

[hadoop@hadoop001 ~]$ rm -rf .ssh # In production, never delete files casually; check the contents first. Here, check whether authorized_keys already has entries; if it does, avoid deleting it.
[hadoop@hadoop001 ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): # just press Enter to accept
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase): # press Enter for no passphrase
Enter same passphrase again: # press Enter again
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
d6:5f:35:4f:6c:f8:38:0b:65:ab:6d:50:3b:a6:71:91 hadoop@hadoop001
The key's randomart image is:
+--[ RSA 2048]----+
| |
| + |
| E.=|
| . + Xo|
| S . + X o|
| . . @ + |
| + + |
| . |
| |
+-----------------+
[hadoop@hadoop001 ~]$ cd .ssh
[hadoop@hadoop001 .ssh]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys # write out the path; with a relative path authorized_keys may not be found
[hadoop@hadoop001 .ssh]$ chmod 0600 ~/.ssh/authorized_keys # non-root users must fix the permissions
[hadoop@hadoop001 .ssh]$ ssh hadoop001 date # make sure /etc/hosts contains the hostname hadoop001, otherwise you get: ssh: Could not resolve hostname hadoop001: Name or service not known
The authenticity of host 'hadoop001 (192.168.111.131)' can't be established.
ECDSA key fingerprint is 8d:08:cf:2b:40:03:aa:7a:4e:0c:60:0f:2c:19:8c:1a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hadoop001,192.168.111.131' (ECDSA) to the list of known hosts.
Thu May 7 14:02:46 CST 2020
[hadoop@hadoop001 .ssh]$ ssh hadoop001 date
Thu May 7 14:02:51 CST 2020
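If passwordless login still prompts for a password, the usual culprit is permissions; a quick check (standard OpenSSH requirements, not part of the original notes):

[hadoop@hadoop001 ~]$ ls -ld ~/.ssh ~/.ssh/authorized_keys
# ~/.ssh should be 700 and authorized_keys 600, both owned by the hadoop user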

2.7 Edit the configuration

Make all three HDFS processes (NameNode, DataNode, SecondaryNameNode) start under the hadoop001 hostname.

# NN: start the NameNode under the hadoop001 hostname
etc/hadoop/core-site.xml # core config file, under the hadoop directory in app
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:9000</value> # start with the machine hostname
    </property>

    <property>
        <name>hadoop.tmp.dir</name> # set the tmp directory; a tmp folder under your own home directory is recommended
        <value>/home/hadoop/tmp/</value>
    </property>

</configuration>

# SNN: start the SecondaryNameNode under the hadoop001 hostname
etc/hadoop/hdfs-site.xml # under the hadoop directory in app
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property> # add both of the following properties
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop001:9868</value> # set the hostname for your own machine
    </property>

    <property>
        <name>dfs.namenode.secondary.https-address</name>
        <value>hadoop001:9868</value>
    </property>
</configuration>

# DN: start the DataNode under the hadoop001 hostname
[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop/etc/hadoop
[hadoop@hadoop001 hadoop]$ vi slaves
hadoop001

Note: the operating system's /tmp directory is typically cleaned up on a schedule (around every 30 days), so do not keep HDFS data under /tmp; that is why hadoop.tmp.dir above points to /home/hadoop/tmp.
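A quick way to confirm the configuration is being picked up (hdfs getconf is a standard Hadoop 2.x command; the output simply echoes the values configured above):

[hadoop@hadoop001 ~]$ cd /home/hadoop/app/hadoop
[hadoop@hadoop001 hadoop]$ bin/hdfs getconf -confKey fs.defaultFS
hdfs://hadoop001:9000
[hadoop@hadoop001 hadoop]$ bin/hdfs getconf -confKey hadoop.tmp.dir
/home/hadoop/tmp/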

2.8 Format the NameNode

[hadoop@ruozedata001 hadoop]$ pwd
/home/hadoop/app/hadoop
[hadoop@ruozedata001 hadoop]$ bin/hdfs namenode -format
# Only needed the very first time: formatting lays down the metadata format for this distributed file system; never repeat it afterwards
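With hadoop.tmp.dir set as above, the formatted NameNode metadata should end up underneath it (the dfs/name layout is the stock Hadoop default; dfs/data and dfs/namesecondary only appear after the daemons have started):

[hadoop@hadoop001 ~]$ ls /home/hadoop/tmp/dfs/name/current/
# expect fsimage_*, seen_txid and VERSION files here after a successful format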

2.9 Start HDFS

[hadoop@hadoop001 hadoop]$ sbin/start-dfs.sh
20/05/07 15:11:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop001]
hadoop001: starting namenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-namenode-hadoop001.out
hadoop001: starting datanode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-datanode-hadoop001.out
Starting secondary namenodes [hadoop001]
hadoop001: starting secondarynamenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-secondarynamenode-hadoop001.out
20/05/07 15:11:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 hadoop]$ jps
42944 SecondaryNameNode
43171 Jps
42357 NameNode
42555 DataNode
[hadoop@hadoop001 hadoop]$

sbin/stop-all.sh = stop-dfs.sh plus stop-yarn.sh
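For this HDFS-only setup it is enough to stop just the HDFS daemons; in Hadoop 2.x, stop-all.sh is marked deprecated in favour of the per-component scripts:

[hadoop@hadoop001 hadoop]$ sbin/stop-dfs.sh # stops the NameNode, DataNode and SecondaryNameNode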

2.10 Open the web UI

[hadoop@hadoop001 hadoop]$ netstat -nlp |grep 42357
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 192.168.111.131:9000 0.0.0.0:* LISTEN 42357/java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 42357/java

If the page opens at this IP and port, the first Hadoop component (HDFS) is working.
http://192.168.111.131:50070/dfshealth.html#tab-overview

Note:
Check the firewall status; to reach the VM from outside, the Linux firewall must be stopped.

systemctl stop firewalld
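To keep the firewall from coming back after a reboot, it can also be disabled at boot (standard systemd usage, not part of the original notes):

systemctl disable firewalld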

2.11 Create directories on HDFS

[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -mkdir /usr
20/05/07 17:38:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -ls /
20/05/07 17:39:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2020-05-07 17:38 /usr
[hadoop@hadoop001 hadoop]$ ls /
bin boot dev etc home lib lib64 lost+found media mnt opt proc root run sbin srv sys tmp usr var
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -mkdir /usr/hadoop/
20/05/07 17:42:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2.12 Upload: Linux -> HDFS

[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -mkdir /wordcount # create a directory
20/05/07 17:43:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -mkdir /wordcount/input # create the input directory; do not create the output directory, the job creates it automatically when it finishes
20/05/07 17:43:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -put etc/hadoop/*.xml /wordcount/input # upload the data

2.13 Run a computation

# adjust the jar path and name to your own installation; write the HDFS input and
# output paths in full, otherwise Hadoop resolves them relative to the user's home directory on HDFS
bin/hadoop jar \
share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
grep /wordcount/input /wordcount/output 'dfs[a-z.]+'
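Once the job finishes, the result can be checked directly on HDFS before downloading it (plain hdfs dfs commands; the exact matches depend on your config files):

[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -ls /wordcount/output
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -cat /wordcount/output/part-r-00000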

2.14 Download: HDFS -> Linux

[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -get /wordcount/output output # download from HDFS into an output folder under the current Linux directory
20/05/07 17:57:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 hadoop]$ cat output/
cat: output/: Is a directory
[hadoop@hadoop001 hadoop]$ cd output/
[hadoop@hadoop001 output]$ ll
total 4
-rw-r--r--. 1 hadoop hadoop 90 May 7 17:57 part-r-00000
-rw-r--r--. 1 hadoop hadoop 0 May 7 17:57 _SUCCESS
[hadoop@hadoop001 output]$ cat part-r-00000
1 dfsadmin
1 dfs.replication
1 dfs.namenode.secondary.https
1 dfs.namenode.secondary.http