HDFS Troubleshooting
1. NameNode Failure Recovery
The NameNode process has died and its stored data is lost. How do we recover the NameNode?
1.1 Simulating the Failure
- Kill the NameNode process with kill -9
[jack@hadoop102 current]$ kill -9 19886
- Delete the data stored by the NameNode (/opt/module/hadoop-3.3.6/data/dfs/name)
[jack@hadoop102 hadoop-3.3.6]$ rm -rf /opt/module/hadoop-3.3.6/data/dfs/name/*
1.2 Recovery
- Copy the data from the SecondaryNameNode (on hadoop104) into the original NameNode's data directory
[jack@hadoop102 dfs]$ scp -r jack@hadoop104:/opt/module/hadoop-3.3.6/data/dfs/namesecondary/* ./name/
- Restart the NameNode
[jack@hadoop102 hadoop-3.3.6]$ hdfs --daemon start namenode
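As a quick sanity check (a minimal sketch; jps and dfsadmin -report are standard commands, but the exact output depends on your cluster):
```sh
## Confirm the NameNode process is running on hadoop102
jps | grep NameNode
## Confirm the NameNode is serving requests and the DataNodes have re-registered
hdfs dfsadmin -report | head -n 20
```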
2. Cluster Safe Mode & Disk Repair
2.1 Safe Mode in Brief
Safe mode: the file system accepts only read requests and rejects mutating requests such as deletes and modifications.
2.2 Scenarios That Trigger Safe Mode
- The NameNode is in safe mode while it loads the fsimage and edit log
- The NameNode is in safe mode while it receives DataNode registrations
2.3 Conditions for Leaving Safe Mode
- dfs.namenode.safemode.min.datanodes: minimum number of available DataNodes; default 0
- dfs.namenode.safemode.threshold-pct: fraction of blocks that must meet the minimum replication requirement, out of all blocks in the system; default 0.999f (i.e., at most 0.1% of blocks may be missing)
- dfs.namenode.safemode.extension: stabilization time; default 30000 ms, i.e. 30 s
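To see the values in effect on a given cluster (a quick check; hdfs getconf prints the live configuration, so you get the defaults unless hdfs-site.xml overrides them):
```sh
hdfs getconf -confKey dfs.namenode.safemode.threshold-pct
hdfs getconf -confKey dfs.namenode.safemode.extension
hdfs getconf -confKey dfs.namenode.safemode.min.datanodes
```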
2.4 Basic Syntax
While the cluster is in safe mode, critical (write) operations cannot be performed. Once cluster startup completes, safe mode is exited automatically.
(1) bin/hdfs dfsadmin -safemode get (check safe mode status)
(2) bin/hdfs dfsadmin -safemode enter (enter safe mode)
(3) bin/hdfs dfsadmin -safemode leave (leave safe mode)
(4) bin/hdfs dfsadmin -safemode wait (wait for safe mode to end)
2.5 Safe Mode at Cluster Startup
- Restart the cluster
[jack@hadoop102 subdir0]$ hadoop_helper stop
[jack@hadoop102 subdir0]$ hadoop_helper start
- Immediately after the cluster starts, try to delete data on it; HDFS reports that the cluster is in safe mode, as shown below
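For example (a sketch; the file path here is hypothetical, and the exact message wording can vary slightly between Hadoop versions):
```sh
## Any delete attempt during safe mode fails the same way
[jack@hadoop102 subdir0]$ hadoop fs -rm /wcinput/word.txt
rm: Cannot delete /wcinput/word.txt. Name node is in safe mode.
```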
2.6 Simulating Disk Corruption
Scenario: data blocks are corrupted and the cluster enters safe mode. How do we handle it?
- On each of hadoop102, hadoop103, and hadoop104, enter the /opt/module/hadoop-3.3.6/data/dfs/data/current/BP-1056434405-192.168.101.102-1705325179904/current/finalized/subdir0/subdir0 directory and delete the same two blocks
[jack@hadoop102 subdir0]$ pwd
/opt/module/hadoop-3.3.6/data/dfs/data/current/BP-1056434405-192.168.101.102-1705325179904/current/finalized/subdir0/subdir0
[jack@hadoop102 subdir0]$ rm -rf blk_1073741830 blk_1073741830_1006.meta blk_1073741832 blk_1073741832_1008.meta
Note: the same commands must also be run on hadoop103 and hadoop104.
Tip
You can use Xshell's multi-session feature to run the block-deletion commands on hadoop102/hadoop103/hadoop104 simultaneously; a scripted alternative is sketched below.
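If you prefer plain shell, a loop does the same thing (a sketch; it assumes passwordless ssh between the nodes and reuses the block IDs deleted above):
```sh
## Block-pool directory on every DataNode (same path on all three hosts)
BP_DIR=/opt/module/hadoop-3.3.6/data/dfs/data/current/BP-1056434405-192.168.101.102-1705325179904/current/finalized/subdir0/subdir0
for host in hadoop102 hadoop103 hadoop104; do
  ## Delete the same two blocks (data files plus their .meta files)
  ssh jack@$host "rm -f $BP_DIR/blk_1073741830 $BP_DIR/blk_1073741830_1006.meta \
                        $BP_DIR/blk_1073741832 $BP_DIR/blk_1073741832_1008.meta"
done
```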
- Restart the cluster
[jack@hadoop102 subdir0]$ hadoop_helper stop
[jack@hadoop102 subdir0]$ hadoop_helper start
- Open http://hadoop102:9870/dfshealth.html#tab-overview
Note: safe mode is on because the number of healthy blocks has not reached the required threshold.
- Leave safe mode
[jack@hadoop102 subdir0]$ hdfs dfsadmin -safemode get
Safe mode is ON
[jack@hadoop102 subdir0]$ hdfs dfsadmin -safemode leave
Safe mode is OFF
- Open http://hadoop102:9870/dfshealth.html#tab-overview again
Safe mode is now off, but the page still warns about missing files. There are two ways to handle this: take the machine offline and have a data-recovery company restore the files listed in the warning, or delete the metadata of the problematic blocks.
- Delete the metadata of the lost blocks, as sketched below
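One way to do this from the command line (a minimal sketch; hdfs fsck's -delete option permanently removes files with missing blocks, so list the corrupt files first and make sure none of them are needed):
```sh
## List files that have missing or corrupt blocks
[jack@hadoop102 subdir0]$ hdfs fsck / -list-corruptfileblocks
## Irreversibly remove those files' metadata
[jack@hadoop102 subdir0]$ hdfs fsck / -delete
```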
- Open http://hadoop102:9870/dfshealth.html#tab-overview once more; the cluster is back to normal
2.7 Using Safe Mode Wait
1. Check the current mode
[jack@hadoop102 subdir0]$ hdfs dfsadmin -safemode get
Safe mode is OFF
2. Enter safe mode
[jack@hadoop102 subdir0]$ hdfs dfsadmin -safemode enter
Safe mode is ON
3. Create and run the safemode_test.sh script
[jack@hadoop102 hadoop-3.3.6]$ vim safemode_test.sh
#!/bin/bash
hdfs dfsadmin -safemode wait
hdfs dfs -put /tmp/safemode_test.sh /
echo "finished 添加文件!"
[jack@hadoop102 hadoop-3.3.6]$ chmod 777 safemode_test.sh
[jack@hadoop102 hadoop-3.3.6]$ ./safemode_test.sh
The script hangs, blocked at the wait command.
4. Open a new shell window and leave safe mode
[jack@hadoop102 hadoop-3.3.6]$ bin/hdfs dfsadmin -safemode leave
5. Switch back to the previous window and observe
[jack@hadoop103 tmp]$ ./safemode_test.sh
Safe mode is OFF
finished uploading the file!
6. The uploaded file is now on the HDFS cluster; a quick listing confirms it
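To verify (just the command; the listing's other entries depend on what else is on your cluster):
```sh
[jack@hadoop102 hadoop-3.3.6]$ hadoop fs -ls / | grep safemode_test.sh
```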
3. Slow Disk Monitoring
A "slow disk" is one that writes data very slowly. Slow disks are actually not uncommon: as a machine accumulates uptime and workload, its disks' read/write performance naturally degrades, and in severe cases writes become noticeably delayed.
How do you spot a slow disk?
Creating a directory on HDFS normally takes well under a second. If creating a directory occasionally takes a minute or more, not every time but every now and then, a slow disk is very likely present.
The following methods can identify which disk is slow:
3.1 Check the time since the last heartbeat
A slow disk usually disturbs the heartbeat between the DataNode and the NameNode. The normal heartbeat interval is 3 s; a gap longer than 3 s indicates a problem.
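The per-DataNode heartbeat can be checked in the NameNode web UI (the "Last contact" column) or from the command line (a sketch; the grep pattern matches the report format of recent Hadoop 3.x releases):
```sh
## Show each DataNode together with the time of its last heartbeat
hdfs dfsadmin -report | grep -E "^Name:|Last contact"
```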
3.2 Benchmark disk read/write performance with fio
- Install fio
[jack@hadoop103 tmp]$ sudo yum install -y fio
1. Sequential read test
[jack@hadoop103 tmp]$ sudo fio -filename=/tmp/test.log -direct=1 -iodepth 1 -thread -rw=read -ioengine=psync -bs=16k -size=2G -numjobs=10 -runtime=60 -group_reporting -name=test_r
test_r: (g=0): rw=read, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=1
...
fio-3.7
Starting 10 threads
test_r: Laying out IO file (1 file / 2048MiB)
Jobs: 10 (f=10): [R(10)][100.0%][r=213MiB/s,w=0KiB/s][r=13.7k,w=0 IOPS][eta 00m:00s]
test_r: (groupid=0, jobs=10): err= 0: pid=30801: Wed Jan 24 14:21:23 2024
read: IOPS=13.8k, BW=216MiB/s (226MB/s)(12.6GiB/60001msec)
clat (usec): min=60, max=18020, avg=722.53, stdev=385.06
lat (usec): min=60, max=18021, avg=722.74, stdev=385.06
clat percentiles (usec):
| 1.00th=[ 70], 5.00th=[ 73], 10.00th=[ 474], 20.00th=[ 570],
| 30.00th=[ 627], 40.00th=[ 652], 50.00th=[ 701], 60.00th=[ 717],
| 70.00th=[ 766], 80.00th=[ 824], 90.00th=[ 1020], 95.00th=[ 1418],
| 99.00th=[ 1811], 99.50th=[ 2147], 99.90th=[ 3097], 99.95th=[ 4359],
| 99.99th=[11994]
bw ( KiB/s): min=19360, max=22880, per=9.99%, avg=22073.63, stdev=580.68, samples=1194
iops : min= 1210, max= 1430, avg=1379.57, stdev=36.30, samples=1194
lat (usec) : 100=8.13%, 250=0.36%, 500=3.51%, 750=53.12%, 1000=24.57%
lat (msec) : 2=9.59%, 4=0.66%, 10=0.03%, 20=0.02%
cpu : usr=0.09%, sys=9.71%, ctx=758662, majf=0, minf=43
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=828243,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=216MiB/s (226MB/s), 216MiB/s-216MiB/s (226MB/s-226MB/s), io=12.6GiB (13.6GB), run=60001-60001msec
Disk stats (read/write):
sda: ios=826772/10, merge=0/0, ticks=64752/0, in_queue=64712, util=100.00%
The results show an overall sequential read speed of 216 MiB/s.
2. Sequential write test
[jack@hadoop102 tmp]$ sudo fio -filename=/tmp/test.log -direct=1 -iodepth 1 -thread -rw=write -ioengine=psync -bs=16k -size=2G -numjobs=10 -runtime=60 -group_reporting -name=test_w
test_w: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=1
...
fio-3.7
Starting 10 threads
test_w: Laying out IO file (1 file / 2048MiB)
Jobs: 10 (f=10): [W(10)][100.0%][r=0KiB/s,w=185MiB/s][r=0,w=11.9k IOPS][eta 00m:00s]
test_w: (groupid=0, jobs=10): err= 0: pid=43511: Wed Jan 24 14:23:20 2024
write: IOPS=12.8k, BW=199MiB/s (209MB/s)(11.7GiB/60011msec)
clat (usec): min=61, max=313342, avg=781.75, stdev=696.82
lat (usec): min=62, max=313343, avg=782.12, stdev=696.83
clat percentiles (usec):
| 1.00th=[ 72], 5.00th=[ 78], 10.00th=[ 97], 20.00th=[ 545],
| 30.00th=[ 619], 40.00th=[ 668], 50.00th=[ 734], 60.00th=[ 783],
| 70.00th=[ 832], 80.00th=[ 922], 90.00th=[ 1287], 95.00th=[ 1631],
| 99.00th=[ 2376], 99.50th=[ 2933], 99.90th=[ 6915], 99.95th=[11469],
| 99.99th=[16188]
bw ( KiB/s): min= 9760, max=21280, per=10.00%, avg=20404.58, stdev=886.73, samples=1200
iops : min= 610, max= 1330, avg=1275.22, stdev=55.45, samples=1200
lat (usec) : 100=10.13%, 250=1.06%, 500=4.53%, 750=39.63%, 1000=29.22%
lat (msec) : 2=13.22%, 4=1.99%, 10=0.14%, 20=0.06%, 50=0.01%
lat (msec) : 500=0.01%
cpu : usr=0.07%, sys=9.76%, ctx=680639, majf=0, minf=7
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,765507,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=199MiB/s (209MB/s), 199MiB/s-199MiB/s (209MB/s-209MB/s), io=11.7GiB (12.5GB), run=60011-60011msec
Disk stats (read/write):
sda: ios=0/764318, merge=0/1, ticks=0/64761, in_queue=64708, util=100.00%
The results show an overall sequential write speed of 199 MiB/s.
3. Random write test
[jack@hadoop103 tmp]$ sudo fio -filename=/tmp/test2.log -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=16k -size=2G -numjobs=10 -runtime=60 -group_reporting -name=test_randw
test_randw: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=1
...
fio-3.7
Starting 10 threads
test_randw: Laying out IO file (1 file / 2048MiB)
Jobs: 10 (f=10): [w(10)][100.0%][r=0KiB/s,w=166MiB/s][r=0,w=10.6k IOPS][eta 00m:00s]
test_randw: (groupid=0, jobs=10): err= 0: pid=30817: Wed Jan 24 14:24:39 2024
write: IOPS=11.2k, BW=175MiB/s (183MB/s)(10.2GiB/60001msec)
clat (usec): min=61, max=301012, avg=891.31, stdev=1386.69
lat (usec): min=62, max=301012, avg=891.69, stdev=1386.70
clat percentiles (usec):
| 1.00th=[ 72], 5.00th=[ 76], 10.00th=[ 82], 20.00th=[ 103],
| 30.00th=[ 519], 40.00th=[ 660], 50.00th=[ 766], 60.00th=[ 865],
| 70.00th=[ 988], 80.00th=[ 1188], 90.00th=[ 1663], 95.00th=[ 2147],
| 99.00th=[ 4113], 99.50th=[ 6259], 99.90th=[16319], 99.95th=[22676],
| 99.99th=[40633]
bw ( KiB/s): min= 8032, max=22146, per=10.00%, avg=17893.63, stdev=1769.94, samples=1197
iops : min= 502, max= 1384, avg=1118.27, stdev=110.62, samples=1197
lat (usec) : 100=19.11%, 250=5.31%, 500=4.89%, 750=19.07%, 1000=22.69%
lat (msec) : 2=22.90%, 4=4.97%, 10=0.79%, 20=0.19%, 50=0.06%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%
cpu : usr=0.09%, sys=9.67%, ctx=544128, majf=0, minf=7
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,671162,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=175MiB/s (183MB/s), 175MiB/s-175MiB/s (183MB/s-183MB/s), io=10.2GiB (10.0GB), run=60001-60001msec
Disk stats (read/write):
sda: ios=0/669952, merge=0/1, ticks=0/63179, in_queue=63115, util=100.00%
The results show an overall random write speed of 175 MiB/s.
4. Mixed random read/write test
[jack@hadoop104 subdir0]$ sudo fio -filename=/tmp/test3.log -direct=1 -iodepth 1 -thread -rw=randrw -rwmixread=70 -ioengine=psync -bs=16k -size=2G -numjobs=10 -runtime=60 -group_reporting -name=test_r_w -ioscheduler=noop
test_r_w: (g=0): rw=randrw, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=1
...
fio-3.7
Starting 10 threads
test_r_w: Laying out IO file (1 file / 2048MiB)
Jobs: 10 (f=10): [m(10)][100.0%][r=129MiB/s,w=53.3MiB/s][r=8250,w=3412 IOPS][eta 00m:00s]
test_r_w: (groupid=0, jobs=10): err= 0: pid=46934: Wed Jan 24 14:25:11 2024
read: IOPS=8611, BW=135MiB/s (141MB/s)(8074MiB/60002msec)
clat (usec): min=61, max=30085, avg=785.75, stdev=491.36
lat (usec): min=61, max=30085, avg=785.96, stdev=491.37
clat percentiles (usec):
| 1.00th=[ 74], 5.00th=[ 78], 10.00th=[ 229], 20.00th=[ 537],
| 30.00th=[ 619], 40.00th=[ 685], 50.00th=[ 742], 60.00th=[ 791],
| 70.00th=[ 857], 80.00th=[ 955], 90.00th=[ 1352], 95.00th=[ 1614],
| 99.00th=[ 2245], 99.50th=[ 2573], 99.90th=[ 3982], 99.95th=[ 6325],
| 99.99th=[12256]
bw ( KiB/s): min=11520, max=15456, per=9.99%, avg=13768.25, stdev=622.38, samples=1195
iops : min= 720, max= 966, avg=860.48, stdev=38.91, samples=1195
write: IOPS=3694, BW=57.7MiB/s (60.5MB/s)(3464MiB/60002msec)
clat (usec): min=63, max=15618, avg=866.82, stdev=536.49
lat (usec): min=64, max=15627, avg=867.22, stdev=536.50
clat percentiles (usec):
| 1.00th=[ 75], 5.00th=[ 79], 10.00th=[ 88], 20.00th=[ 570],
| 30.00th=[ 725], 40.00th=[ 807], 50.00th=[ 873], 60.00th=[ 938],
| 70.00th=[ 1012], 80.00th=[ 1123], 90.00th=[ 1450], 95.00th=[ 1745],
| 99.00th=[ 2311], 99.50th=[ 2606], 99.90th=[ 3752], 99.95th=[ 5997],
| 99.99th=[12256]
bw ( KiB/s): min= 4192, max= 7104, per=9.99%, avg=5907.91, stdev=387.16, samples=1195
iops : min= 262, max= 444, avg=369.21, stdev=24.20, samples=1195
lat (usec) : 100=10.66%, 250=1.48%, 500=5.26%, 750=28.26%, 1000=32.40%
lat (msec) : 2=20.10%, 4=1.75%, 10=0.06%, 20=0.03%, 50=0.01%
cpu : usr=0.09%, sys=9.75%, ctx=650202, majf=0, minf=7
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=516717,221705,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=135MiB/s (141MB/s), 135MiB/s-135MiB/s (141MB/s-141MB/s), io=8074MiB (8466MB), run=60002-60002msec
WRITE: bw=57.7MiB/s (60.5MB/s), 57.7MiB/s-57.7MiB/s (60.5MB/s-60.5MB/s), io=3464MiB (3632MB), run=60002-60002msec
Disk stats (read/write):
sda: ios=515724/221264, merge=0/0, ticks=44809/19817, in_queue=64587, util=100.00%
The results show that under mixed random read/write, the read speed is 135 MiB/s and the write speed is 57.7 MiB/s.
4. Small File Archiving
4.1 Drawbacks of Storing Small Files in HDFS
Every file is stored as blocks, and each block's metadata is kept in the NameNode's memory, so storing small files in HDFS is very inefficient: a large number of small files will consume most of the NameNode's memory. Note, however, that the disk space a small file occupies has nothing to do with the block size. For example, a 1 MB file stored with a 128 MB block size uses 1 MB of disk space, not 128 MB.
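A rough back-of-the-envelope estimate makes the cost concrete (using the commonly cited figure of roughly 150 bytes of NameNode heap per namespace object, which is an approximation, not an exact constant): 10 million small files, each contributing one file object plus one block object, cost about 10,000,000 × 2 × 150 B ≈ 3 GB of NameNode heap, no matter how little data the files actually hold.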
4.2 How to Address Small File Storage
An HDFS archive file, or HAR file, is a more efficient file-archiving tool: it packs files into HDFS blocks, reducing NameNode memory usage while still allowing transparent access to the files. Concretely, the files inside an HDFS archive remain individually accessible, but to the NameNode the archive appears as a single unit, which shrinks the NameNode's memory footprint.
1. Archive files: pack everything under /user/jack/input into a single archive named input.har and store the result under the /output path
[jack@hadoop104 subdir0]$ hadoop archive -archiveName input.har -p /user/jack/input /output
2024-01-24 14:44:38,203 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at hadoop103/192.168.101.103:8032
2024-01-24 14:44:39,158 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at hadoop103/192.168.101.103:8032
2024-01-24 14:44:39,214 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at hadoop103/192.168.101.103:8032
2024-01-24 14:44:39,772 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/jack/.staging/job_1706075314860_0001
2024-01-24 14:44:40,363 INFO mapreduce.JobSubmitter: number of splits:1
2024-01-24 14:44:40,797 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1706075314860_0001
2024-01-24 14:44:40,797 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-01-24 14:44:41,085 INFO conf.Configuration: resource-types.xml not found
2024-01-24 14:44:41,102 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-01-24 14:44:41,621 INFO impl.YarnClientImpl: Submitted application application_1706075314860_0001
2024-01-24 14:44:41,749 INFO mapreduce.Job: The url to track the job: http://hadoop103:8088/proxy/application_1706075314860_0001/
2024-01-24 14:44:41,754 INFO mapreduce.Job: Running job: job_1706075314860_0001
2024-01-24 14:44:59,341 INFO mapreduce.Job: Job job_1706075314860_0001 running in uber mode : false
2024-01-24 14:44:59,343 INFO mapreduce.Job: map 0% reduce 0%
2024-01-24 14:45:11,640 INFO mapreduce.Job: map 100% reduce 0%
2024-01-24 14:45:21,001 INFO mapreduce.Job: map 100% reduce 100%
2024-01-24 14:45:22,030 INFO mapreduce.Job: Job job_1706075314860_0001 completed successfully
....
When the job finishes, check the web UI.
2. Inspect the archive
## List the archive from the command line
[jack@hadoop102 hadoop-3.3.6]$ hadoop fs -ls /output/input.har
## The raw listing is less readable than before archiving; use the har:/// scheme to see the individual files
[jack@hadoop104 subdir0]$ hadoop fs -ls har:///output/input.har
Found 4 items
drwxr-xr-x - jack supergroup 0 2024-01-16 22:29 har:///output/input.har/json
drwxr-xr-x - jack supergroup 0 2024-01-16 22:28 har:///output/input.har/log
drwxr-xr-x - jack supergroup 0 2024-01-16 23:27 har:///output/input.har/story
drwxr-xr-x - jack supergroup 0 2024-01-18 15:07 har:///output/input.har/txt
## Copy files out of the archive into the /tmp directory
[jack@hadoop104 subdir0]$ hadoop fs -cp har:///output/input.har/story /tmp/
Then verify the copy result.
3. Unarchive files
[jack@hadoop102 hadoop-3.3.6]$ hadoop fs -cp har:///output/input.har/* /
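Extracting with -cp copies serially; for large archives the Hadoop Archives guide recommends DistCp for parallel unarchiving (a sketch using the same paths as above):
```sh
[jack@hadoop102 hadoop-3.3.6]$ hadoop distcp har:///output/input.har/* /
```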