Production Internship 03: ZooKeeper and HBase Basics

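Tip

Why does this jump straight to day seven? The previous days covered Java Web material that overlaps with a tech stack I already know, so I did not write them up. During the past week we also finished the backend of our project; for details see GitHub: wrm244/bigdata_depression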

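ZooKeeper

ZooKeeper is a distributed coordination service that gives distributed systems reliable coordination and consistency guarantees. It helps with problems such as keeping data consistent, handling node failures, and scheduling distributed tasks. ZooKeeper and HBase are closely related: ZooKeeper is part of the infrastructure that HBase depends on.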

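Configuring ZooKeeper

Go to the directory where ZooKeeper was unpacked and open its conf subdirectory. Copy zoo_sample.cfg and name the copy zoo.cfg, then edit zoo.cfg as needed. The main settings are dataDir (the ZooKeeper data directory) and clientPort (the port clients connect to). The following configuration can be used as a reference: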

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/opt/soft/zookeeper/zkData
dataLogDir=/opt/soft/zookeeper/zkLog
# the port at which the clients will connect
clientPort=2181
server.1=hadoop:2888:3888
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
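
The `server.1` entry above is normally paired with a `myid` file in dataDir that contains the same number; in a multi-server ensemble every server needs one, and creating it on a single node is harmless. The following is a minimal sketch that creates the two directories from the config and writes that file. The function name `prepare_zk_dirs` is made up for illustration and is not part of ZooKeeper.

```shell
#!/bin/sh
# prepare_zk_dirs DATA_DIR LOG_DIR SERVER_ID  (hypothetical helper)
# Creates the snapshot and transaction-log directories and writes the
# server id into DATA_DIR/myid (the id must match N in server.N).
prepare_zk_dirs() {
    data_dir=$1
    log_dir=$2
    server_id=$3
    mkdir -p "$data_dir" "$log_dir" || return 1
    printf '%s\n' "$server_id" > "$data_dir/myid"
}

# Usage with the values from zoo.cfg above (uncomment to run):
# prepare_zk_dirs /opt/soft/zookeeper/zkData /opt/soft/zookeeper/zkLog 1
```

Creating the directories up front also makes it easier to spot permission problems before the server is started.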

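HBase

HBase is an open-source, distributed, column-oriented database for storing and processing large volumes of structured data. It offers high scalability, high performance, elastic storage, and real-time queries, which makes it a good fit for large-scale storage and analysis workloads. In the big data field it is widely used for log analysis, user behavior analysis, and online interactive applications.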

Configuring HBase

hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>hadoop:2181</value>
</property>

<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>./tmp</value>
</property>
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
</property>
</configuration>

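Note

  • hbase.rootdir: the HDFS directory where HBase stores its data.
  • hbase.zookeeper.quorum: the address of the ZooKeeper server(s).
  • hbase.cluster.distributed: set to true to run HBase in distributed mode, which is required when using an external ZooKeeper instance instead of the one bundled with HBase.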

conf/hbase-env.sh
export JAVA_HOME=/usr/local/jdk
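
Because the configuration above points HBase at a separately configured ZooKeeper, hbase-env.sh usually also needs `HBASE_MANAGES_ZK=false` so that the HBase start script does not launch its own ZooKeeper. The following is a sketch that appends the setting only when it is missing. `set_env_once` is a hypothetical helper, and `HBASE_HOME` in the usage line is an assumed install location.

```shell
# set_env_once FILE NAME VALUE  (hypothetical helper)
# Appends "export NAME=VALUE" to FILE unless an export of NAME is already there.
set_env_once() {
    file=$1
    name=$2
    value=$3
    if ! grep -q "^export ${name}=" "$file" 2>/dev/null; then
        printf 'export %s=%s\n' "$name" "$value" >> "$file"
    fi
}

# Usage (HBASE_HOME is an assumption, adjust to your install path):
# set_env_once "$HBASE_HOME/conf/hbase-env.sh" HBASE_MANAGES_ZK false
```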

Starting and Stopping HBase

  • To start HBase, open a terminal, go to the HBase installation directory, and run:
bin/start-hbase.sh
  • To stop HBase, run:
bin/stop-hbase.sh
  • To use the HBase shell, open a terminal and run:
hbase shell
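
With an external ZooKeeper, the usual order is to start HDFS and ZooKeeper first and HBase afterwards. The sketch below checks `jps` output for the expected daemons. `has_daemon` is a hypothetical helper, and the names it looks for (QuorumPeerMain for ZooKeeper, HMaster and HRegionServer for HBase) are the usual JVM main-class names rather than anything taken from this setup.

```shell
# has_daemon NAME  (hypothetical helper)
# Reads `jps` output on stdin and succeeds if a JVM whose main class
# is exactly NAME is listed.
has_daemon() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Typical order on a single node (uncomment on a machine with both installed):
# bin/zkServer.sh start        # run inside the ZooKeeper directory
# bin/start-hbase.sh           # run inside the HBase directory
# jps | has_daemon QuorumPeerMain && echo "ZooKeeper is running"
# jps | has_daemon HMaster        && echo "HMaster is running"
# jps | has_daemon HRegionServer  && echo "HRegionServer is running"
```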

HBase Example

The ImportTsv tool that ships with HBase can import a CSV file. It converts the rows of the file into HBase KeyValue cells and loads them into the specified table.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,info:dt,info:AverageTemperature,info:uncertainty,info:state,info:country stateTemperatures hdfs://node1:8020/weather/stateTemperatures.csv
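
In its default (non-bulk) mode ImportTsv writes into an existing table, so the table and its `info` column family have to be created first, and the CSV file has to be in HDFS already. The sketch below builds the `-Dimporttsv.columns` value from a family name and a list of qualifiers. `importtsv_columns` is a hypothetical helper, and the local file name in the commented commands is an assumption based on the HDFS path used above.

```shell
# importtsv_columns FAMILY QUALIFIER...  (hypothetical helper)
# Prints HBASE_ROW_KEY followed by FAMILY:QUALIFIER for each qualifier,
# comma-separated, in the same order as the fields in the CSV file.
importtsv_columns() {
    family=$1
    shift
    spec=HBASE_ROW_KEY
    for qualifier in "$@"; do
        spec="$spec,$family:$qualifier"
    done
    printf '%s\n' "$spec"
}

# Prerequisites and import (uncomment on a machine with HBase and HDFS):
# echo "create 'stateTemperatures', 'info'" | hbase shell -n
# hdfs dfs -put stateTemperatures.csv /weather/
# hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
#   -Dimporttsv.separator="," \
#   -Dimporttsv.columns="$(importtsv_columns info dt AverageTemperature uncertainty state country)" \
#   stateTemperatures hdfs://node1:8020/weather/stateTemperatures.csv
```

Generating the column list this way keeps the family prefix consistent, since a typo in one `family:qualifier` pair would otherwise only show up after the import.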

Scan the table to check the imported data:

[hadoop@node1 ~]$ hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.5.5, r7ebd4381261fefd78fc2acf258a95184f4147cee, Thu Jun 1 17:42:49 PDT 2023
Took 0.0021 seconds
hbase:001:0> scan 'stateTemperatures',{LIMIT => 2}
ROW COLUMN+CELL
1 column=info:AverageTemperature, timestamp=2023-07-04T15:42:30.348, value=25.544
1 column=info:country, timestamp=2023-07-04T15:42:30.348, value=Brazil
1 column=info:dt, timestamp=2023-07-04T15:42:30.348, value=1855-05-01
1 column=info:state, timestamp=2023-07-04T15:42:30.348, value=Acre
1 column=info:uncertainty, timestamp=2023-07-04T15:42:30.348, value=1.171
10 column=info:AverageTemperature, timestamp=2023-07-04T15:42:30.348, value=24.658
10 column=info:country, timestamp=2023-07-04T15:42:30.348, value=Brazil
10 column=info:dt, timestamp=2023-07-04T15:42:30.348, value=1856-02-01
10 column=info:state, timestamp=2023-07-04T15:42:30.348, value=Acre
10 column=info:uncertainty, timestamp=2023-07-04T15:42:30.348, value=1.147
2 row(s)
Took 0.5937 seconds
hbase:002:0>
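
The scan returns row `10` right after row `1` because HBase sorts row keys as byte strings, not as numbers. The same ordering can be reproduced with a byte-wise sort. Zero-padding the key is one common way to make byte order match numeric order. The `pad_key` helper and the width of 6 digits are arbitrary choices for illustration.

```shell
# Byte-wise (lexicographic) order, the way HBase orders row keys.
printf '%s\n' 1 2 10 | LC_ALL=C sort

# pad_key N  (hypothetical helper)
# Zero-pads a decimal key to 6 digits so that byte order equals numeric order.
pad_key() {
    printf '%06d\n' "$1"
}

# The padded keys sort in numeric order even with a byte-wise sort.
for k in 1 2 10; do
    pad_key "$k"
done | LC_ALL=C sort
```

The first sort prints 1, 10, 2; the second prints 000001, 000002, 000010.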