Using Spark SQL from the Shell
Spark supports running Spark SQL directly from the command line, using Hive's database and table metadata under the hood. Spark SQL provides two tools for this: the CLI and ThriftServer. Both require the Hive metastore to be started first.
1. Environment Setup
1.1 Install Hive 3.1.3
The latest Spark 3.x release, Spark 3.5, supports Hive up to version 3.1.3; Hive 4.0 is not supported. For the installation steps, refer to the Hive installation guide. In addition, add the following properties supported by Hive 3.1.3:
```xml
<!-- Show column headers in interactive shell output -->
<property>
    <name>hive.cli.print.header</name>
    <value>true</value>
</property>
<!-- Show the current database name in the prompt -->
<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
</property>
```
Also add the configuration that enables dynamic partition writes:
```xml
<!-- Allow dynamic partition inserts without a static partition column -->
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
</property>
```
Once configured, start the Hive 3.1.3 metastore service:
```sh
[jack@hadoop103 ~]$ hive --service metastore &
```
1.2 Configure Spark
Edit conf/spark-defaults.conf and add the following:
```ini
spark.sql.hive.metastore.version 3.1.3
spark.sql.hive.metastore.jars path
spark.sql.hive.metastore.jars.path file:///opt/module/hive-3.1.3/lib/*
```
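If you prefer not to edit spark-defaults.conf, the same three settings can be passed per invocation with `--conf`. A sketch, assuming the Hive lib path used above:

```shell
spark-sql \
  --conf spark.sql.hive.metastore.version=3.1.3 \
  --conf spark.sql.hive.metastore.jars=path \
  --conf "spark.sql.hive.metastore.jars.path=file:///opt/module/hive-3.1.3/lib/*"
```

Quoting the last value keeps the shell from expanding the trailing `*` before Spark sees it.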
Copy hive-site.xml into Spark's conf directory:
```sh
cp /opt/module/hive-3.1.3/conf/hive-site.xml /opt/module/spark-3.4.2/conf/
```
2. Using the CLI
```sh
[jack@hadoop103 ~]$ spark-sql
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/17 10:07:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2025-02-17T10:07:37,778 INFO [main] org.apache.hadoop.hive.conf.HiveConf - Found configuration file file:/opt/module/hive-3.1.3/conf/hive-site.xml
Hive Session ID = 74a0e0d4-484a-4cf4-9abd-4155d238bcef
2025-02-17T10:07:37,976 INFO [main] SessionState - Hive Session ID = 74a0e0d4-484a-4cf4-9abd-4155d238bcef
2025-02-17T10:07:38,163 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Trying to connect to metastore with URI thrift://hadoop103:9083
2025-02-17T10:07:38,189 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Opened a connection to metastore, current connections: 1
2025-02-17T10:07:38,223 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Connected to metastore.
2025-02-17T10:07:38,224 INFO [main] org.apache.hadoop.hive.metastore.RetryingMetaStoreClient - RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=jack (auth:SIMPLE) retries=1 delay=1 lifetime=0
Spark master: local[*], Application Id: local-1739758055536
spark-sql (default)> show namespaces;
namespace
default
Time taken: 0.321 seconds, Fetched 1 row(s)
```
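Besides the interactive prompt, spark-sql also accepts a single quoted statement via `-e`, which is convenient in scripts. A sketch, assuming the setup above:

```shell
[jack@hadoop103 ~]$ spark-sql -e "show namespaces;"
```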
2.1 Running a SQL Script
Use the `-f` option to point spark-sql at a SQL script file and execute its statements in batch:
```sh
[jack@hadoop103 ~]$ cat test.sql
create external table if not exists t_demo2(id int, name string) partitioned by (age int) LOCATION '/hive3/warehouse/t_demo2';
insert into t_demo2 values (1,'anne', 23);
[jack@hadoop103 ~]$ spark-sql -f test.sql
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/17 10:06:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2025-02-17T10:06:27,863 INFO [main] org.apache.hadoop.hive.conf.HiveConf - Found configuration file file:/opt/module/hive-3.1.3/conf/hive-site.xml
Hive Session ID = 1918f56f-06b8-4185-a201-4cdf8f3f5b61
2025-02-17T10:06:28,072 INFO [main] SessionState - Hive Session ID = 1918f56f-06b8-4185-a201-4cdf8f3f5b61
2025-02-17T10:06:28,264 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Trying to connect to metastore with URI thrift://hadoop103:9083
2025-02-17T10:06:28,290 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Opened a connection to metastore, current connections: 1
2025-02-17T10:06:28,340 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Connected to metastore.
2025-02-17T10:06:28,340 INFO [main] org.apache.hadoop.hive.metastore.RetryingMetaStoreClient - RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=jack (auth:SIMPLE) retries=1 delay=1 lifetime=0
Spark master: local[*], Application Id: local-1739757985889
25/02/17 10:06:29 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
Response code
Time taken: 3.749 seconds
25/02/17 10:06:33 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
25/02/17 10:06:33 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
25/02/17 10:06:33 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
2025-02-17T10:06:41,013 INFO [load-dynamic-partitions-0] hive.ql.metadata.Hive - New loading path = hdfs://hadoop102:8020/hive3/warehouse/t_demo2/.hive-staging_hive_2025-02-17_10-06-36_967_4186639222698019620-1/-ext-10000/age=23 with partSpec {age=23}
2025-02-17T10:06:41,037 INFO [load-dynamic-partitions-0] org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAccessController - Created SQLStdHiveAccessController for session context : HiveAuthzSessionContext [sessionString=1918f56f-06b8-4185-a201-4cdf8f3f5b61, clientType=HIVECLI]
2025-02-17T10:06:41,039 WARN [load-dynamic-partitions-0] org.apache.hadoop.hive.ql.session.SessionState - METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
......
spark-sql (default)> select * from t_demo2;
2025-02-17T10:08:17,774 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Trying to connect to metastore with URI thrift://hadoop103:9083
2025-02-17T10:08:17,775 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Opened a connection to metastore, current connections: 3
2025-02-17T10:08:17,776 INFO [main] org.apache.hadoop.hive.metastore.HiveMetaStoreClient - Connected to metastore.
2025-02-17T10:08:17,776 INFO [main] org.apache.hadoop.hive.metastore.RetryingMetaStoreClient - RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=jack (auth:SIMPLE) retries=1 delay=1 lifetime=0
id name age
1 anne 23
Time taken: 1.639 seconds, Fetched 1 row(s)
```
Inspecting the HDFS directory confirms that the data was written into a partition. Note that starting a CLI session launches a dedicated Spark application, which can only serve a single user.
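The partition layout can be verified from the table's HDFS location. A sketch, assuming the LOCATION used in test.sql above; each partition appears as its own subdirectory:

```shell
hdfs dfs -ls /hive3/warehouse/t_demo2
# expect a subdirectory per partition, e.g. /hive3/warehouse/t_demo2/age=23
```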
3. Using ThriftServer
ThriftServer is a distributed SQL execution engine started via Spark's own script ${SPARK_HOME}/sbin/start-thriftserver.sh. It is itself a Spark application whose main class is org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 (essentially a wrapper around HiveServer2), and like any Spark application it consists of one driver and a number of executors.
3.1 Start ThriftServer
```sh
[jack@hadoop103 ~]$ cd /opt/module/spark-3.4.2/
[jack@hadoop103 spark-3.4.2]$ ./sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /opt/module/spark-3.4.2/logs/spark-jack-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop103.out
[jack@hadoop103 spark-3.4.2]$ tail -f /opt/module/spark-3.4.2/logs/spark-jack-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop103.out
```
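The script accepts the usual spark-submit options plus `--hiveconf` overrides for HiveServer2 properties. A sketch of explicitly setting the listen port and bind host (values here match the defaults assumed in this walkthrough):

```shell
./sbin/start-thriftserver.sh \
  --master local[*] \
  --hiveconf hive.server2.thrift.port=10000 \
  --hiveconf hive.server2.thrift.bind.host=hadoop103
```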
3.2 Connect with Beeline
```sh
[jack@hadoop103 spark-3.4.2]$ ./bin/beeline -u jdbc:hive2://hadoop103:10000 -n jack
Connecting to jdbc:hive2://hadoop103:10000
Connected to: Spark SQL (version 3.4.2)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.9 by Apache Hive
0: jdbc:hive2://hadoop103:10000> show tables;
+------------+------------+--------------+
| namespace  | tableName  | isTemporary  |
+------------+------------+--------------+
| default    | t_test     | false        |
+------------+------------+--------------+
1 row selected (0.612 seconds)
```
Between Spark SQL's CLI and ThriftServer, the latter is generally recommended: it is more lightweight overall, since a single ThriftServer instance can serve many Beeline clients, whereas each CLI session is its own single-user Spark application.
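For one-off queries, beeline can also run a statement non-interactively with `-e` and exit. A sketch, reusing the connection string above:

```shell
./bin/beeline -u jdbc:hive2://hadoop103:10000 -n jack -e "select * from t_demo2;"
```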