Skip to content

数仓代码框架和环境搭建

1. 创建项目工程

创建maven工程gmall2023-realtime,工程的pom:

xml
<modules>
    <module>realtime-common</module>
    <module>realtime-dwd</module>
    <module>realtime-dws</module>
    <module>realtime-dim</module>
</modules>

<properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
    <java.version>1.8</java.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <flink.version>1.17.2</flink.version>
    <scala.version>2.12</scala.version>
    <hadoop.version>3.3.6</hadoop.version>
    <flink-cdc.vesion>2.4.2</flink-cdc.vesion>
    <jdbc-connector.version>3.1.0-1.17</jdbc-connector.version>
    <fastjson.version>1.2.83</fastjson.version>
    <hbase.version>2.6.1-hadoop3</hbase.version>
    <slf4j.version>1.7.36</slf4j.version>
</properties>

<dependencies>
    <!-- 每个module都需要的公共依赖,module会自动继承-->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <!--在 idea 运行的时候,可以打开 web 页面-->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-runtime-web</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-json</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-csv</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <!--Flink默认使用的是slf4j记录日志,相当于一个日志的接口,我们这里使用log4j作为具体的日志实现-->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>${slf4j.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>${slf4j.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-to-slf4j</artifactId>
        <version>2.21.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-loader</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-runtime</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-files</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

<dependencyManagement>
    <!-- 工程统一管理依赖版本,module指定才会引入,引入时不需要指定version-->
    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>com.ververica</groupId>
            <artifactId>flink-connector-mysql-cdc</artifactId>
            <version>${flink-cdc.vesion}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-hbase-2.2</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.doris</groupId>
            <artifactId>flink-doris-connector-1.17</artifactId>
            <version>1.6.2</version>
        </dependency>
        <!-- hbase 依赖-->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>commons-beanutils</groupId>
            <artifactId>commons-beanutils</artifactId>
            <version>1.10.1</version>
        </dependency>
        <dependency>
            <groupId>com.janeluo</groupId>
            <artifactId>ikanalyzer</artifactId>
            <version>2012_u6</version>
        </dependency>
        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
            <version>3.3.0</version>
        </dependency>
        <dependency>
            <groupId>io.lettuce</groupId>
            <artifactId>lettuce-core</artifactId>
            <version>6.3.2.RELEASE</version>
        </dependency>
    </dependencies>
</dependencyManagement>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <excludes>
                                <exclude>com.google.code.findbugs:jsr305</exclude>
                                <exclude>org.slf4j:*</exclude>
                                <exclude>log4j:*</exclude>
                                <exclude>org.apache.hadoop:*</exclude>
                            </excludes>
                        </artifactSet>
                        <filters>
                            <filter>
                                <!-- Do not copy the signatures in the META-INF folder.Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                <!-- 打包时不复制META-INF下的签名文件,避免报非法签名文件的SecurityExceptions异常-->
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>

                        <transformers combine.children="append">
                            <!-- The service transformer is needed to merge META-INF/services files -->
                            <!-- connector和format依赖的工厂类打包时会相互覆盖,需要使用ServicesResourceTransformer解决-->
                            <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

父工程将打包时不必包含在内的依赖添加到父工程的pom文件中,这部分依赖的scope均为provided,在父工程的POM文件中通过dependencyManagement标签声明所有依赖的版本。

2. 项目模块设计

本项目准备一个父工程,父工程下创建四个module如下:
(1)realtime-common:用于引入公共的第三方依赖,编写工具类、实体类等。
(2)realtime-dim:用于编写DIM层业务代码。
(3)realtime-dwd:用于编写DWD层业务代码。
(4)realtime-dws:用于编写DWS层业务代码。
其中,后三个module统称为业务模块,业务模块都要将realtime-common模块作为依赖引入。

2.1 realtime-common工程介绍

本项目需要引入大量依赖,除去父类中引入的部分(集群已有的依赖),还有很多集群没有的依赖(下文称为第三方依赖),打包时都要包含在jar包里,这样就会导致业务模块打包时jar包很大,打包时间很长,部署效率不高。为了改善这一情况,我们构建公共模块realtime-common,将第三方依赖都在公共模块的POM文件中引入,业务模块只需要引入realtime-common依赖即可。但要注意,如果realtime-common的scope为默认的compile,业务模块打包时依然会包含所有的第三方依赖,不能解决问题,所以我们将scope更改为provided,并将打好的realtime-common包放到集群Flink的lib目录下。
同时为了精简子模块的代码,我们将基类、工具类、实体类、常量类、通用的自定义函数全部放到realtime-common中。 realtime-common的pom如下:

xml
<dependencies>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka</artifactId>
    </dependency>
    <dependency>
        <groupId>com.ververica</groupId>
        <artifactId>flink-connector-mysql-cdc</artifactId>
    </dependency>
    <!-- hbase 依赖-->
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-hbase-2.2</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.doris</groupId>
        <artifactId>flink-doris-connector-1.17</artifactId>
    </dependency>
    <dependency>
        <groupId>commons-beanutils</groupId>
        <artifactId>commons-beanutils</artifactId>
    </dependency>
    <dependency>
        <groupId>io.lettuce</groupId>
        <artifactId>lettuce-core</artifactId>
    </dependency>
</dependencies>

在resources目录下创建log4j.properties文件:

ini
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %10p (%c:%M) - %m%n
log4j.rootLogger=error,stdout

Flink使用log4j作为默认的日志框架,启动应用程序,控制台会打印日志,在common模块的resources目录下准备一份log4j配置文件即可对所有模块生效。
在realtime-common包下创建如下包结构:
Alt text

2.2 创建子模块realtime-dim

业务模块只需要依赖realtime-common即可:

xml
<dependencies>
    <dependency>
        <groupId>org.rocket.flink</groupId>
        <artifactId>realtime-common</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
</dependencies>

2.3 创建子模块realtime-dwd

业务模块只需要依赖realtime-common即可:

xml
<dependencies>
    <dependency>
        <groupId>org.rocket.flink</groupId>
        <artifactId>realtime-common</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
</dependencies>

2.4 创建子模块realtime-dws

业务模块只需要依赖realtime-common即可:

xml
<dependencies>
    <dependency>
        <groupId>org.rocket.flink</groupId>
        <artifactId>realtime-common</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
</dependencies>

3. 代码托管到Gitee

3.1 安装Gitee插件

Alt text

3.2 使用Git进行版本控制

Alt text

3.3 代码上传到Gitee服务器

Alt text 点击登录账号,会跳转到浏览器进行用户登录授权: Alt text 点击共享: Alt text 选择上传的文件,点击add: Alt text 上传完毕后,右下角会提示: Alt text

4. 数仓组件搭建

4.1 MySQL环境搭建

请参考MySQL安装

4.2 Kafka环境搭建

请参考Kafka集群部署

4.3 Maxwell环境搭建

请参考Maxwell部署

4.4 Flink环境搭建

请参考flink集群部署

4.5 Hbase环境搭建

请参考Hbase安装

4.6 Redis环境搭建

请参考Redis单机版安装

4.7 Doris环境搭建

请参考Doris集群部署