
# Custom Table Data Import Function

# I. Overview

In some cases, the data you need may not be representable as users or events, such as mapping tables or external data. To use such data, import it into the TA system with the data_transfer command; it can then be queried in association with the event and user tables.

The following two import data sources are currently supported:

  • mysql: a remote MySQL database
  • txtfile: local files

# II. Instructions for Use

# 2.1 Command Description

The command for data import is as follows:

ta-tool data_transfer -conf <config files> [--date xxx]

# 2.2 Command Parameter Description

# 2.2.1 -conf

The parameter is the path of the configuration file for the table to import. Each table has its own configuration file, and multiple tables can be imported at the same time. Wildcards are supported, for example: /data/config/* or ./config/*.json

# 2.2.2 --date

Optional parameter --date: indicates the data date. Time macros are replaced relative to this reference time. If the parameter is omitted, the current date is used. The format is YYYY-MM-DD. For details, see section 2.4 on time macros.
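For example, a run that imports every table configuration under a directory, with the data date pinned to 2018-07-01 (the path is illustrative):

ta-tool data_transfer -conf /data/config/*.json --date 2018-07-01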

# 2.3 Description of Configuration File

# 2.3.1 Sample configuration file for a single table

{
  "parallel_num": 2,
  "source": {
    "type": "txtfile",
    "parameter": {
      "path": ["/data/home/ta/importer_test/data/*"],
      "encoding": "UTF-8",
      "column": ["*"],
      "fieldDelimiter": "\t"
    }
  },
  "target": {
    "appid": "test-appid",
    "table": "test_table",
    "table_desc": "Import Test Sheet",
    "partition_value": "@[{yyyyMMdd}-{1day}]",
    "column": [
      {
        "name": "col1",
        "type": "timestamp",
        "comment": "time stamp"
      },
      {
        "name": "col2",
        "type": "varchar"
      }
    ]
  }
}

# 2.3.2 Description of outer layer parameters

  • parallel_num
    • Description: Number of concurrent import threads; controls the import rate
    • Type: int
    • Required: Yes
    • Default: None
  • source
    • Description: Configuration of specific parameters for importing data sources
    • Type: jsonObject
    • Required: Yes
    • Default: None
  • target
    • Description: Parameter configuration of the target table in the TA system
    • Type: jsonObject
    • Required: Yes
    • Default: None

# 2.3.3 Specification of source parameters

  • type
    • Description: Type of the import data source. The import tool currently supports three source types: txtfile, mysql, and ftp; more data sources will be added later
    • Type: string
    • Required: Yes
    • Default: None
  • parameter
    • Description: Parameter configuration specific to the data source type; see section III for the parameters of each source
    • Type: jsonObject
    • Required: Yes
    • Default: None

# 2.3.4 Specification of target parameters

  • appid
    • Description: The appid of the project the table belongs to, which can be found in the TA system's backend
    • Type: string
    • Required: Yes
    • Default: None
  • table
    • Description: The table name in the TA system. Note: table names must be globally unique, so it is recommended to add a distinguishing prefix or suffix per project
    • Type: string
    • Required: Yes
    • Default: None
  • table_desc
    • Description: Comment of the imported table. Configuring this parameter is recommended so that the table's meaning is clear in later queries
    • Type: string
    • Required: No
    • Default: empty
  • partition_value
    • Description: The partition value for the import. Custom tables imported into the TA system carry the partition field $pt by default, so a partition value must be specified when importing. It can usually be set to the data date, for example 20180701; time macro replacement is also supported, for example @[{yyyyMMdd}-{1day}]. Section 2.4 describes time macros in detail
    • Type: string
    • Required: Yes
    • Default: None
  • column
    • Description: Definition of the table fields imported into the TA system, with three attributes per field: name, type, and comment, of which name and type are required. A sample follows:
[
  {
    "name": "col1",
    "type": "timestamp",
    "comment": "time stamp"
  },
  {
    "name": "col2",
    "type": "varchar"
  }
]

When the source is mysql and the whole table is imported (that is, the source column field is ["*"]), the column parameter in the target may be omitted, and the import tool will follow the table structure in mysql. In all other cases this field is required

  • Type: jsonArray
  • Required: No
  • Default: mysql source table schema definition

# 2.4 Time Macros Usage

Inside the configuration file, time macros can be used in place of time parameters. The ta-tool takes the import start time as the reference, computes the time offset given by the macro's parameters, and replaces the macro in the configuration file. Supported time macro formats: @[{yyyyMMdd}], @[{yyyyMMdd}-{nday}], @[{yyyyMMdd}+{nday}], etc.

  • yyyyMMdd can be replaced with any date pattern that Java's SimpleDateFormat can parse, for example: yyyy-MM-dd HH:mm:ss.SSS, yyyyMMddHH000000
  • n can be any integer, representing the time offset
  • day is the unit of the time offset and can be one of: day, hour, minute, week, month
  • Example: suppose the current time is 2018-07-01 15:13:23.234
    • @[{yyyyMMdd}] is replaced with 20180701
    • @[{yyyy-MM-dd}-{1day}] is replaced with 2018-06-30
    • @[{yyyyMMddHH}+{2hour}] is replaced with 2018070117
    • @[{yyyyMMddHHmm00}-{10minute}] is replaced with 20180701150300
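Combined with the --date parameter from section 2.2.2, the reference time can be pinned to a specific day instead of the current date (the config path is illustrative):

ta-tool data_transfer -conf ./config/test_table.json --date 2018-07-01

With this reference date, @[{yyyyMMdd}-{1day}] in the configuration resolves to 20180630.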

# III. Configuration of Import Data Sources

This section describes the parameter configuration of the different data sources. Currently supported: txtfile, mysql, and ftp. Adjust the source parameters according to the data source used.

# 3.1 mysql data source

This data source connects to the remote MySQL database through a JDBC connector, generates a SELECT SQL statement based on the user's configuration, sends it to the remote MySQL database, and imports the result of the SQL execution into a table of the TA system.
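For illustration, given the table, column, and where parameters described in section 3.1.2, the generated query is conceptually of the following shape (a sketch after time macro replacement, not the tool's literal internal SQL):

select db_id, log_time from test_table where log_time >= '2018-06-30 00:00:00' and log_time < '2018-07-01 00:00:00'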

# 3.1.1 Sample configuration

  • A sample configuration example of importing the entire mysql table into the TA system:
{
  "parallel_num": 2,
  "source": {
    "type": "mysql",
    "parameter": {
      "username": "test",
      "password": "test",
      "column": ["*"],
      "connection": [
        {
          "table": ["test_table"],
          "jdbcUrl": ["jdbc:mysql://mysql-ip:3306/testDb"]
        }
      ]
    }
  },
  "target": {
    "appid": "test-appid",
    "table": "test_table_abc",
    "table_desc": "mysql test table",
    "partition_value": "@[{yyyy-MM-dd}-{1day}]"
  }
}
  • Custom SQL import to TA system configuration sample:
{
  "parallel_num": 1,
  "source": {
    "type": "mysql",
    "parameter": {
      "username": "test",
      "password": "test",
      "connection": [
        {
          "querySql": [
            "select db_id,log_time from test_table where log_time>='@[{yyyy-MM-dd 00:00:00}-{1day}]' and log_time<'@[{yyyy-MM-dd 00:00:00}]'"
          ],
          "jdbcUrl": ["jdbc:mysql://mysql-ip:3306/testDb"]
        }
      ]
    }
  },
  "target": {
    "appid": "test-appid",
    "table": "test_table_abc",
    "table_desc": "mysql test table",
    "partition_value": "@[{yyyy-MM-dd}-{1day}]",
    "column": [
      {
        "name": "db_id",
        "type": "bigint",
        "comment": "db serial number"
      },
      {
        "name": "log_time",
        "type": "timestamp",
        "comment": "time stamp"
      }
    ]
  }
}

# 3.1.2 parameter description

  • jdbcUrl
    • Description: JDBC connection information for the remote database, described as a JSON array. Note that jdbcUrl must be placed inside the connection configuration unit. In general the JSON array contains a single JDBC connection string.
    • Type: jsonArray
    • Required: Yes
    • Default: None
  • username
    • Description: User name of the data source
    • Type: string
    • Required: Yes
    • Default: None
  • password
    • Description: Password for the user name specified by the data source
    • Type: string
    • Required: Yes
    • Default: None
  • table
    • Description: The tables to synchronize, described as a JSON array, so multiple tables can be extracted at the same time. When configuring multiple tables, the user must ensure they share the same schema; the MySQL reader does not check whether they form a consistent logical table. Note that table must be placed inside the connection configuration unit.
    • Type: jsonArray
    • Required: Yes
    • Default: None
  • column
    • Description: The set of column names to synchronize in the configured table, described as a JSON array. * selects all columns, for example ["*"].
    • Type: jsonArray
    • Required: Yes
    • Default: None
  • where
    • Description: Filter condition. The import tool splices a SQL statement from the configured column, table, and where parameters and extracts data with it. In practice, the previous day's data is often synchronized, with the where condition specified as log_time >= '@[{yyyy-MM-dd 00:00:00}-{1day}]' and log_time < '@[{yyyy-MM-dd 00:00:00}]'. Note: the where condition cannot be specified as limit 10, since limit is not a legal SQL where clause. The where condition is an effective way to synchronize business increments; if it is not filled in, the import tool synchronizes the full data. A configuration sketch follows this list.
    • Type: string
    • Required: No
    • Default: None
  • querySql
    • Description: In some business scenarios where is not sufficient to describe the filter, and a custom query SQL can be defined through this parameter, for example data that must be synchronized after a multi-table join: select a,b from table_a join table_b on table_a.id = table_b.id. When querySql is configured, the import tool ignores the table, column, and where parameters and extracts data directly with this SQL; querySql takes priority over table, column, and where.
    • Type: string
    • Required: No
    • Default: None
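A minimal mysql source sketch using the where parameter (credentials, table name, and URL are placeholders):

{
  "type": "mysql",
  "parameter": {
    "username": "test",
    "password": "test",
    "column": ["db_id", "log_time"],
    "where": "log_time >= '@[{yyyy-MM-dd 00:00:00}-{1day}]' and log_time < '@[{yyyy-MM-dd 00:00:00}]'",
    "connection": [
      {
        "table": ["test_table"],
        "jdbcUrl": ["jdbc:mysql://mysql-ip:3306/testDb"]
      }
    ]
  }
}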

# 3.2 Txtfile Data Source

The txtfile data source reads files on the local server and imports them into a table of the TA system. The current usage restrictions and characteristics of txtfile are as follows:

  1. Supports reading TXT files only, and requires the content to be a two-dimensional table
  2. Supports CSV-like files with custom delimiters
  3. Supports multiple data types (represented as strings), column pruning, and column constants
  4. Supports recursive reading and filename filtering
  5. Supports text compression; the available formats are zip, gzip, and bzip2

# 3.2.1 Sample configuration

{
  "parallel_num": 5,
  "source": {
    "type": "txtfile",
    "parameter": {
      "path": ["/home/ftp/data/testData/*"],
      "column": [
        {
          "index": 0,
          "type": "long"
        },
        {
          "index": 1,
          "type": "string"
        }
      ],
      "encoding": "UTF-8",
      "fieldDelimiter": "\t"
    }
  },
  "target": {
    "appid": "test-appid",
    "table": "test_table_abc",
    "table_desc": "mysql test table",
    "partition_value": "@[{yyyy-MM-dd}-{1day}]",
    "column": [
      {
        "name": "db_id",
        "type": "bigint",
        "comment": "db serial number"
      },
      {
        "name": "log_time",
        "type": "timestamp",
        "comment": "time stamp"
      }
    ]
  }
}

# 3.2.2 parameter description

  • path
    • Description: Path(s) in the local file system. Multiple paths may be given. When a wildcard is specified, the import tool tries to traverse all matching files: for example, /data/* means reading every file in the /data directory. Currently only * is supported as a file wildcard. Note in particular that the import tool treats all text files synchronized in one job as a single data table, so the user must ensure that all files fit the same schema, and each file must be readable in a CSV-like format.
    • Type: string
    • Required: Yes
    • Default: None
  • column
    • Description: The list of fields to read. type specifies the type of the source data, index specifies the column's position in the text (starting from 0), and value defines a constant column: it is not read from the source file but generated from the given value.

By default, all data can be read as the string type, configured as follows:

 "column": ["*"]

The user can also specify the column field information explicitly, configured as follows:

[
  {
    "type": "long",
    "index": 0
  },
  {
    "type": "string",
    "value": "2018-07-01 00:00:00"
  }
]

For user-specified column information, type must be filled in, and exactly one of index or value must be provided.

type may be one of: long, double, string, boolean

  • Type: jsonArray
  • Required: Yes
  • Default: None
  • fieldDelimiter
    • Description: Field delimiter used when reading the files
    • Type: string
    • Required: Yes
    • Default value: ,
  • compress
    • Description: Text compression type. Leaving it empty (the default) means no compression. Supported compression types: zip, gzip, bzip2.
    • Type: string
    • Required: No
    • Default: No compression
  • encoding
    • Description: Encoding of the files being read.
    • Type: string
    • Required: No
    • Default: utf-8
  • skipHeader
    • Description: A CSV-like file may have a header line as its title, which needs to be skipped. Not skipped by default.
    • Type: boolean
    • Required: No
    • Default: false
  • nullFormat
    • Description: No standard string can express null (a null pointer) in text files, so ta-tool provides nullFormat to define which string is treated as null. For example, if the user configures nullFormat: "\N", ta-tool treats source data equal to \N as a null field. A source sketch combining these optional parameters follows.
    • Type: string
    • Required: No
    • Default: \N
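A txtfile source sketch combining the optional parameters above (the path is a placeholder):

{
  "type": "txtfile",
  "parameter": {
    "path": ["/data/home/ta/import_test/*"],
    "column": ["*"],
    "fieldDelimiter": ",",
    "encoding": "UTF-8",
    "compress": "gzip",
    "skipHeader": true,
    "nullFormat": "\\N"
  }
}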