
# Custom Table Data Import Function

# I. Overview

In some cases, the data you need may not be representable as users or events, for example mapping tables or other external data. To use such data, import it into the TA system as a custom table with the data_transfer command; custom tables can then be joined with the event table and the user info table.

The following import data sources are currently supported:

  • mysql: remote mysql database
  • txtfile: local file

# II. Instructions for Use

# 2.1 Command description

The command for data import is as follows:

ta-tool data_transfer -conf <config files> [--date xxx]

# 2.2 Command parameter description

# 2.2.1 -conf

Specifies the path(s) of the configuration files of the imported tables. Each table has its own configuration file, and multiple tables can be imported at the same time. Wildcards are supported, e.g. /data/config/*.json or ./config/*.json

# 2.2.2 --date

Optional parameter --date: represents the data date; time macros are replaced relative to this reference time. If the parameter is omitted, the current date is used by default. The format is YYYY-MM-DD. For details on time macros, please refer to section 2.4, Time macro usage
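For example, the following invocation (the paths are placeholders) imports every table whose configuration file matches the wildcard and evaluates time macros relative to 2018-07-01:

ta-tool data_transfer -conf /data/config/*.json --date 2018-07-01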

# 2.3 Configuration file description

# 2.3.1 The sample configuration file for a single table is as follows:

{
  "parallel_num": 2,
  "source": {
    "type": "txtfile",
    "parameter": {
      "path": ["/data/home/ta/importer_test/data/*"],
      "encoding": "UTF-8",
      "column": ["*"],
      "fieldDelimiter": "\t"
    }
  },
  "target": {
    "appid": "test-appid",
    "table": "test_table",
    "table_desc": "import test table",
    "partition_value": "@[{yyyyMMdd}-{1day}]",
    "column": [
      {
        "name": "col1",
        "type": "timestamp",
        "comment": "timestamp"
      },
      {
        "name": "col2",
        "type": "varchar"
      }
    ]
  }
}

# 2.3.2 Description of outer layer parameters

  • parallel_num
    • Description: number of concurrent import threads; controls the import rate
    • Type: int
    • Required: Yes
    • Default value: none
  • source
    • Description: configuration of specific parameters for importing data sources
    • Type: jsonObject
    • Required: Yes
    • Default value: none
  • target
    • Description: configuration of the specific parameters of the target table
    • Type: jsonObject
    • Required: Yes
    • Default value: none

# 2.3.3 Specification of source parameters

  • type
    • Description: type of the import data source. The import tool currently provides the txtfile, mysql and ftp source types; more data sources will be supported later
    • Type: string
    • Required: Yes
    • Default value: none
  • parameter
    • Description: parameter configuration specific to the chosen data source type; see section III for the parameters of each data source
    • Type: jsonObject
    • Required: Yes
    • Default value: none

# 2.3.4 Specification of target parameters

  • appid
    • Description: appid of the project the table is imported into; it can be found in the TA system's management backend
    • Type: string
    • Required: Yes
    • Default value: none
  • table
    • Description: name of the table in the TA system. Note: the table name must be globally unique, so it is recommended to add a distinguishing prefix or suffix for each project
    • Type: string
    • Required: Yes
    • Default value: none
  • table_desc
    • Description: comment of the imported table. Configuring this parameter is recommended, as it makes the table easier to find later and clarifies its meaning
    • Type: string
    • Required: No
    • Default value: none
  • partition_value
    • Description: partition value for the import. Custom tables imported into the TA system always carry the partition field $pt, so a partition value must be specified when importing. It is usually set to the data date, for example 20180701, and time macro replacement is also supported, for example @[{yyyyMMdd}-{1day}]. Section 2.4 introduces the specific usage
    • Type: string
    • Required: Yes
    • Default value: none
  • column
    • Description: field definitions of the table imported into the TA system, each with the 3 properties name, type and comment, where name and type are required. A sample is as follows:
[
  {
    "name": "col1",
    "type": "timestamp",
    "comment": "timestamp"
  },
  {
    "name": "col2",
    "type": "varchar"
  }
]

When the source is mysql and the whole table is imported (that is, the source column field is ["*"]), the column parameter in target may be omitted, and the import tool follows the table schema in mysql (sample 3.1.1 shows this case). In all other cases this field is required.

    • Type: jsonArray
    • Required: No
    • Default value: the schema definition of the mysql source table

# 2.4 Time macro usage

You can replace time parameters with time macros inside the configuration file. ta-tool uses the import start time as the reference, computes the time offset given by the macro parameters, and replaces the macro in the configuration file. The available time macro formats are @[{yyyyMMdd}], @[{yyyyMMdd}-{nday}], @[{yyyyMMdd}+{nday}], etc.

  • yyyyMMdd can be replaced with any date format that Java's DateFormat can parse, for example yyyy-MM-dd HH:mm:ss.SSS or yyyyMMddHH000000
  • n can be any integer, representing the offset value of the time
  • day represents the offset unit of the time; the available units are day, hour, minute, week and month
  • Example: Suppose the current time is 2018-07-01 15:13:23.234
    • @[{yyyyMMdd}] is replaced with 20180701
    • @[{yyyy-MM-dd}-{1day}] is replaced with 2018-06-30
    • @[{yyyyMMddHH}+{2hour}] is replaced with 2018070117
    • @[{yyyyMMddHHmm00}-{10minute}] is replaced with 20180701150300
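Combined with --date, time macros let partition values follow the data date rather than the wall clock. As a sketch (the configuration path is a placeholder): if the configuration contains "partition_value": "@[{yyyyMMdd}-{1day}]", then

ta-tool data_transfer -conf ./config/test_table.json --date 2018-07-02

imports into partition 20180701, because the macro is evaluated relative to 2018-07-02 instead of the current date.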

# III. Configuration of Import Data Sources

This section introduces the parameter configuration of the different data sources. Currently the txtfile, mysql and ftp data sources are supported; the txtfile and mysql configurations are described below. Adjust the parameters of source according to the data source you use.

# 3.1 mysql data source

This data source connects to the remote mysql database through a JDBC connector, generates a SELECT statement from the user's configuration, sends it to the remote database, and imports the query result into the TA system table.

# 3.1.1 Sample configuration

  • A sample configuration for importing an entire mysql table into the TA system:
{
  "parallel_num": 2,
  "source": {
    "type": "mysql",
    "parameter": {
      "username": "test",
      "password": "test",
      "column": ["*"],
      "connection": [
        {
          "table": ["test_table"],
          "jdbcUrl": ["jdbc:mysql://mysql-ip:3306/testDb"]
        }
      ]
    }
  },
  "target": {
    "appid": "test-appid",
    "table": "test_table_abc",
    "table_desc": "mysql test table",
    "partition_value": "@[{yyyy-MM-dd}-{1day}]"
  }
}
  • A sample configuration for importing the result of a custom SQL query into the TA system:
{
  "parallel_num": 1,
  "source": {
    "type": "mysql",
    "parameter": {
      "username": "test",
      "password": "test",
      "connection": [
        {
          "querySql": [
            "select db_id,log_time from test_table where log_time>='@[{yyyy-MM-dd 00:00:00}-{1day}]' and log_time<'@[{yyyy-MM-dd 00:00:00}]'"
          ],
          "jdbcUrl": ["jdbc:mysql://mysql-ip:3306/testDb"]
        }
      ]
    }
  },
  "target": {
    "appid": "test-appid",
    "table": "test_table_abc",
    "table_desc": "mysql test table",
    "partition_value": "@[{yyyy-MM-dd}-{1day}]",
    "column": [
      {
        "name": "db_id",
        "type": "bigint",
        "comment": "db serial number"
      },
      {
        "name": "log_time",
        "type": "timestamp",
        "comment": "time stamp"
      }
    ]
  }
}

# 3.1.2 Parameter description

  • jdbcUrl
    • Description: JDBC connection information of the source database, described as a JSON array. Note that jdbcUrl must be included inside the connection configuration unit. In general the array only needs to contain a single JDBC connection string.
    • Type: jsonArray
    • Required: Yes
    • Default value: none
  • username
    • Description: user name of the data source
    • Type: string
    • Required: Yes
    • Default value: none
  • password
    • Description: password of the user name specified for the data source
    • Type: string
    • Required: Yes
    • Default value: none
  • table
    • Description: the table(s) to synchronize, described as a JSON array, so multiple tables can be extracted at the same time. When multiple tables are configured, the user must ensure that they share the same schema; the mysql reader does not check whether they form a consistent logical table. Note that table must be included inside the connection configuration unit.
    • Type: jsonArray
    • Required: Yes
    • Default value: none
  • column
    • Description: set of column names to synchronize from the configured table, described as a JSON array. You can use * to select all columns, e.g. ["*"].
    • Type: jsonArray
    • Required: Yes
    • Default value: none
  • where
    • Description: filter condition. The import tool splices the SQL from the configured column, table and where condition and extracts data according to that SQL. In a typical scenario the data of the previous day is synchronized, with the where condition set to log_time>='@[{yyyy-MM-dd 00:00:00}-{1day}]' and log_time<'@[{yyyy-MM-dd 00:00:00}]'. Note: the where condition cannot be specified as limit 10; limit is not a legal where clause in SQL. The where condition is an effective way to synchronize business increments; if it is omitted, the import tool synchronizes the full table. A sample source block follows this list.
    • Type: string
    • Required: No
    • Default value: none
  • querySql
    • Description: in some business scenarios the where item is not sufficient to describe the filter, so users can define a custom query with this parameter. When querySql is configured, the import tool ignores the table, column and where configurations and filters the data directly with the content of this item. For example, to synchronize the result of a multi-table join: select a,b from table_a join table_b on table_a.id = table_b.id. querySql has higher priority than the table, column and where options.
    • Type: string
    • Required: No
    • Default value: none
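As referenced under where above, a hedged sketch of a source block that combines table, column and where to synchronize the previous day's increment (the credentials, host, database and table names are placeholders reused from the samples above):

"source": {
  "type": "mysql",
  "parameter": {
    "username": "test",
    "password": "test",
    "column": ["db_id", "log_time"],
    "where": "log_time>='@[{yyyy-MM-dd 00:00:00}-{1day}]' and log_time<'@[{yyyy-MM-dd 00:00:00}]'",
    "connection": [
      {
        "table": ["test_table"],
        "jdbcUrl": ["jdbc:mysql://mysql-ip:3306/testDb"]
      }
    ]
  }
}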

# 3.2 txtfile data source

The txtfile data source reads files on the local server and imports them into a TA system table. The current usage restrictions and characteristics of txtfile are as follows:

  1. Supports reading TXT files only, and requires the content of the TXT file to form a two-dimensional table
  2. Supports CSV-like files with custom delimiters
  3. Supports reading multiple data types (carried as strings), column pruning and column constants
  4. Supports recursive directory reading and file name filtering
  5. Supports compressed text; the available compression formats are zip, gzip and bzip2

# 3.2.1 Sample configuration

{
  "parallel_num": 5,
  "source": {
    "type": "txtfile",
    "parameter": {
      "path": ["/home/ftp/data/testData/*"],
      "column": [
        {
          "index": 0,
          "type": "long"
        },
        {
          "index": 1,
          "type": "string"
        }
      ],
      "encoding": "UTF-8",
      "fieldDelimiter": "\t"
    }
  },
  "target": {
    "appid": "test-appid",
    "table": "test_table_abc",
    "table_desc": "mysql test table",
    "partition_value": "@[{yyyy-MM-dd}-{1day}]",
    "column": [
      {
        "name": "db_id",
        "type": "bigint",
        "comment": "db serial number"
      },
      {
        "name": "log_time",
        "type": "timestamp",
        "comment": "time stamp"
      }
    ]
  }
}

# 3.2.2 Parameter description

  • path
    • Description: path information in the local file system. Note that multiple paths may be given. When a wildcard is specified, the import tool tries to traverse all matching files: for example, /data/* means reading all files in the /data directory. Currently only * is supported as a file wildcard. It is particularly important to note that the import tool treats all text files synchronized in one job as the same data table, so the user must ensure that all files share the same schema. Files must be readable in a CSV-like format.
    • Type: string
    • Required: Yes
    • Default value: none
  • column
    • Description: list of fields to read. type specifies the type of the source data; index specifies the position of the column in the text file (starting from 0); value defines the column as a constant: the tool does not read it from the source file but generates the column from value.

By default, users can read all data according to the string type, configured as follows:

 "column": ["*"]

The user can specify the Column field information, configured as follows:

[
  {
    "type": "long",
    "index": 0
  },
  {
    "type": "string",
    "value": "2018-07-01 00:00:00"
  }
]

For a user-specified column, type must be filled in, and exactly one of index or value must be chosen.

The value range of type is: long, double, string and boolean

    • Type: jsonArray
    • Required: Yes
    • Default value: none
  • fieldDelimiter
    • Description: field delimiter for reading
    • Type: string
    • Required: Yes
    • Default value: ,
  • compress
    • Description: text compression type. If omitted, the files are assumed to be uncompressed. The supported compression types are zip, gzip and bzip2.
    • Type: string
    • Required: No
    • Default value: no compression
  • encoding
    • Description: encoding of the files to be read.
    • Type: string
    • Required: No
    • Default value: utf-8
  • skipHeader
    • Description: a CSV-like file may have a header row that should be skipped; by default it is not skipped.
    • Type: boolean
    • Required: No
    • Default value: false
  • nullFormat
    • Description: there is no standard string for representing null (a null pointer) in text files, so ta-tool provides nullFormat to define which string is treated as null. For example, if the user configures nullFormat:"\N", then a source value of "\N" is treated as a null field by ta-tool.
    • Type: string
    • Required: No
    • Default value: \N
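Pulling the optional parameters above together, a hedged sample source block that reads gzip-compressed files, skips a header row and treats \N as null (the path is a placeholder; note that the backslash must be doubled in JSON string syntax):

"source": {
  "type": "txtfile",
  "parameter": {
    "path": ["/data/home/ta/importer_test/data/*.gz"],
    "column": ["*"],
    "fieldDelimiter": ",",
    "encoding": "UTF-8",
    "compress": "gzip",
    "skipHeader": true,
    "nullFormat": "\\N"
  }
}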