
# Data Archiving and Backfilling

# I. Data Archiving Function (ta_data_archive)

The data archiving function migrates historical data, or data that does not need to be used for the time being, to cheap storage, freeing disk resources in the TA cluster and reducing storage costs.

# 1.1 Archiving Commands

```sh
# start
ta-tool data_archive start

# stop
ta-tool data_archive stop

# retry
ta-tool data_archive retry -jobid *******
```

# 1.2 Archiving Methods

# 1.2.1 S3 Method

# 1.2.1.1 Environment Preparation

  1. Apply for the Amazon S3 service
  2. Create a bucket for archiving; the bucket's region should be the same as that of the TA cluster servers
  3. Create an access key for accessing the bucket (a minimal CLI sketch follows this list)
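
As a rough sketch, steps 2 and 3 can be done with the AWS CLI, assuming it is installed and configured with an account that may create buckets and keys; the bucket name, region, and IAM user name below are placeholder values, not values from this document.

```sh
# Create the archive bucket in the same region as the TA cluster
# (bucket name and region are example values).
aws s3 mb s3://my-ta-archive --region cn-north-1

# Create an access key for an existing IAM user that has
# read/write permission on the bucket (user name is an example).
aws iam create-access-key --user-name ta-archive-user
```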

# 1.2.1.2 Command Sample

```
[ta@ta1 ~]$ ta-tool data_archive start
Please enter the JobId for this job (leave blank to generate one automatically)>
------------------------------------------------------------
Please enter the project appid that needs to be archived> 5487f6b**********f9c379aa9bb
------------------------------------------------------------
Please enter the start time for project archiving: YYYY-MM-DD > 2018-01-01
------------------------------------------------------------
Please enter the end time of project archiving: YYYY-MM-DD > 2018-12-31
------------------------------------------------------------
Please enter the event type for the project archive (optional)>
------------------------------------------------------------
Please enter the type of archive storage: hdfs or rsync or s3 > s3
------------------------------------------------------------
Please enter S3 AccessKeyID> AK************YO6G3
------------------------------------------------------------
Please enter S3 SecretAccessKey> J23************rZb
------------------------------------------------------------
Please enter S3 region code> cn-****-1
------------------------------------------------------------
Please enter S3 bucket name> ta************ive
------------------------------------------------------------
Please enter the S3 file storage class (default: STANDARD)> S*****D
------------------------------------------------------------
Please enter the target directory for the project archive> data*****_test
------------------------------------------------------------
```

# 1.2.1.3 Step Description

  1. Enter the jobid, which can be customized or left blank to be generated in the background. A custom jobid makes it easier to rerun the task with the retry command if it fails.
  2. Enter the project appid.
  3. Enter the start date (it must fall outside the most recent month).
  4. Enter the end date (it must also fall outside the most recent month).
  5. Optionally enter an event type to archive only that event type.
  6. For the archive storage type, select s3.
  7. Enter the AccessKeyID for S3.
  8. Enter the SecretAccessKey (managed in the AWS IAM service).
  9. Enter the region code of the bucket.
  10. Enter the bucket name.
  11. Select the storage class (STANDARD by default). The GLACIER and DEEP_ARCHIVE storage classes are designed for low-cost data archiving, but they require a restore (thaw) operation before the data can be recovered, which is more cumbersome.
  12. Enter the target directory for the archive (the directory will be created under the target bucket, and the archived data will be placed in it); see the verification sketch below.
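
After the job completes, one simple way to confirm that the archived files landed in the bucket is to list the target directory with the AWS CLI, assuming it is configured with the same access key; the bucket and directory names below are placeholders.

```sh
# List the archived files under the target directory,
# with sizes in human-readable form (names are example values).
aws s3 ls s3://my-ta-archive/data_archive_test/ --recursive --human-readable
```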

# 1.2.2 HDFS Method

# 1.2.2.1 Environment Preparation

  1. Prepare an HDFS environment whose network interworks with the TA cluster; connectivity can be checked as sketched below.
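
Before starting the job, it can be worth confirming from a TA node that the target HDFS is reachable. A minimal check, with a placeholder NameNode URL:

```sh
# List the root of the target HDFS from a TA cluster node;
# the NameNode URL is a placeholder.
hdfs dfs -ls hdfs://archive-namenode:8020/
```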
# 1.2.2.2 Command Sample

```
[ta@ta1 ~]$ ta-tool data_archive start
Please enter the JobId for this job (leave blank to generate one automatically)>
------------------------------------------------------------
Please enter the project appid that needs to be archived> 5487************a9bb
------------------------------------------------------------
Please enter the start time for project archiving: YYYY-MM-DD > 2018-01-01
------------------------------------------------------------
Please enter the end time of project archiving: YYYY-MM-DD > 2018-12-31
------------------------------------------------------------
Please enter the event type for the project archive (optional)>
------------------------------------------------------------
Please enter the type of archive storage: hdfs or rsync or s3 > hdfs
------------------------------------------------------------
Please enter the HDFS URL address for the project archive> hdfs-nm-url
------------------------------------------------------------
Please enter the HDFS user name for the project archive> hdfsUserName
------------------------------------------------------------
Please enter the target directory for the project archive> hdfs******test
------------------------------------------------------------
```

# 1.2.2.3 Step Description

  1. Enter the jobid, which can be customized or left blank to be generated in the background. A custom jobid makes it easier to rerun the task with the retry command if it fails.
  2. Enter the project appid.
  3. Enter the start date (it must fall outside the most recent month).
  4. Enter the end date (it must also fall outside the most recent month).
  5. Optionally enter an event type to archive only that event type.
  6. For the archive storage type, select hdfs.
  7. Enter the HDFS address of the write side; if the port is the default one, filling in just the hostname is enough.
  8. Enter the HDFS user name of the write side.
  9. Enter the target directory for the archive. An absolute path is recommended; otherwise the data will be stored under /user/<HDFS user name>/<target directory>/. The result can be inspected as sketched below.
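
Once the job finishes, the archived files can be inspected from the HDFS side; the path below is a placeholder standing in for the target directory from step 9.

```sh
# Inspect the archive directory and report its total size
hdfs dfs -ls /archive/hdfs_test
hdfs dfs -du -h /archive/hdfs_test
```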

# 1.2.3 rsync Method

# 1.2.3.1 Environment Preparation

  1. Set up the server side in rsync daemon mode, and copy the password file to the node in the TA cluster where the command will be run; a minimal server-side sketch follows.
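
A minimal sketch of this preparation, assuming a hypothetical module name, path, and password; the module name and password file locations must match what is entered in the steps below.

```sh
# On the archive server: define a module in /etc/rsyncd.conf
# (module name, path, and user are example values).
cat >> /etc/rsyncd.conf <<'EOF'
[ta_archive]
    path = /data/ta_archive
    read only = false
    auth users = rsyncUser
    secrets file = /etc/rsyncd.secrets
EOF

# Server-side secrets file: one "user:password" line, mode 600.
echo 'rsyncUser:examplePassword' > /etc/rsyncd.secrets
chmod 600 /etc/rsyncd.secrets

# On the TA node where ta-tool runs: the client password file
# contains only the password and must also be mode 600.
echo 'examplePassword' > /home/ta/rsync.pass
chmod 600 /home/ta/rsync.pass
```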
# 1.2.3.2 Command Sample

```
[ta@ta1 ~]$ ta-tool data_archive start
Please enter the JobId for this job (leave blank to generate one automatically)>
------------------------------------------------------------
Please enter the project appid that needs to be archived> 548*****************9bb
------------------------------------------------------------
Please enter the start time for project archiving: YYYY-MM-DD > 2018-01-01
------------------------------------------------------------
Please enter the end time of project archiving: YYYY-MM-DD > 2018-12-31
------------------------------------------------------------
Please enter the event type for the project archive (optional)>
------------------------------------------------------------
Please enter the type of archive storage: hdfs or rsync or s3 > rsync
------------------------------------------------------------
Please enter the target RSYNC server IP address> rsyncIp
------------------------------------------------------------
Please enter the target RSYNC server port> rsyncPort
------------------------------------------------------------
Please enter the target RSYNC server user name> rsyncUser
------------------------------------------------------------
Please enter the location of the target RSYNC server password file> passwordFilePath
------------------------------------------------------------
Please enter the target RSYNC server module name> modelName
------------------------------------------------------------
sending incremental file list
/tmp/
/tmp/d41d8c*****ecf8427e.data

sent 99 bytes  received 15 bytes  228.00 bytes/sec
total size is 11  speedup is 0.10 (DRY RUN)
Please enter the target directory for the project archive> rsync******test_dir
```

# 1.2.3.3 Step Description

  1. Enter the jobid, which can be customized or left blank to be generated in the background. A custom jobid makes it easier to rerun the task with the retry command if it fails.
  2. Enter the project appid.
  3. Enter the start date (it must fall outside the most recent month).
  4. Enter the end date (it must also fall outside the most recent month).
  5. Optionally enter an event type to archive only that event type.
  6. For the archive storage type, select rsync.
  7. Enter the IP address of the rsync server side.
  8. Enter the rsync server port.
  9. Enter the rsync user name.
  10. Enter the location of the rsync password file; place it in a directory on the node and make sure its permissions are chmod 600.
  11. Enter the rsync module name (this step uses the information entered so far to verify that rsync is available, as the DRY RUN output in the sample shows).
  12. Enter the target directory for the archive. The daemon can also be probed manually, as sketched below.
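
If the job reports a connection problem, the daemon can be probed manually with the same parameters; the values below are the placeholders from the sample above.

```sh
# List the module contents through the rsync daemon, using the
# client password file (all values are placeholders from above).
rsync --list-only --port=rsyncPort \
    --password-file=passwordFilePath \
    rsyncUser@rsyncIp::modelName/
```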

# II. Data Backfill Function (ta_data_reload)

The data backfill function imports previously archived data back into the TA cluster so that it can be used again, typically when viewing trends across several years.

Make sure you have enough disk space before importing; a quick check is sketched below.
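
As a sanity check, compare the free space on the TA data disks with the size of the archive to be imported; the mount point below is a placeholder.

```sh
# Show free space on the TA data disk before backfilling
# (mount point is an example value).
df -h /data
```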

# 2.1 Backfill Commands

```sh
# start
ta-tool data_reload start

# stop
ta-tool data_reload stop

# retry
ta-tool data_reload retry -jobid *******
```

# 2.2 Backfill Methods

# 2.2.1 S3 Method

# 2.2.1.1 Environment Preparation

  1. Apply for the Amazon S3 service
  2. Create a bucket for archiving; the bucket's region should be the same as that of the TA cluster servers
  3. Create an access key for accessing the bucket
# 2.2.1.2 Command Sample

```
[ta@ta1 log]$ ta-tool data_reload start
Please enter the JobId for this job (leave blank to generate one automatically)>
------------------------------------------------------------
Please enter the project appid that needs to be archived> 5487f6************a9bb
------------------------------------------------------------
Please enter the start time for project archiving: YYYY-MM-DD > 2018-01-01
------------------------------------------------------------
Please enter the end time of project archiving: YYYY-MM-DD > 2018-12-31
------------------------------------------------------------
Please enter the event type for the project archive (optional)>
------------------------------------------------------------
Please enter the type of archive storage: hdfs or rsync or s3 > s3
------------------------------------------------------------
Please enter S3 AccessKeyID> AK***********3
------------------------------------------------------------
Please enter S3 SecretAccessKey> J23w************b
------------------------------------------------------------
Please enter S3 region code> cn*****-1
------------------------------------------------------------
Please enter S3 bucket name> ta*****ve
------------------------------------------------------------
Please enter the target directory for the project archive> data*******t_1
------------------------------------------------------------
```

# 2.2.1.3 Step Description

  1. Enter the jobid, which can be customized or left blank to be generated in the background. A custom jobid makes it easier to rerun the task with the retry command if it fails.
  2. Enter the project appid.
  3. Enter the start date (it must fall outside the most recent month).
  4. Enter the end date (it must also fall outside the most recent month).
  5. Optionally enter an event type to backfill only that event type.
  6. For the archive storage type, select s3.
  7. Enter the AccessKeyID for S3.
  8. Enter the SecretAccessKey (managed in the AWS IAM service).
  9. Enter the region code of the bucket.
  10. Enter the bucket name.
  11. Select the storage class (STANDARD by default). If the data was archived with the GLACIER or DEEP_ARCHIVE storage class, perform the restore (thaw) operation in S3 in advance; otherwise the data cannot be pulled. A sketch of the restore call follows the note below.
  12. Enter the target directory of the archive (the directory under the target bucket where the archived data was placed).

Note: When entering parameters, make sure the bucket name and directory path are consistent with those used during archiving.
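
If the archive used the GLACIER or DEEP_ARCHIVE storage class (see step 11), each object must be restored before the backfill can read it. A minimal sketch with the AWS CLI, using placeholder bucket and key values; restoring an entire directory means repeating the call for every key under it.

```sh
# Request a temporary restore (thaw) of one archived object;
# bucket, key, and restore duration are example values.
aws s3api restore-object \
    --bucket my-ta-archive \
    --key data_archive_test/part-00000.data \
    --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Standard"}}'

# Check whether the restore has completed
aws s3api head-object --bucket my-ta-archive \
    --key data_archive_test/part-00000.data
```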

# 2.2.2 HDFS Method

# 2.2.2.1 Environment Preparation

  1. Prepare an HDFS environment whose network interworks with the TA cluster.
# 2.2.2.2 Command Sample

```
[ta@ta1 log]$ ta-tool data_reload start
Please enter the JobId for this job (leave blank to generate one automatically)>
------------------------------------------------------------
Please enter the project appid that needs to be archived> 5487*******************9bb
------------------------------------------------------------
Please enter the start time for project archiving: YYYY-MM-DD > 2018-01-01
------------------------------------------------------------
Please enter the end time of project archiving: YYYY-MM-DD > 2018-12-31
------------------------------------------------------------
Please enter the event type for the project archive (optional)>
------------------------------------------------------------
Please enter the type of archive storage: hdfs or rsync or s3 > hdfs
------------------------------------------------------------
Please enter the HDFS URL address for the project archive> hdfs-nm-url
------------------------------------------------------------
Please enter the target directory for the project archive> hdfs******test
------------------------------------------------------------
```

# 2.2.2.3 Step Description

  1. Enter the jobid, which can be customized or left blank to be generated in the background. A custom jobid makes it easier to rerun the task with the retry command if it fails.
  2. Enter the project appid.
  3. Enter the start date (it must fall outside the most recent month).
  4. Enter the end date (it must also fall outside the most recent month).
  5. Optionally enter an event type to backfill only that event type.
  6. For the archive storage type, select hdfs.
  7. Enter the HDFS address of the write side; if the port is the default one, filling in just the hostname is enough.
  8. Enter the HDFS user name of the write side.
  9. Enter the target directory of the archive.

Note: When entering parameters, make sure the directory path is consistent with the one used during archiving.

# 2.2.3 rsync Method

# 2.2.3.1 Environment Preparation

  1. Set up the server side in rsync daemon mode, and copy the password file to the node in the TA cluster where the command will be run (see the server-side sketch in section 1.2.3.1).
# 2.2.3.2 Command Sample

```
[ta@ta1 log]$ ta-tool data_reload start
Please enter the JobId for this job (leave blank to generate one automatically)>
------------------------------------------------------------
Please enter the project appid that needs to be archived> 54****************9bb
------------------------------------------------------------
Please enter the start time for project archiving: YYYY-MM-DD > 2018-01-01
------------------------------------------------------------
Please enter the end time of project archiving: YYYY-MM-DD > 2018-12-31
------------------------------------------------------------
Please enter the event type for the project archive (optional)>
------------------------------------------------------------
Please enter the type of archive storage: hdfs or rsync or s3 > rsync
------------------------------------------------------------
Please enter the target RSYNC server IP address> rsyncIp
------------------------------------------------------------
Please enter the target RSYNC server port> rsyncPort
------------------------------------------------------------
Please enter the target RSYNC server user name> rsyncUser
------------------------------------------------------------
Please enter the location of the target RSYNC server password file> passwordFilePath
------------------------------------------------------------
Please enter the target RSYNC server module name> modelName
------------------------------------------------------------
sending incremental file list
/tmp/
/tmp/d41d8cd98f00b204e9800998ecf8427e.data
sent 99 bytes  received 15 bytes  20.73 bytes/sec
total size is 11  speedup is 0.10 (DRY RUN)
Please enter the target directory for the project archive> rsync******test_dir
```

# 2.2.3.3 Step Description

  1. Enter the jobid, which can be customized or left blank to be generated in the background. A custom jobid makes it easier to rerun the task with the retry command if it fails.
  2. Enter the project appid.
  3. Enter the start date (it must fall outside the most recent month).
  4. Enter the end date (it must also fall outside the most recent month).
  5. Optionally enter an event type to backfill only that event type.
  6. For the archive storage type, select rsync.
  7. Enter the IP address of the rsync server side.
  8. Enter the rsync server port.
  9. Enter the rsync user name.
  10. Enter the location of the rsync password file; place it in a directory on the node and make sure its permissions are chmod 600.
  11. Enter the rsync module name (this step uses the information entered so far to verify that rsync is available, as the DRY RUN output in the sample shows).
  12. Enter the target directory of the archive.

Note: When entering parameters, make sure the directory path is consistent with the one used during archiving.