# Data Deduplication Tool
# Introduction
The data deduplication tool removes duplicate event data from the TA system. Deduplication can be scoped by time period and by event type.
Because deduplication consumes cluster computing resources, use it only to clean up abnormal data, not as a routine job. Please use this tool with care.
# Instructions for Use
The data deduplication tool is available only to users of the private deployment service. Log in as root to any server in the private cluster and run `su - ta`, then run `ta-tool dupevent_del` to enter the data deduplication tool interface.
# 2.1 Fill in the appid of the project to be processed
You can look up a project's appid on the project management page in the TA backend.
# 2.2 Confirm project name
After you enter the appid, the tool displays the name of the project to be deduplicated. Enter 'y' to confirm or 'n' to cancel the operation.
# 2.3 Fill in the name of the event that needs to be deduplicated
Next, enter the name of each event to be deduplicated. Use the event name that is sent as the key when data is transmitted, not the display name; you can look it up on the metadata management page. Separate multiple events with ",". After you enter them, the tool displays the names of the events to be deduplicated.
If you press Enter without typing anything, all event data will be deduplicated.
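
The input format described above can be prepared ahead of time. The following is a minimal sketch, not part of the tool: `build_event_input` is a hypothetical helper that joins event names with "," for pasting at the prompt. Trimming whitespace and dropping repeated names are assumptions made for convenience; how the tool itself treats spaces is not documented here.

```python
def build_event_input(event_names):
    """Join event names into the comma-separated string entered at the prompt.

    An empty list yields an empty string, which corresponds to pressing
    Enter without typing anything (all events are deduplicated).
    """
    cleaned = []
    for name in event_names:
        name = name.strip()
        if not name:
            continue  # skip blank entries
        if "," in name:
            # "," is the separator, so it cannot appear inside a name
            raise ValueError(f"event name must not contain ',': {name!r}")
        if name not in cleaned:
            cleaned.append(name)  # keep the first occurrence, preserve order
    return ",".join(cleaned)


# Hypothetical event names, for illustration only.
print(build_event_input(["login", " pay ", "login"]))
```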
# 2.4 Fill in column names ignored in deduplication logic
Next, enter the names of the columns to ignore in the deduplication logic. TA's own built-in fields are already excluded from the duplicate comparison by default; for example, the "#server_time" and "#kafka_offset" fields do not take part in it. To ignore multiple fields, separate them with ",". After you enter them, the tool displays the names of the fields to be ignored.
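
To illustrate what ignoring a column means, here is a conceptual sketch in Python. It is not the tool's implementation, which runs on the cluster and is not described in this document. It assumes that two events count as duplicates when all fields that are not ignored are equal, and that the first occurrence is kept. The function `dedupe_events` and the field names `event`, `user` and `trace_id` are hypothetical; only "#server_time" and "#kafka_offset" come from the text above.

```python
import json

# Fields excluded by default. Only these two are named in the text above;
# the real default list may be longer.
DEFAULT_IGNORED = {"#server_time", "#kafka_offset"}


def dedupe_events(events, extra_ignored=()):
    """Return events with duplicates removed, keeping the first occurrence.

    Two events are treated as duplicates when every field outside the
    ignored set has the same value (an assumption for illustration).
    """
    ignored = DEFAULT_IGNORED | set(extra_ignored)
    seen = set()
    unique = []
    for event in events:
        # Comparison key: all non-ignored fields, serialized in a stable order.
        key = json.dumps(
            {k: v for k, v in event.items() if k not in ignored},
            sort_keys=True,
            default=str,
        )
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


# Hypothetical sample: the same login received twice with different
# server-side metadata and a different trace_id.
sample = [
    {"event": "login", "user": "u1", "#server_time": "2023-01-01 00:00:01",
     "#kafka_offset": 10, "trace_id": "a"},
    {"event": "login", "user": "u1", "#server_time": "2023-01-01 00:00:02",
     "#kafka_offset": 11, "trace_id": "b"},
]
print(len(dedupe_events(sample)))                # 2: trace_id still differs
print(len(dedupe_events(sample, ["trace_id"])))  # 1: trace_id is ignored
```

In this sketch the two rows only collapse into one after `trace_id` is added to the ignored columns, which is the situation this step is meant to handle.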
# 2.5 Fill in the time range for deduplicating event data
Next, enter the time period for deduplicating data. The time granularity is "days". Enter each date in the yyyy-MM-dd format. This field is required.
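
The date format can be checked before the tool is started. The sketch below is a hypothetical pre-check, not part of the tool: it assumes the period is given as a start date and an end date, and the names `parse_day` and `validate_range` are illustrative. In Python, yyyy-MM-dd corresponds to the `strptime` pattern `%Y-%m-%d`.

```python
from datetime import datetime


def parse_day(text):
    """Parse a yyyy-MM-dd string into a date, raising ValueError if malformed."""
    text = text.strip()
    day = datetime.strptime(text, "%Y-%m-%d").date()
    # strptime also accepts non-padded values such as "2023-1-5",
    # so require an exact round trip to enforce the zero-padded form.
    if day.isoformat() != text:
        raise ValueError(f"date must be zero-padded yyyy-MM-dd: {text!r}")
    return day


def validate_range(start_text, end_text):
    """Return (start, end) as dates, raising ValueError if start is after end."""
    start, end = parse_day(start_text), parse_day(end_text)
    if start > end:
        raise ValueError("start date must not be later than end date")
    return start, end


start, end = validate_range("2023-01-01", "2023-01-03")
print(start, end)
```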
# 2.6 Final confirmation
Finally, before any data is deduplicated, the tool asks for a final confirmation and shows the project name, the event names and the time period for deduplication. Enter 'y' to start deduplicating data. If anything is wrong, enter 'n' to exit the tool, then start it again and re-enter the values.
# 2.7 Complete execution process
After confirmation, the data is deduplicated. A screenshot of the whole deduplication process is shown in the following figure:
# Precautions
1. Before using the data deduplication tool, identify the cause of the data duplication and make sure it has stopped. If duplicate data is still entering the system while the tool runs, the deduplication result cannot be guaranteed.
2. Deduplication consumes cluster computing resources and is not recommended for frequent use.