# Data Deduplication Tool

# I. Introduction

The data deduplication tool is mainly used to deduplicate the duplicate event data in the TA system, and supports the deduplication of event data according to the time period and event type.

Since deduplication will take up cluster computing resources, it is recommended to deduplicate abnormal data only, not to deduplicate data frequently. Please use this tool carefully.

# II. Instructions for Use

The data duplication tool is only available to the users of privatized services. root logs into any server of the privatized cluster and execute su - ta

Then executeta-tool dupevent_del to enter the data deduplication tool interface.

# 2.1 Fill in the appid for the item to be processed

The appid of the project can be queried on the project management page in the TA background.

# 2.2 Confirm project name

After entering, the project name of the project to be deduplicated will be prompted. Enter 'y' to confirm, and enter 'n' to cancel the operation.

# 2.3 Fill in the name of the event that needs to be deduplicated

Next, you need to enter the event name of the event to be deleted. The event name entered here is the key value when transmitting data, not the display name. You can query the event name on the metadata management page, and split multiple events to be deduplicated by ",". After entering, you will be prompted with the name of the event to be deduplicated.

If you do not enter any characters and directly press Enter, all event data will be deduplicated:

# 2.4 Fill in column names ignored in deduplication logic

Next, you need to enter the column names ignored in deduplication logic, the ta self-defined fields have been removed to participate in repeating logic judgment by default. For example, "#server_time" and "#kafka_offset" fields do not participate in repeating judgment logic. If ignored, multiple fields are split by "," . After entering, you will be prompted with the name of field to be ignored.

# 2.5 Fill in the time range for dedupliating event data

Next, you need to enter the time period for deduplicating data. The optional time granularity is "days". Please enter the date in the yyyy-MM-dd format. This field is required.

# 2.6 Final confirmation

Finally, before deduplicating data, the final confirmation will be made, including the name of the deduplicated item, the name of the deduplicated event and the time period for deduplication. Enter 'y' to start deduplicating data. If there is an error, you can enter 'n' to exit the tool and re-enter:

# 2.7 Complete execution process

After confirmation, the data will be deduplicated, and the screenshot of the whole deduplication process is shown in the following figure:

# III. Precautions

# 1 Before using the data deduplication tool, please confirm the cause of data duplication to avoid duplicate data entering at the same time, and the deduplication effect cannot be guaranteed.

# 2 Deduplication requires cluster computing resources and is not recommended for frequent use.

# 3 If the following screenshot occurs, the cluster is merging data, and it can wait for its own automatic execution. If it is stuck for a long time, you can contact the O&M personnel for troubleshooting.

← Metadata Management Tool Custom Table Data Import Function →