Chapter 7. Mirroring Data with Falcon
You can mirror data between on-premises clusters or between an on-premises HDFS cluster and a cluster in the cloud using Microsoft Azure or Amazon S3.
Prepare to Mirror Data
Mirroring data produces an exact copy of the data and keeps both copies synchronized. You can use Falcon to mirror HDFS directories, Hive tables, and snapshots.
Before creating a mirror, complete the following actions:
- Set permissions to allow read, write, and execute access to the source and target directories.
  - You must be logged in as the owner of the directories.
  - Example: If the source directory were in /user/ambari-qa/falcon, type the following:
      [bash ~]$ su - root
      [root@bash ~]$ su - ambari-qa
      [ambari-qa@bash ~]$ hadoop fs -chmod 755 /user/ambari-qa/falcon/
- Create the source and target cluster entity definitions, if they do not exist.
  - See "Creating a Cluster Entity Definition" in Creating Falcon Entity Definitions for more information.
- For snapshot mirroring, you must also enable the snapshot capability on the source and target directories.
  - You must be logged in as the HDFS Service user and the source and target directories must be owned by the user submitting the job.
  - For example:
      [ambari-qa@bash ~]$ su - hdfs
      ## Run the following command on the target cluster
      [hdfs@bash ~]$ hdfs dfsadmin -allowSnapshot /apps/falcon/snapshots/target
      ## Run the following command on the source cluster
      [hdfs@bash ~]$ hdfs dfsadmin -allowSnapshot /apps/falcon/snapshots/source
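The mode 755 in the chmod example gives the directory owner read, write, and execute access, while group members and other users get read and execute access only. The following is a small local sketch, not an HDFS or Falcon command, that renders an octal mode in rwx notation. It assumes HDFS permission bits follow the POSIX-style layout, and the helper name describe_mode is made up for this illustration.

```python
import stat


def describe_mode(octal_text):
    """Render an octal permission string such as "755" as a directory
    mode string in rwx notation (hypothetical helper for illustration)."""
    mode = int(octal_text, 8)
    # S_IFDIR marks the entry as a directory so the string starts with "d".
    return stat.filemode(stat.S_IFDIR | mode)


if __name__ == "__main__":
    print(describe_mode("755"))  # drwxr-xr-x
```

If the user who submits the mirror job is not the directory owner, a mode such as 755 would not give that user write access, so it may be worth checking the rendered string before creating the mirror.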
Mirror File System Data Using the Web UI
You can use the Falcon web UI to quickly define and start a mirror job on HDFS.
Prerequisites
Your environment must meet the HDP versioning requirements described in "Replication Between HDP Versions" in Creating Falcon Entity Definitions.
Steps
- Ensure that you have set permissions correctly and defined required entities as described in Prepare to Mirror Data.
- At the top of the Falcon web UI page, click Create > Mirror > File System. 
- On the New HDFS Mirror page, specify the values for the following properties:

  Table 7.1. General HDFS Mirror Properties
  - Mirror Name: Name of the mirror job. The naming criteria are as follows:
    - Must be unique
    - Must start with a letter
    - Is case sensitive
    - Can contain a maximum of 40 characters
    - Can include numbers
    - Can use a dash (-) but no other special characters
    - Cannot contain spaces
  - Tags: Enter the key/value pair for metadata tagging to assist in entity search in Falcon. The criteria are as follows:
    - Can contain 1 to 100 characters
    - Can include numbers
    - Can use a dash (-) but no other special characters
    - Cannot contain spaces

  Table 7.2. Source and Target Mirror Properties
  - Source Location: Specify whether the source data is local on HDFS or on Microsoft Azure or Amazon S3 in the cloud. If your target is Azure or S3, you can only use HDFS for the source.
  - Source Cluster: Select an existing cluster entity.
  - Source Path: Enter the path to the source data.
  - Target Location: Specify whether the mirror target is local on HDFS or on Microsoft Azure or Amazon S3 in the cloud. If your target is Azure or S3, you can only use HDFS for the source.
  - Target Cluster: Select an existing cluster entity to serve as target for the mirrored data.
  - Target Path: Enter the path to the directory that will contain the mirrored data.
  - Run job here: Choose whether to execute the job on the source or on the target cluster.
  - Validity Start and End: Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the schedule time and when all the inputs are available. The workflow ends before the specified end time, so there is not a workflow instance at end time. Also known as run duration.
  - Frequency: How often the process is generated. Valid frequency types are minutes, hours, days, and months.
  - Timezone: The timezone is associated with the start and end times. Default timezone is UTC.
  - Send alerts to: A comma-separated list of email addresses to which alerts are sent, in the format name@company.com.

  Table 7.3. Advanced HDFS Mirror Properties
  - Max Maps for DistCp: The maximum number of maps used during replication. This setting impacts performance and throttling. Default is 5.
  - Max Bandwidth (MB): The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB.
  - Retry Policy: Defines how the workflow failures should be handled. Options are Periodic, Exponential Backoff, and Final.
  - Delay: The time period after which a retry attempt is made. For example, an Attempt value of 3 and Delay value of 10 minutes would cause the workflow retry to occur after 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow. Default is 30 minutes.
  - Attempts: How many times the retry policy should be implemented before the job fails. Default is 3.
  - Access Control List: Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).
- Click Next to view a summary of your entity definition. 
- (Optional) Click Preview XML to review or edit the entity definition in XML. 
- After verifying the entity definition, click Save.
  The entity is automatically submitted for verification, but it is not scheduled to run.
- Verify that you successfully created the entity.
  - Type the entity name in the Falcon web UI Search field and press Enter.
  - If the entity name appears in the search results, it was successfully created.
    For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
- Schedule the entity.
  - In the search results, click the checkbox next to an entity name with status of Submitted.
  - Click Schedule.
    After a few seconds a success message displays.
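The naming criteria listed for Mirror Name can be checked before you open the web UI. The sketch below is one possible reading of those criteria, not Falcon's own validation code. It assumes that "letter" means an ASCII letter, and it checks uniqueness against a caller-supplied collection of names because only Falcon knows which entities really exist. The function name mirror_name_problems is hypothetical.

```python
import re
import string

MAX_NAME_LENGTH = 40  # "Can contain a maximum of 40 characters"


def mirror_name_problems(name, existing_names=()):
    """Return a list of rule violations for a proposed mirror name.

    An empty list means the name passes every check in this sketch.
    """
    if not name:
        return ["name is empty"]
    problems = []
    # Assumption: "letter" is interpreted as an ASCII letter.
    if name[0] not in string.ascii_letters:
        problems.append("must start with a letter")
    if len(name) > MAX_NAME_LENGTH:
        problems.append("can contain a maximum of 40 characters")
    if any(ch.isspace() for ch in name):
        problems.append("cannot contain spaces")
    # Anything other than letters, digits, dashes, or whitespace
    # (whitespace is reported separately above) is a special character.
    if re.search(r"[^A-Za-z0-9\s-]", name):
        problems.append("only letters, numbers, and dashes are allowed")
    # Names are case sensitive, so the comparison is exact.
    if name in existing_names:
        problems.append("must be unique")
    return problems
```

For example, mirror_name_problems("hdfs-mirror-01") returns an empty list, while a name such as "my mirror" is reported as containing a space.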
 
Mirror Hive Data Using the Web UI
You can quickly mirror Apache Hive databases or tables between source and target clusters with HiveServer2 endpoints. You can also enable Transparent Data Encryption (TDE) for your mirror.
- Ensure that you have set permissions correctly and defined required entities as described in Prepare to Mirror Data.
- At the top of the Falcon web UI page, click Create > Mirror > Hive. 
- On the New Hive Mirror page, specify the values for the following properties:

  Table 7.4. General Hive Mirror Properties
  - Mirror Name: Name of the mirror job. The naming criteria are as follows:
    - Must be unique
    - Must start with a letter
    - Is case sensitive
    - Can contain 2 to 40 characters
    - Can include numbers
    - Can use a dash (-) but no other special characters
    - Cannot contain spaces
  - Tags: Enter the key/value pair for metadata tagging to assist in entity search in Falcon. The criteria are as follows:
    - Can contain 1 to 100 characters
    - Can include numbers
    - Can use a dash (-) but no other special characters
    - Cannot contain spaces

  Table 7.5. Source and Target Hive Mirror Properties
  - Cluster, Source & Target: Select existing cluster entities, one to serve as source for the mirrored data and one to serve as target for the mirrored data. Cluster entities must be available in Falcon before a mirror job can be created.
  - HiveServer2 Endpoint, Source & Target: Enter the location of data to be mirrored on the source and the location of the mirrored data on the target. The format is hive2://localhost:1000.
  - Hive2 Kerberos Principal, Source & Target: This field is automatically populated with the value of the service principal for the metastore Thrift server. The value is displayed in Ambari at Hive > Config > Advanced > Advanced hive-site > hive.metastore.kerberos.principal and must be unique.
  - Meta Store URI, Source & Target: Used by the metastore client to connect to the remote metastore. The value is displayed in Ambari at Hive > Config > Advanced > General > hive.metastore.uris.
  - Kerberos Principal, Source & Target: The field is automatically populated. Property=dfs.namenode.kerberos.principal and Value=nn/_HOST@EXAMPLE.COM and must be unique.
  - Run job here: Choose whether to execute the job on the source cluster or on the target cluster.
  - I want to copy: Select to copy one or more Hive databases or copy one or more tables from a single database. You must identify the specific databases and tables to be copied.
  - Validity Start and End: Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the schedule time and when all the inputs are available. The workflow ends before the specified end time, so there is not a workflow instance at end time. Also known as run duration.
  - Frequency: Determines how often the process is generated. Valid frequency types are minutes, hours, days, and months.
  - Timezone: The timezone is associated with the validity start and end times. Default timezone is UTC.
  - Send alerts to: A comma-separated list of email addresses to which alerts are sent. The format is name@xyz.com.

  Table 7.6. Advanced Hive Mirror Properties
  - TDE Encryption: Enables encryption of data at rest. See "Enabling Transparent Data Encryption" in Using Advanced Features for more information.
  - Max Maps for DistCp: The maximum number of maps used during replication. This setting impacts performance and throttling. Default is 5.
  - Max Bandwidth (MB): The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB.
  - Retry Policy: Defines how the workflow failures should be handled. Options are Periodic, Exponential Backoff, and Final.
  - Delay: The time period after which a retry attempt is made. For example, an Attempt value of 3 and Delay value of 10 minutes would cause the workflow retry to occur after 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow. Default is 30 minutes.
  - Attempts: How many times the retry policy should be implemented before the job fails. Default is 3.
  - Access Control List: Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).
- Click Next to view a summary of your entity definition. 
- (Optional) Click Preview XML to review or edit the entity definition in XML. 
- After verifying the entity definition, click Save.
  The entity is automatically submitted for verification, but it is not scheduled to run.
- Verify that you successfully created the entity.
  - Type the entity name in the Falcon web UI Search field and press Enter.
  - If the entity name appears in the search results, it was successfully created.
    For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
- Schedule the entity to run the mirror job.
  - In the search results, click the checkbox next to an entity name with status of Submitted.
  - Click Schedule.
    After a few seconds a success message displays.
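The timing properties described for the mirror job can be worked through with simple date arithmetic. The sketch below is an illustration under stated assumptions, not Falcon code. It treats the validity window as including the start time and excluding the end time, which matches the statement that there is no workflow instance at end time. It models only the retry example given for the Delay property (retries at 10, 20, and 30 minutes), not any other retry policy. It leaves out the months frequency type because calendar months do not have a fixed length. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Frequency types with a fixed length; "months" is intentionally left out.
_UNIT_TO_DELTA = {
    "minutes": lambda n: timedelta(minutes=n),
    "hours": lambda n: timedelta(hours=n),
    "days": lambda n: timedelta(days=n),
}


def instance_times(start, end, unit, count):
    """List nominal instance times from start up to, but not including, end."""
    if count <= 0:
        raise ValueError("count must be positive")
    step = _UNIT_TO_DELTA[unit](count)
    times = []
    current = start
    while current < end:  # strict comparison: no instance at the end time
        times.append(current)
        current += step
    return times


def periodic_retry_times(workflow_start, delay_minutes, attempts):
    """Retry times following the Delay example: start + 1x, 2x, 3x the delay."""
    return [
        workflow_start + timedelta(minutes=delay_minutes * k)
        for k in range(1, attempts + 1)
    ]


if __name__ == "__main__":
    start = datetime(2016, 1, 1, 0, 0, tzinfo=timezone.utc)
    end = datetime(2016, 1, 1, 3, 0, tzinfo=timezone.utc)
    for t in instance_times(start, end, "hours", 1):
        print("instance:", t.isoformat())
    for t in periodic_retry_times(start, 10, 3):
        print("retry:", t.isoformat())
```

With an hourly frequency and a three-hour window, the sketch yields instances at 00:00, 01:00, and 02:00 UTC, and none at 03:00.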
 
Mirror Data Using Snapshots
Snapshot-based mirroring is an efficient data backup method because only updated content is actually transferred during the mirror job. You can mirror snapshots from a single source directory to a single target directory. The destination directory is the target for the backup job.
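To see why transferring only updated content saves work, consider a purely conceptual sketch that compares two listings of the same directory taken at different times. This is not how HDFS snapshots or Falcon compute differences internally. The dictionaries, the fingerprint values, and the function name plan_incremental_copy are all assumptions made for illustration.

```python
def plan_incremental_copy(previous, current):
    """Compare two listings that map a relative path to a content fingerprint.

    Returns (to_copy, to_delete): paths that are new or changed since the
    previous listing, and paths that no longer exist in the current one.
    """
    to_copy = sorted(
        path for path, fingerprint in current.items()
        if previous.get(path) != fingerprint
    )
    to_delete = sorted(path for path in previous if path not in current)
    return to_copy, to_delete


if __name__ == "__main__":
    previous = {"a.csv": "v1", "b.csv": "v1", "c.csv": "v1"}
    current = {"a.csv": "v1", "b.csv": "v2", "d.csv": "v1"}
    print(plan_incremental_copy(previous, current))
    # (['b.csv', 'd.csv'], ['c.csv'])
```

In this example the unchanged file a.csv is never transferred, which is the effect the snapshot-based approach aims for.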
Prerequisites
- Source and target clusters must run Hadoop 2.7.0 or higher.
  Falcon does not validate versions.
- Source and target clusters should both be either secure or unsecure.
  This is a recommendation, not a requirement.
- Source and target clusters must have snapshot capability enabled (the default is "enabled"). 
- The user submitting the mirror job must have access permissions on both the source and target clusters. 
To mirror snapshot data with the Falcon web UI:
- Ensure that you have set permissions correctly, enabled snapshot mirroring, and defined required entities as described in Prepare to Mirror Data.
- At the top of the Falcon web UI page, click Create > Mirror > Snapshot. 
- On the New Snapshot Based Mirror page, specify the values for the following properties:

  Table 7.7. Source and Target Snapshot Mirror Properties
  - Source, Cluster: Select an existing source cluster entity. At least one cluster entity must be available in Falcon.
  - Target, Cluster: Select an existing target cluster entity. At least one cluster entity must be available in Falcon.
  - Source, Directory: Enter the path to the source data.
  - Source, Delete Snapshot After: Specify the time period after which the mirrored snapshots are deleted from the source cluster. Snapshots are retained past this date if the number of snapshots is less than the Keep Last setting.
  - Source, Keep Last: Specify the number of snapshots to retain on the source cluster, even if the delete time has been reached. Upon reaching the number specified, the oldest snapshot is deleted when the next job is run.
  - Target, Directory: Enter the path to the location on the target cluster in which the snapshot is stored.
  - Target, Delete Snapshot After: Specify the time period after which the mirrored snapshots are deleted from the target cluster. Snapshots are retained past this date if the number of snapshots is less than the Keep Last setting.
  - Target, Keep Last: Specify the number of snapshots to retain on the target cluster, even if the delete time has been reached. Upon reaching the number specified, the oldest snapshot is deleted when the next job is run.
  - Run job here: Choose whether to execute the job on the source or on the target cluster.
  - Run Duration Start and End: Combined with the frequency value to determine the window of time in which a Falcon mirror job can execute. The workflow job starts executing after the schedule time and when all the inputs are available. The workflow ends before the specified end time, so there is not a workflow instance at end time. Also known as validity time.
  - Frequency: How often the process is generated. Valid frequency types are minutes, hours, days, and months.
  - Timezone: Default timezone is UTC.

  Table 7.8. Advanced Snapshot Mirror Properties
  - TDE Encryption: Enable to encrypt data at rest. See "Enabling Transparent Data Encryption" in Using Advanced Features for more information.
  - Retry Policy: Defines how the workflow failures should be handled. Options are Periodic, Exponential Backoff, and Final.
  - Delay: The time period after which a retry attempt is made. For example, an Attempts value of 3 and Delay value of 10 minutes would cause the workflow retry to occur after 10 minutes, 20 minutes, and 30 minutes after the start time of the workflow. Default is 30 minutes.
  - Attempts: How many times the retry policy should be implemented before the job fails. Default is 3.
  - Max Maps: The maximum number of maps used during DistCp replication. This setting impacts performance and throttling. Default is 5.
  - Max Bandwidth (MB): The bandwidth in MB/s used by each mapper during replication. This setting impacts performance and throttling. Default is 100 MB.
  - Send alerts to: A comma-separated list of email addresses to which alerts are sent, in the format name@xyz.com.
  - Access Control List: Specify the HDFS owner, group, and access permissions for the cluster. Default permissions are 755 (rwx/r-x/r-x).
- Click Next to view a summary of your entity definition. 
- (Optional) Click Preview XML to review or edit the entity definition in XML. 
- After verifying the entity definition, click Save.
  The entity is automatically submitted for verification, but it is not scheduled to run.
- Verify that you successfully created the entity.
  - Type the entity name in the Falcon web UI Search field and press Enter.
  - If the entity name appears in the search results, it was successfully created.
    For more information about the search function, see "Locating and Managing Entities" in Using Advanced Falcon Features.
- Schedule the entity.
  - In the search results, click the checkbox next to an entity name with status of Submitted.
  - Click Schedule.
    After a few seconds a success message displays.
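The Delete Snapshot After and Keep Last settings interact, so it can help to reason about them together. The sketch below shows one reading of the property descriptions: the newest Keep Last snapshots are always retained, and any older snapshot is removed only once it exceeds the Delete Snapshot After period. It is an interpretation for illustration, not Falcon's eviction code, and the function name snapshots_to_delete is hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshot_times, now, delete_after, keep_last):
    """Return the snapshot creation times that would be removed, oldest first.

    snapshot_times: iterable of datetime values, one per snapshot
    delete_after:   timedelta corresponding to "Delete Snapshot After"
    keep_last:      number of newest snapshots that are always retained
    """
    if keep_last < 0:
        raise ValueError("keep_last must not be negative")
    newest_first = sorted(snapshot_times, reverse=True)
    # Snapshots beyond the newest keep_last are eligible for deletion.
    eligible = newest_first[keep_last:]
    # Of those, only snapshots older than the retention period are removed.
    return sorted(t for t in eligible if now - t > delete_after)


if __name__ == "__main__":
    now = datetime(2016, 1, 10)
    daily = [datetime(2016, 1, day) for day in range(1, 10)]
    for t in snapshots_to_delete(daily, now, timedelta(days=3), keep_last=2):
        print("delete snapshot from", t.date())
```

Under this reading, a directory with only three snapshots and a Keep Last value of 3 loses nothing, however old the snapshots are.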
 
Mirror File System Data Using the CLI
See the Apache Falcon website for information about using the CLI to mirror data.

