Create the Feed Entity
The feed entity defines the data set that Falcon replicates. Reference your cluster entities to determine which clusters the feed uses.
- Create an XML file for the Feed entity.
    <?xml version="1.0"?>
- Describe the feed.
    <?xml version="1.0"?>
    <feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
    </feed>
- Specify the frequency of the feed.
    <?xml version="1.0"?>
    <feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
      <!-- Feed run frequency -->
      <frequency>hours(1)</frequency>
    </feed>
- Choose a retention policy for the data to remain on the cluster. Retention is set per cluster, with a retention element inside each cluster element; delete is currently the only available action. The feed itself is unchanged from the previous step. For example, to keep data for seven days and then delete it:
    <retention limit="days(7)" action="delete"/>
- (Optional) Set a late-arrival cut-off policy. The supported policies for late data handling are backoff, exp-backoff (default), and final. For example, to set a late cut-off of 6 hours:
    <?xml version="1.0"?>
    <feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
      <!-- Feed run frequency -->
      <frequency>hours(1)</frequency>
      <!-- Late arrival cut-off -->
      <late-arrival cut-off="hours(6)"/>
    </feed>
- Define your source and target clusters for the feed. For example, for two clusters, MyDataCenter and MyDataCenter-secondary:
    <?xml version="1.0"?>
    <feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
      <!-- Feed run frequency -->
      <frequency>hours(1)</frequency>
      <!-- Late arrival cut-off -->
      <late-arrival cut-off="hours(6)"/>
      <!-- Target clusters for retention and replication -->
      <clusters>
        <cluster name="$MyDataCenter" type="source">
          <validity start="$date" end="$date"/>
          <!-- Currently delete is the only action available -->
          <retention limit="days($n)" action="delete"/>
        </cluster>
        <cluster name="$MyDataCenter-secondary" type="target">
          <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
          <retention limit="days(7)" action="delete"/>
          <locations>
            <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
          </locations>
        </cluster>
      </clusters>
    </feed>
- Specify the HDFS weblogs path locations or Hive table locations. For example, to specify the HDFS weblogs location:
    <?xml version="1.0"?>
    <feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
      <!-- Feed run frequency -->
      <frequency>hours(1)</frequency>
      <!-- Late arrival cut-off -->
      <late-arrival cut-off="hours(6)"/>
      <!-- Target clusters for retention and replication -->
      <clusters>
        <cluster name="$MyDataCenter" type="source">
          <validity start="$date" end="$date"/>
          <!-- Currently delete is the only action available -->
          <retention limit="days($n)" action="delete"/>
        </cluster>
        <cluster name="$MyDataCenter-secondary" type="target">
          <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
          <retention limit="days(7)" action="delete"/>
          <locations>
            <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
          </locations>
        </cluster>
      </clusters>
      <!-- Global location across clusters - HDFS paths or Hive tables -->
      <locations>
        <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </feed>
- Specify HDFS ACLs. Set the owner, group, and level of permissions for HDFS. For example:
    <?xml version="1.0"?>
    <feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
      <!-- Feed run frequency -->
      <frequency>hours(1)</frequency>
      <!-- Late arrival cut-off -->
      <late-arrival cut-off="hours(6)"/>
      <!-- Target clusters for retention and replication -->
      <clusters>
        <cluster name="$MyDataCenter" type="source">
          <validity start="$date" end="$date"/>
          <!-- Currently delete is the only action available -->
          <retention limit="days($n)" action="delete"/>
        </cluster>
        <cluster name="$MyDataCenter-secondary" type="target">
          <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
          <retention limit="days(7)" action="delete"/>
          <locations>
            <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
          </locations>
        </cluster>
      </clusters>
      <!-- Global location across clusters - HDFS paths or Hive tables -->
      <locations>
        <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
      <!-- Required for HDFS -->
      <ACL owner="hdfs" group="users" permission="0755"/>
    </feed>
- Specify the location of the schema file for the feed and the schema provider, such as protobuf or thrift. For example:
    <?xml version="1.0"?>
    <feed description="$rawInputFeed" name="testFeed" xmlns="uri:falcon:feed:0.1">
      <!-- Feed run frequency -->
      <frequency>hours(1)</frequency>
      <!-- Late arrival cut-off -->
      <late-arrival cut-off="hours(6)"/>
      <!-- Target clusters for retention and replication -->
      <clusters>
        <cluster name="$MyDataCenter" type="source">
          <validity start="$date" end="$date"/>
          <!-- Currently delete is the only action available -->
          <retention limit="days($n)" action="delete"/>
        </cluster>
        <cluster name="$MyDataCenter-secondary" type="target">
          <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
          <retention limit="days(7)" action="delete"/>
          <locations>
            <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
          </locations>
        </cluster>
      </clusters>
      <!-- Global location across clusters - HDFS paths or Hive tables -->
      <locations>
        <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
      <!-- Required for HDFS -->
      <ACL owner="hdfs" group="users" permission="0755"/>
      <schema location="/schema" provider="protobuf"/>
    </feed>
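
The location step mentions Hive table locations but only shows an HDFS path. As a rough sketch, a Hive-backed feed would likely use a table element in place of the global locations element, assuming the catalog URI form catalog:$database:$table#partition-key=partition-value. The database name (default), table name (weblogs), and partition key (ds) below are hypothetical placeholders, not values from this guide.

```xml
<!-- Sketch only: stands in for the global <locations> element in a Hive-backed feed. -->
<!-- Assumed names: database "default", table "weblogs", partition key "ds". -->
<table uri="catalog:default:weblogs#ds=${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
```

Check the exact URI format against the feed schema of your Falcon version before relying on it.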

