Certified Data Engineer Associate v1.0

Page:    1 / 13   
Exam contains 181 questions

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

  • A. trigger("5 seconds")
  • B. trigger()
  • C. trigger(once="5 seconds")
  • D. trigger(processingTime="5 seconds")
  • E. trigger(continuous="5 seconds")


Answer : D
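
As a rough illustration of option D, the sketch below shows where the trigger call sits in a PySpark streaming write. It is not the exam's code block: the function name, target table, and checkpoint path are hypothetical, and `df` is assumed to be a streaming DataFrame (for example, one returned by `spark.readStream.table(...)`).

```python
def start_five_second_stream(df, target_table, checkpoint_path):
    """Start a streaming write that runs one micro-batch every 5 seconds.

    `df` is assumed to be a streaming DataFrame; `target_table` and
    `checkpoint_path` are hypothetical values supplied by the caller.
    """
    return (
        df.writeStream
        .trigger(processingTime="5 seconds")  # fixed-interval micro-batches
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)  # starts the query and returns its handle
    )
```

The other options do not match the requirement: `trigger()` needs exactly one keyword argument, `once` takes a boolean rather than an interval, and `continuous` selects continuous processing instead of micro-batches.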

A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW
What is the expected behavior when a batch of data containing data that violates these constraints is processed?

  • A. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
  • B. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
  • C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
  • D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
  • E. Records that violate the expectation cause the job to fail.


Answer : C
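
The behavior in option C can be pictured with a plain-Python simulation. This is only a sketch of the semantics, not the Delta Live Tables API: the function name, the record layout, and the metric keys are assumptions made for illustration.

```python
from datetime import datetime


def apply_drop_row_expectation(records, cutoff=datetime(2020, 1, 1)):
    """Simulate EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW.

    Returns the rows that reach the target dataset plus a small metrics
    dict standing in for what the pipeline would record in its event log.
    A missing (None) timestamp is treated as a violation, as a NULL
    comparison would be in SQL.
    """
    kept = [
        r for r in records
        if r["timestamp"] is not None and r["timestamp"] > cutoff
    ]
    metrics = {"passed": len(kept), "dropped": len(records) - len(kept)}
    return kept, metrics


sample = [
    {"id": 1, "timestamp": datetime(2021, 5, 1)},    # satisfies the expectation
    {"id": 2, "timestamp": datetime(2019, 12, 31)},  # violates it
    {"id": 3, "timestamp": None},                    # violates it
]
kept_rows, quality_metrics = apply_drop_row_expectation(sample)
```

Only the first record is written to the target; the other two are counted as dropped rather than routed to a quarantine table or failing the update.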

Which of the following describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?

  • A. CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.
  • B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
  • C. CREATE STREAMING LIVE TABLE is redundant for DLT and it does not need to be used.
  • D. CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations.
  • E. CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.


Answer : B

A data engineer has joined an existing project and they see the following query in the project repository:

CREATE STREAMING LIVE TABLE loyal_customers AS
SELECT customer_id
FROM STREAM(LIVE.customers)
WHERE loyalty_level = 'high';

Which of the following describes why the STREAM function is included in the query?

  • A. The STREAM function is not needed and will cause an error.
  • B. The data in the customers table has been updated since its last run.
  • C. The customers table is a streaming live table.
  • D. The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.


Answer : C

Which of the following Git operations must be performed outside of Databricks Repos?

  • A. Commit
  • B. Pull
  • C. Merge
  • D. Clone


Answer : C

A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.
Which of the following approaches can the data engineer take to identify the table that is dropping the records?

  • A. They can set up separate expectations for each table when developing their DLT pipeline.
  • B. They cannot determine which table is dropping the records.
  • C. They can set up DLT to notify them via email when records are dropped.
  • D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.
  • E. They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.


Answer : D

A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.
Which of the following approaches can the data engineer use to set up the new task?

  • A. They can clone the existing task in the existing Job and update it to run the new notebook.
  • B. They can create a new task in the existing Job and then add it as a dependency of the original task.
  • C. They can create a new task in the existing Job and then add the original task as a dependency of the new task.
  • D. They can create a new job from scratch and add both tasks to run concurrently.
  • E. They can clone the existing task to a new Job and then edit it to run the new notebook.


Answer : B
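
To make the direction of the dependency in option B concrete, the sketch below models the two tasks in the shape of a Jobs API task list, where each task may name its prerequisites under `depends_on`. The task keys and notebook paths are hypothetical, and the run order is derived locally with Python's `graphlib` rather than by calling Databricks.

```python
from graphlib import TopologicalSorter

# Hypothetical task definitions: the new task has no dependencies, and the
# original task lists the new task as something it depends on.
tasks = [
    {
        "task_key": "fix_upstream_data",
        "notebook_task": {"notebook_path": "/Repos/project/fix_upstream_data"},
    },
    {
        "task_key": "original_task",
        "notebook_task": {"notebook_path": "/Repos/project/original_notebook"},
        "depends_on": [{"task_key": "fix_upstream_data"}],
    },
]


def run_order(task_list):
    """Return task keys in an order that respects every depends_on entry."""
    graph = {
        t["task_key"]: {d["task_key"] for d in t.get("depends_on", [])}
        for t in task_list
    }
    return list(TopologicalSorter(graph).static_order())


order = run_order(tasks)
```

Here `order` places `fix_upstream_data` ahead of `original_task`. Reversing the dependency, as in option C, would run the original task first.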

An engineering manager wants to monitor the performance of a recent project using a Databricks SQL query. For the first week following the project’s release, the manager wants the query results to be updated every minute. However, the manager is concerned that the compute resources used for the query will be left running and cost the organization a lot of money beyond the first week of the project’s release.
Which of the following approaches can the engineering team use to ensure the query does not cost the organization any money beyond the first week of the project’s release?

  • A. They can set a limit to the number of DBUs that are consumed by the SQL Endpoint.
  • B. They can set the query’s refresh schedule to end after a certain number of refreshes.
  • C. They cannot ensure the query does not cost the organization money beyond the first week of the project’s release.
  • D. They can set a limit to the number of individuals that are able to manage the query’s refresh schedule.
  • E. They can set the query’s refresh schedule to end on a certain date in the query scheduler.


Answer : E

A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.

Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?

  • A. CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    INNER JOIN SELECT * FROM april_transactions;
  • B. CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    UNION SELECT * FROM april_transactions;
  • C. CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    OUTER JOIN SELECT * FROM april_transactions;
  • D. CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    INTERSECT SELECT * from april_transactions;


Answer : B
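
The difference between UNION and INTERSECT can be checked with a small experiment. The sketch below uses Python's built-in SQLite as a stand-in for Spark SQL, so it only demonstrates the set semantics; the column names and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE march_transactions (id INTEGER, amount REAL);
    CREATE TABLE april_transactions (id INTEGER, amount REAL);
    INSERT INTO march_transactions VALUES (1, 10.0), (2, 20.0);
    INSERT INTO april_transactions VALUES (3, 30.0), (4, 40.0);

    -- Option B: UNION keeps every distinct row from both tables.
    CREATE TABLE all_transactions AS
        SELECT * FROM march_transactions
        UNION SELECT * FROM april_transactions;
    """
)

# Rows in the combined table.
union_count = conn.execute(
    "SELECT COUNT(*) FROM all_transactions"
).fetchone()[0]

# Option D for comparison: INTERSECT keeps only rows present in both tables.
intersect_count = conn.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT * FROM march_transactions
        INTERSECT SELECT * FROM april_transactions
    )
    """
).fetchone()[0]

conn.close()
```

With no overlap between the two months, the UNION table holds all four sample rows while the INTERSECT query returns none.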

A data engineer wants to schedule their Databricks SQL dashboard to refresh once per day, but they only want the associated SQL endpoint to be running when it is necessary.
Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

  • A. They can ensure the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints.
  • B. They can set up the dashboard’s SQL endpoint to be serverless.
  • C. They can turn on the Auto Stop feature for the SQL endpoint.
  • D. They can reduce the cluster size of the SQL endpoint.
  • E. They can ensure the dashboard’s SQL endpoint is not one of the included query’s SQL endpoint.


Answer : C

In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?

  • A. When another task needs to be replaced by the new task
  • B. When another task needs to successfully complete before the new task begins
  • C. When another task has the same dependency libraries as the new task
  • D. When another task needs to use as little compute resources as possible


Answer : B

Which of the following must be specified when creating a new Delta Live Tables pipeline?

  • A. A key-value pair configuration
  • B. At least one notebook library to be executed
  • C. A path to cloud storage location for the written data
  • D. A location of a target database for the written data


Answer : B

A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.
Which of the following actions can the data engineer perform to improve the start up time for the clusters used for the Job?

  • A. They can use endpoints available in Databricks SQL
  • B. They can use jobs clusters instead of all-purpose clusters
  • C. They can configure the clusters to be single-node
  • D. They can use clusters that are from a cluster pool
  • E. They can configure the clusters to autoscale for larger data sizes


Answer : D
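
For option D, a job cluster is drawn from a pool by referencing the pool in the cluster specification. The helper below is a minimal sketch of such a specification as a Python dict: the function name, the pool ID, the worker count, and the runtime version string are placeholder assumptions, not values from the question.

```python
def pooled_cluster_spec(instance_pool_id, num_workers=2,
                        spark_version="13.3.x-scala2.12"):
    """Build a cluster spec whose nodes are taken from an existing pool.

    All argument values are placeholders; the node type is not set here
    because a pooled cluster takes it from the pool.
    """
    return {
        "spark_version": spark_version,
        "instance_pool_id": instance_pool_id,
        "num_workers": num_workers,
    }


# Hypothetical pool ID, for illustration only.
spec = pooled_cluster_spec("example-pool-id")
```

Because the pool keeps idle instances ready, clusters built from a specification like this can skip most of the instance acquisition time.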

A new data engineering team, team, has been assigned to an ELT project. The new data engineering team will need full privileges on the database customers to fully manage the project.
Which of the following commands can be used to grant full permissions on the database to the new data engineering team?

  • A. GRANT USAGE ON DATABASE customers TO team;
  • B. GRANT ALL PRIVILEGES ON DATABASE team TO customers;
  • C. GRANT SELECT PRIVILEGES ON DATABASE customers TO teams;
  • D. GRANT SELECT CREATE MODIFY USAGE PRIVILEGES ON DATABASE customers TO team;
  • E. GRANT ALL PRIVILEGES ON DATABASE customers TO team;


Answer : E

A new data engineering team has been assigned to work on a project. The team will need access to database customers in order to see what tables already exist. The team has its own group team.
Which of the following commands can be used to grant the necessary permission on the entire database to the new team?

  • A. GRANT VIEW ON CATALOG customers TO team;
  • B. GRANT CREATE ON DATABASE customers TO team;
  • C. GRANT USAGE ON CATALOG team TO customers;
  • D. GRANT CREATE ON DATABASE team TO customers;
  • E. GRANT USAGE ON DATABASE customers TO team;


Answer : E
