Certified Data Engineer Associate v1.0


Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?

  • A. DROP
  • B. IGNORE
  • C. MERGE
  • D. APPEND
  • E. INSERT


Answer : C
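For reference, a minimal sketch of answer C in PySpark, assuming hypothetical Delta tables daily_sales and sales_updates keyed by sale_id:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # With WHEN NOT MATCHED THEN INSERT, only source rows whose key is
    # absent from the target are written, so re-running the statement
    # never produces duplicate records.
    spark.sql("""
        MERGE INTO daily_sales AS target
        USING sales_updates AS source
        ON target.sale_id = source.sale_id
        WHEN NOT MATCHED THEN INSERT *
    """)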

A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).
Which of the following code blocks creates this SQL UDF?

(The five answer options were code blocks presented as images and were not preserved; see the sketch after the answer.)


Answer : E
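The lost options aside, a SQL UDF on Databricks has the general shape sketched below; the function name clean_city and its logic are illustrative assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A SQL UDF is declared with CREATE FUNCTION ... RETURNS ... RETURN.
    spark.sql("""
        CREATE OR REPLACE FUNCTION clean_city(city STRING)
        RETURNS STRING
        RETURN INITCAP(TRIM(city))
    """)

    # Once registered, it is applied at scale like any built-in function.
    spark.sql("SELECT clean_city(city) FROM stores")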

A data analyst has a series of queries in a SQL program. The data analyst wants this program to run every day. They only want the final query in the program to run on Sundays. They ask for help from the data engineering team to complete this task.
Which of the following approaches could be used by the data engineering team to complete this task?

  • A. They could submit a feature request with Databricks to add this functionality.
  • B. They could wrap the queries using PySpark and use Python’s control flow system to determine when to run the final query.
  • C. They could only run the entire program on Sundays.
  • D. They could automatically restrict access to the source table in the final query so that it is only accessible on Sundays.
  • E. They could redesign the data model to separate the data used in the final query into a new table.


Answer : B
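A minimal sketch of answer B, wrapping the SQL program in PySpark and gating the final query with Python control flow (the SELECT 1 statements are placeholders for the analyst's real queries):

    import datetime

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The daily queries run unconditionally.
    spark.sql("SELECT 1")  # placeholder for the daily queries

    # weekday() numbers Monday as 0, so Sunday is 6; the final query
    # therefore runs only on Sundays.
    if datetime.date.today().weekday() == 6:
        spark.sql("SELECT 1")  # placeholder for the final query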

A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location "/transactions/raw".
Today, the data engineer runs the following command to complete this task:

(The statement was presented as an image and was not preserved; see the sketch after the answer.)

After running the command today, the data engineer notices that the number of records in table transactions has not changed.
Which of the following describes why the statement might not have copied any new records into the table?

  • A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
  • B. The names of the files to be copied were not included with the FILES keyword.
  • C. The previous day’s file has already been copied into the table.
  • D. The PARQUET file format does not support COPY INTO.
  • E. The COPY INTO statement requires the table to be refreshed to view the copied rows.


Answer : C
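The statement itself did not survive extraction, but from the question it was presumably a COPY INTO over the raw directory, roughly as sketched below. COPY INTO is idempotent: it tracks which source files it has already loaded, so re-running it copies nothing new, which is the behavior described in answer C. (FILEFORMAT = PARQUET is an assumption drawn from option D.)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Files copied on a previous run are skipped, so a re-run over the
    # same directory adds 0 rows to the table.
    spark.sql("""
        COPY INTO transactions
        FROM '/transactions/raw'
        FILEFORMAT = PARQUET
    """)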

A data engineer needs to create a table in Databricks using data from their organization’s existing SQLite database.
They run the following command:

(The CREATE TABLE statement, with a blank in its USING clause, was presented as an image and was not preserved; see the sketch after the answer.)

Which of the following lines of code fills in the above blank to successfully complete the task?

  • A. org.apache.spark.sql.jdbc
  • B. autoloader
  • C. DELTA
  • D. sqlite
  • E. org.apache.spark.sql.sqlite


Answer : A
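The command did not survive extraction, but it was presumably a CREATE TABLE over JDBC, roughly as sketched below; the database path and source table name are illustrative assumptions, and the SQLite JDBC driver must be available on the cluster. The blank is filled by org.apache.spark.sql.jdbc, Spark's generic JDBC data source; no org.apache.spark.sql.sqlite source exists.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Spark reads external databases through its generic JDBC source.
    spark.sql("""
        CREATE TABLE customers
        USING org.apache.spark.sql.jdbc
        OPTIONS (
          url 'jdbc:sqlite:/dbfs/tmp/company.db',
          dbtable 'customers'
        )
    """)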

A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.
Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?

  • A. CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    INNER JOIN SELECT * FROM april_transactions;
  • B. CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    UNION SELECT * FROM april_transactions;
  • C. CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    OUTER JOIN SELECT * FROM april_transactions;
  • D. CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    INTERSECT SELECT * from april_transactions;
  • E. CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    MERGE SELECT * FROM april_transactions;


Answer : B

A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is equal to 1 and the Python variable review_period is True.
Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?

  • A. if day_of_week = 1 and review_period:
  • B. if day_of_week = 1 and review_period = "True":
  • C. if day_of_week == 1 and review_period == "True":
  • D. if day_of_week == 1 and review_period:
  • E. if day_of_week = 1 & review_period: = "True":


Answer : D

(review_period holds the boolean True, so it should be tested for truthiness; comparing it to the string "True", as in option C, would never evaluate to true.)

A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data.
They run the following command:

DROP TABLE IF EXISTS my_table;
While the object no longer appears when they run SHOW TABLES, the data files still exist.
Which of the following describes why the data files still exist and the metadata files were deleted?

  • A. The table’s data was larger than 10 GB
  • B. The table’s data was smaller than 10 GB
  • C. The table was external
  • D. The table did not have a location
  • E. The table was managed


Answer : C
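A sketch of the behavior behind answer C: a table created with an explicit LOCATION is external, so DROP TABLE removes only its metastore entry (the path and schema here are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # An explicit LOCATION makes the table external: Spark manages the
    # metadata, but the data files belong to the caller.
    spark.sql("""
        CREATE TABLE my_table (id INT)
        LOCATION '/mnt/external/my_table'
    """)

    # Dropping it deletes the metadata; the files under
    # /mnt/external/my_table remain on storage.
    spark.sql("DROP TABLE IF EXISTS my_table")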

A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location.
Which of the following data entities should the data engineer create?

  • A. Database
  • B. Function
  • C. View
  • D. Temporary view
  • E. Table


Answer : E
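A sketch of answer E: a table built from the source tables is materialized to storage and registered in the metastore, so it persists across sessions and users, unlike a view (logical only) or a temporary view (session-scoped). Table names are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # CREATE TABLE AS SELECT writes the result to a physical location
    # and makes it available in any session.
    spark.sql("""
        CREATE TABLE combined_customers AS
        SELECT * FROM customers_east
        UNION
        SELECT * FROM customers_west
    """)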

A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.
Which of the following tools can the data engineer use to solve this problem?

  • A. Unity Catalog
  • B. Data Explorer
  • C. Delta Lake
  • D. Delta Live Tables
  • E. Auto Loader


Answer : D
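A sketch of answer D in Python: Delta Live Tables expectations declare quality rules on a dataset, and the pipeline records violations automatically (dataset and rule names are illustrative; this code runs only inside a DLT pipeline):

    import dlt

    # Each expectation declares a quality rule; violation counts are
    # recorded automatically in the pipeline's event log, and
    # expect_or_drop additionally filters out the failing rows.
    @dlt.table
    @dlt.expect("valid_id", "id IS NOT NULL")
    @dlt.expect_or_drop("valid_amount", "amount > 0")
    def clean_sales():
        return dlt.read("raw_sales")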

A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.
The pipeline is configured to run in Production mode using Continuous Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

  • A. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
  • B. All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
  • C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
  • D. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
  • E. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.


Answer : C

(Continuous Pipeline Mode keeps updating the datasets until the pipeline is shut down, and Production mode terminates the compute resources once the pipeline is stopped.)

In order for Structured Streaming to reliably track the exact progress of processing, so that it can handle any kind of failure by restarting and/or reprocessing, which pair of approaches does Spark use to record the offset range of the data being processed in each trigger?

  • A. Checkpointing and Write-ahead Logs
  • B. Structured Streaming cannot record the offset range of the data being processed in each trigger.
  • C. Replayable Sources and Idempotent Sinks
  • D. Write-ahead Logs and Idempotent Sinks
  • E. Checkpointing and Idempotent Sinks


Answer : A
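A sketch of answer A: the checkpoint directory is where Structured Streaming keeps its write-ahead log of offset ranges, so a restarted query resumes exactly where it failed (paths are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Before each trigger, the offset range to be processed is written
    # ahead to the checkpoint; after a failure, the query replays from
    # the recorded offsets instead of losing or duplicating progress.
    (
        spark.readStream.format("delta").load("/data/events")
        .writeStream.format("delta")
        .option("checkpointLocation", "/checkpoints/events")
        .start("/data/events_out")
    )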

Which of the following describes the relationship between Gold tables and Silver tables?

  • A. Gold tables are more likely to contain aggregations than Silver tables.
  • B. Gold tables are more likely to contain valuable data than Silver tables.
  • C. Gold tables are more likely to contain a less refined view of data than Silver tables.
  • D. Gold tables are more likely to contain more data than Silver tables.
  • E. Gold tables are more likely to contain truthful data than Silver tables.


Answer : A

(Gold tables hold business-level, often aggregated views built from the more granular Silver tables; option C describes the opposite relationship.)

Which of the following describes the relationship between Bronze tables and raw data?

  • A. Bronze tables contain less data than raw data files.
  • B. Bronze tables contain more truthful data than raw data.
  • C. Bronze tables contain aggregates while raw data is unaggregated.
  • D. Bronze tables contain a less refined view of data than raw data.
  • E. Bronze tables contain raw data with a schema applied.


Answer : E

(Bronze tables hold the raw data as ingested, with a schema applied; they are not aggregated or more refined than the source.)

Which of the following tools is used by Auto Loader to process data incrementally?

  • A. Checkpointing
  • B. Spark Structured Streaming
  • C. Data Explorer
  • D. Unity Catalog
  • E. Databricks SQL


Answer : B
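A sketch of answer B: Auto Loader is the cloudFiles source of Spark Structured Streaming, so its incremental behavior comes from the streaming engine and its checkpoint (paths, format, and table name are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # cloudFiles is a Structured Streaming source; the checkpoint tracks
    # which input files were already ingested, so each run processes
    # only new files.
    (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/schemas/landing")
        .load("/landing/raw")
        .writeStream
        .option("checkpointLocation", "/checkpoints/landing")
        .trigger(availableNow=True)
        .toTable("bronze_events")
    )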
