Use Databricks-Certified-Professional-Data-Engineer Exam Dumps (2023 PDF Dumps) To Have Reliable Databricks-Certified-Professional-Data-Engineer Test Engine [Q19-Q38]

NO.19 Which of the following statements is correct about cluster pools?

Cluster pools allow you to perform load balancing

Cluster pools allow you to create a cluster

Cluster pools allow you to save time when starting a new cluster

Cluster pools are used to share resources among multiple teams

Cluster pools allow you to have all the nodes in the cluster from a single physical server rack

NO.20 You are working on a project that requires both SQL and Python in a given notebook. What would be your approach?

Create two separate notebooks, one for SQL and the second for Python

A single notebook can support multiple languages, use the magic command to switch between the two.

Use an all-purpose cluster for Python and a SQL endpoint for SQL

Use a job cluster to run Python and a SQL endpoint for SQL

NO.21 Which of the following benefits does Delta Live Tables provide for ELT pipelines over standard data pipelines
that utilize Spark and Delta Lake on Databricks?

The ability to write pipelines in Python and/or SQL

The ability to declare and maintain data table dependencies

The ability to automatically scale compute resources

The ability to access previous versions of data tables

The ability to perform batch and streaming queries

NO.22 Let A denote the event ‘student is female’ and let B denote the event ‘student is French’. In a class of 100 students,
suppose 60 are French, and suppose that 10 of the French students are female. Find the probability that a randomly
picked French student is female, that is, find P(A|B).

1/3

2/3

1/6

2/6
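
The arithmetic behind this question can be checked with a short sketch. It applies the definition P(A|B) = P(A and B) / P(B) to the counts given in the question; the variable names are illustrative only.

```python
from fractions import Fraction

# Counts taken from the question text.
total_students = 100
french = 60               # event B
french_and_female = 10    # event A and B together

# P(A|B) = P(A and B) / P(B)
p_b = Fraction(french, total_students)
p_a_and_b = Fraction(french_and_female, total_students)
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)  # 1/6
```

Because the class size cancels out, the same value follows directly from 10 female students among the 60 French students.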

NO.23 John Smith, a newly joined member of the Marketing team, currently has no access to the data and requires read access to the customers table. Which of the following statements can be used to grant access?

GRANT SELECT, USAGE TO [email protected] ON TABLE customers

GRANT READ, USAGE TO [email protected] ON TABLE customers

GRANT SELECT, USAGE ON TABLE customers TO [email protected]

GRANT READ, USAGE ON TABLE customers TO [email protected]

GRANT READ, USAGE ON customers TO [email protected]

NO.24 Which of the following Python statements can be used to substitute the schema name and table name into the query statement?

table_name = "sales"
schema_name = "bronze"
query = f"select * from schema_name.table_name"

table_name = "sales"
schema_name = "bronze"
query = "select * from {schema_name}.{table_name}"

table_name = "sales"
schema_name = "bronze"
query = f"select * from { schema_name}.{table_name}"

table_name = "sales"
schema_name = "bronze"
query = f"select * from + schema_name +"."+table_name"
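
As a quick illustration of how Python string formatting behaves here, the sketch below contrasts an f-string with a plain string that contains the same braces. It uses the variable names from the options above; the name `plain` is introduced only for comparison.

```python
table_name = "sales"
schema_name = "bronze"

# An f-string evaluates the expressions inside the braces.
query = f"select * from {schema_name}.{table_name}"

# Without the f prefix, the braces stay in the string as literal text.
plain = "select * from {schema_name}.{table_name}"

print(query)  # select * from bronze.sales
print(plain)  # select * from {schema_name}.{table_name}
```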

NO.25 Which of the following commands can be used to run one notebook from another notebook?

notebook.utils.run("full notebook path")

execute.utils.run("full notebook path")

dbutils.notebook.run("full notebook path")

Only job clusters can run a notebook

spark.notebook.run("full notebook path")

NO.26 If E1 and E2 are two events, how do you represent the conditional probability that E2 occurs given that
E1 has occurred?

P(E1)/P(E2)

P(E1+E2)/P(E1)

P(E2)/P(E1)

P(E2)/P(E1+E2)

NO.27 Which of the following SQL statements can substitute Python variables into Databricks SQL code when the notebook is set to SQL mode?
%python
table_name = "sales"
schema_name = "bronze"

%sql
SELECT * FROM ____________________

SELECT * FROM f{schema_name.table_name}

SELECT * FROM {schema_name.table_name}

SELECT * FROM ${schema_name}.${table_name}

SELECT * FROM schema_name.table_name

NO.28 A table customerLocations exists with the following schema:
id STRING,
date STRING,
city STRING,
country STRING
A senior data engineer wants to create a new table from this table using the following command:
CREATE TABLE customersPerCountry AS
SELECT country,
       COUNT(*) AS customers
FROM customerLocations
GROUP BY country;
A junior data engineer asks why the schema is not being declared for the new table. Which of the following
responses explains why declaring the schema is not necessary?

CREATE TABLE AS SELECT statements result in tables that do not support schemas

CREATE TABLE AS SELECT statements assign all columns the type STRING

CREATE TABLE AS SELECT statements adopt schema details from the source table and query

CREATE TABLE AS SELECT statements infer the schema by scanning the data

CREATE TABLE AS SELECT statements result in tables where schemas are optional

NO.29 Which of the following SQL statements can be used to query a table while eliminating duplicate rows from the query results?

SELECT DISTINCT * FROM table_name

SELECT DISTINCT * FROM table_name HAVING COUNT(*) > 1

SELECT DISTINCT_ROWS (*) FROM table_name

SELECT * FROM table_name GROUP BY * HAVING COUNT(*) < 1

SELECT * FROM table_name GROUP BY * HAVING COUNT(*) > 1
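
To see what duplicate elimination looks like in practice, here is a minimal sketch that runs `SELECT DISTINCT *` against an in-memory SQLite database. SQLite is only a stand-in for the Databricks SQL engine, and the columns and rows are made up for the illustration.

```python
import sqlite3

# In-memory database used purely for illustration (not Databricks).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [(1, "Paris"), (1, "Paris"), (2, "Lyon")],  # first two rows are duplicates
)

all_rows = conn.execute("SELECT * FROM table_name").fetchall()
distinct_rows = conn.execute(
    "SELECT DISTINCT * FROM table_name ORDER BY id"
).fetchall()

print(len(all_rows))   # 3
print(distinct_rows)   # [(1, 'Paris'), (2, 'Lyon')]
conn.close()
```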

NO.30 What is the main difference between the silver layer and the gold layer in the medallion architecture?

Silver may contain aggregated data

Gold may contain aggregated data

Data quality checks are applied in gold

Silver is a copy of bronze data

Gold is a copy of silver data

NO.31 A junior data engineer has ingested a JSON file into a table raw_table with the following schema:
cart_id STRING,
items ARRAY<item_id:STRING>
The junior data engineer would like to unnest the items column in raw_table to result in a new table with the
following schema:
cart_id STRING,
item_id STRING
Which of the following commands should the junior data engineer run to complete this task?

SELECT cart_id, flatten(items) AS item_id
FROM raw_table;

SELECT cart_id, reduce(items) AS item_id
FROM raw_table;

SELECT cart_id, slice(items) AS item_id
FROM raw_table;

SELECT cart_id, filter(items) AS item_id
FROM raw_table;

SELECT cart_id, explode(items) AS item_id
FROM raw_table;
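
The unnesting described in the question produces one output row per array element. The sketch below mimics that behaviour in plain Python, not Spark; the sample rows and the helper `explode_items` are hypothetical and exist only for the illustration.

```python
# Hypothetical sample data shaped like raw_table (cart_id, items).
raw_table = [
    {"cart_id": "c1", "items": ["i1", "i2"]},
    {"cart_id": "c2", "items": ["i3"]},
    {"cart_id": "c3", "items": []},
]

def explode_items(rows):
    """Return one (cart_id, item_id) row per element of each items array."""
    return [
        {"cart_id": row["cart_id"], "item_id": item}
        for row in rows
        for item in row["items"]
    ]

exploded = explode_items(raw_table)
for row in exploded:
    print(row)
```

In this sketch a cart with an empty items array contributes no output rows, which matches how Spark's explode treats empty arrays.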

NO.32 Which of the following is a correct statement on how data is organized in storage when managing a Delta table?

All of the data is broken down into one or many parquet files, log files are broken down into one or many JSON files, and each transaction creates a new data file(s) and log file.
(Correct)

All of the data and log are stored in a single parquet file

All of the data is broken down into one or many parquet files, but the log file is stored as a single json file, and every transaction creates a new data file(s) and log file gets appended.

All of the data is broken down into one or many parquet files, log file is removed once the transaction is committed.

All of the data is stored into one parquet file, log files are broken down into one or many json files.

NO.33 Which of the following commands results in the successful creation of a view on top of a Delta stream (a stream on a Delta table)?

spark.read.format("delta").table("sales").createOrReplaceTempView("streaming_vw")

spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_vw")

spark.read.format("delta").table("sales").mode("stream").createOrReplaceTempView("streaming_vw")

spark.read.format("delta").table("sales").trigger("stream").createOrReplaceTempView("streaming_vw")

spark.read.format("delta").stream("sales").createOrReplaceTempView("streaming_vw")

You cannot create a view on a streaming data source.

NO.34 Which of the following Structured Streaming queries is performing a hop from a Bronze table to a Silver
table?

(spark.table("sales")
    .groupBy("store")
    .agg(sum("sales"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .table("aggregatedSales")
)

(spark.table("sales")
    .withColumn("avgPrice", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("cleanedSales")
)

(spark.readStream.load(rawSalesLocation)
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("uncleanedSales")
)

(spark.table("sales")
    .agg(sum("sales"),
         sum("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .table("aggregatedSales")
)

(spark.read.load(rawSalesLocation)
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("append")
    .table("uncleanedSales")
)

NO.35 How do you handle failures gracefully when writing code in PySpark? Fill in the blanks to complete the statement below.
_____
    spark.read.table("table_name").select("column").write.mode("append").saveAsTable("new_table_name")
_____
    print(f"query failed")

try: failure:

try: catch:

try: except:

try: fail:

try: error:
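
For reference, Python's general pattern for catching a failure and reporting it looks like the sketch below. The function `append_rows` is a hypothetical stand-in for the Spark write in the question, so the example runs without a cluster.

```python
def append_rows(target, rows):
    """Hypothetical stand-in for the Spark write shown in the question."""
    if rows is None:
        raise ValueError("nothing to write")
    target.extend(rows)

def safe_append(target, rows):
    """Attempt the write; report a failure instead of letting it propagate."""
    try:
        append_rows(target, rows)
    except Exception as exc:
        print(f"query failed: {exc}")
        return False
    return True

table = []
print(safe_append(table, [1, 2]))  # True
print(safe_append(table, None))    # prints "query failed: nothing to write", then False
```

Catching the broad `Exception` class keeps the sketch short; real pipeline code would usually catch a narrower exception type.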

NO.36 What is the main difference between AUTO LOADER and COPY INTO?

COPY INTO supports schema evolution.

AUTO LOADER supports schema evolution.

COPY INTO supports file notification when performing incremental loads.

AUTO LOADER supports reading data from Apache Kafka

AUTO LOADER supports file notification when performing incremental loads.

Explanation
Auto Loader supports both directory listing and file notification, but COPY INTO only supports directory listing.
In file notification mode, Auto Loader automatically sets up a notification service and a queue service that subscribe to file events from the input directory in cloud object storage such as Azure Blob Storage or S3. File notification mode is more performant and scalable for large input directories or a high volume of files.

Auto Loader and Cloud Storage Integration
Auto Loader supports two ways to ingest data incrementally:
1. Directory listing: lists the input directory and maintains state in RocksDB; supports incremental file listing.
2. File notification: uses a trigger plus a queue to store file notifications, which can later be used to retrieve the files. Unlike directory listing, file notification can scale to millions of files per day.
[OPTIONAL]
Auto Loader vs COPY INTO?
Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup. Auto Loader provides a new Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.
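
A minimal sketch of how a cloudFiles stream might be configured is shown below. It assumes a Databricks runtime where `spark` is an active SparkSession; the helper name `build_auto_loader_stream`, the JSON file format, and both path parameters are assumptions made for the illustration, not details from the explanation above.

```python
def build_auto_loader_stream(spark, input_path, schema_location, use_notifications=False):
    """Return a streaming reader that uses the Auto Loader (cloudFiles) source.

    `spark` is assumed to be an active SparkSession on Databricks.
    `input_path` and `schema_location` are hypothetical cloud storage paths.
    """
    reader = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")  # assumed file format
        .option("cloudFiles.schemaLocation", schema_location)
    )
    if use_notifications:
        # Switch from the default directory listing to file notification mode.
        reader = reader.option("cloudFiles.useNotifications", "true")
    return reader.load(input_path)
```

Leaving `use_notifications` at its default keeps directory listing mode, which needs no extra cloud permissions; setting it to true opts into file notification mode for large or busy input directories.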
When to use Auto Loader instead of COPY INTO?
* You want to load data from a file location that contains files in the order of millions or higher. Auto Loader can discover files more efficiently than the COPY INTO SQL command and can split file processing into multiple batches.
* You do not plan to load subsets of previously uploaded files. With Auto Loader, it can be more difficult to reprocess subsets of files. However, you can use the COPY INTO SQL command to reload subsets of files while an Auto Loader stream is simultaneously running.
Here are some additional notes on when to use COPY INTO vs Auto Loader:
When to use COPY INTO
https://docs.databricks.com/delta/delta-ingest.html#copy-into-sql-command
When to use Auto Loader
https://docs.databricks.com/delta/delta-ingest.html#auto-loader

NO.37 You still notice slow queries after running OPTIMIZE, which resolved the small files problem. The column (transactionId) you use to filter the data has high cardinality and is an auto-incrementing number. Which Delta optimization can you enable to filter data effectively based on this column?

Create a BLOOM FILTER index on transactionId

Perform OPTIMIZE with ZORDER on transactionId
(Correct)

transactionId has high cardinality, you cannot enable any optimization.

Increase the cluster size and enable delta optimization

Increase the driver size and enable delta optimization