SQL with Manoj | SQL Server, SQL Queries, DB concepts, Azure, Spark SQL, Tips & Tricks with 500 articles

Web Name: SQL with Manoj | SQL Server, SQL Queries, DB concepts, Azure, Spark SQL, Tips & Tricks with 500 articles

WebSite: http://sqlwithmanoj.com

ID:74993

Keywords:

Server, Queries, DB

Description:

SQL with Manoj | SQL Server, SQL Queries, DB concepts, Azure, Spark SQL, Tips & Tricks with 500 articles!!!

I was creating a Hive table in a Databricks Notebook from Parquet files located in Azure Data Lake Store with the following commands:

val df = spark.read.parquet("abfss://adlsstore@MyStorageAccount.dfs.core.windows.net/x/y/z/*.parquet")
df.write.mode("overwrite").saveAsTable("tblOrderDetail")

But I was getting the following error:

warning: there was one feature warning; re-run with -feature for details
java.lang.UnsupportedOperationException: Parquet does not support decimal. See HIVE-6384

The error points to a Hive version conflict, so I checked the Hive metastore version by running the command below and found that it was pointing to an old version (0.13.0), which does not support the BINARY datatypes for Parquet-formatted files:

spark.conf.get("spark.sql.hive.metastore.version")

Also, as per the Jira task HIVE-6384, support for these datatypes was implemented for the Parquet SerDe in Hive version 1.2.0. So, to move the Hive metastore to the current version, you just need to add the two lines below to the configuration of the cluster you are using: click Clusters, click Edit at the top, expand Advanced Options, and under the Spark tab add them to the Spark Config box:

spark.sql.hive.metastore.version 1.2.1
spark.sql.hive.metastore.jars builtin

Then just restart the cluster so that the new settings are in use. Some similar errors: "Parquet does not support date", "Parquet does not support timestamp".
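If you prefer to stay in a SQL cell rather than Scala, here is a minimal sketch of a comparable route; it reuses the illustrative ADLS path and table name from above, and simply registers a table over the Parquet folder instead of calling saveAsTable:

-- Check which Hive metastore version the cluster is using (SET with no value prints it)
SET spark.sql.hive.metastore.version;

-- Register a table directly over the Parquet folder
CREATE TABLE tblOrderDetail
USING PARQUET
LOCATION 'abfss://adlsstore@MyStorageAccount.dfs.core.windows.net/x/y/z/';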
I was setting up an Azure Data Factory (ADF) pipeline to copy files from Azure Data Lake Storage Gen1 to Gen2, but while running the pipeline it was failing with the error below:

Operation on target Copy_sae failed: Failure happened on 'Sink' side. ErrorCode=DelimitedTextMoreColumnsThanDefined, Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException, Message=Error found when processing 'Csv/Tsv Format Text' source '0_2019_11_09_01_43_32.avro' with row number 53: found more columns than expected column count: 27., Source=Microsoft.DataTransfer.Common,

After some research I figured out that it was because I had not selected the Binary Copy option while creating the Copy Data activity. Root cause: if the folder you are copying contains files with different schemas, such as a variable number of columns, different delimiters, quote-char settings, or some data issue, the ADF pipeline will end up failing with this error. So, for bulk copying or migrating your data from one Data Lake to another, choose this option; that way ADF won't open the files to read their schema, it simply treats every file as binary and copies it to the other location. Hope this helps!

There are times when you are left with little or no space on your SQL Server VM disks, and the main cause of this is not only the large amount of data coming in but also the heavy processing your database server is doing, which fills up the entire log space. SQL Server provides some effective commands to shrink the database files that got inflated due to heavy processing and to make room for new data:

ALTER DATABASE db_name SET RECOVERY SIMPLE;
DBCC SHRINKFILE (N'log_file_name', 100);

But this works on one database at a time, so what if you have lots of databases? It is tiring and time-consuming to visit each DB, get the database file names, and apply the shrink command, and it is even slower to do the same via the SSMS GUI. Note: do not change the recovery mode in a PROD environment unless it is really required. With the T-SQL statements below you can generate scripts for all the databases and run them to change all database settings at once:

-- Generate SQL Script to change Recovery mode to 'SIMPLE' for all DBs:
SELECT 'ALTER DATABASE [' + name + '] SET RECOVERY SIMPLE;' as SimpleRecovery4AllDBs
FROM sys.databases
WHERE recovery_model_desc <> 'SIMPLE'

-- Generate SQL Script to Shrink log files of all DBs:
SELECT 'USE ' + DB_Name(database_id) + '; DBCC SHRINKFILE (N''' + name + ''' , 100);' as ShrinkAllDBs
FROM sys.master_files
WHERE database_id > 4 and [type] = 1

Please let me know if you have any other (and more effective) way to do the same!!!
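One possible answer to that question: instead of copying the generated statements back into a query window, you can build and execute them in one go with dynamic SQL. This is a minimal sketch under the same assumptions as the scripts above (100 MB target size, system databases skipped); the ONLINE-state filter is an extra precaution I have added:

-- Build one batch of USE + DBCC SHRINKFILE statements for every
-- log file of every online user database, then execute it.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql
    + N'USE ' + QUOTENAME(DB_NAME(mf.database_id)) + N'; '
    + N'DBCC SHRINKFILE (N''' + mf.name + N''', 100); '
FROM sys.master_files mf
JOIN sys.databases d ON d.database_id = mf.database_id
WHERE mf.database_id > 4        -- skip system databases
  AND mf.[type] = 1             -- 1 = log files
  AND d.state_desc = 'ONLINE';  -- avoid failing on offline DBs

EXEC sys.sp_executesql @sql;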
While creating a Python notebook and running it on my Databricks cluster, I observed the following error:

Your administrator has only allowed sql and scala commands on this cluster. This execution contained at least one disallowed language.

Obviously the error is due to some restriction applied at the cluster level. So I went to the cluster settings page, checked the Spark Config, and found the key-value configuration setting below:

spark.databricks.repl.allowedLanguages sql,scala

So if you want to run other languages like Python or R you can remove the entire line, or if you want to restrict other language(s), change it as per your needs.

While working on some data analysis I saw that one Spark SQL query was not getting me the expected results. The table had a good amount of data and I was filtering on a value, but some records were missing. So I checked online and found that Spark SQL works differently from SQL Server in this case: comparing columns or variables of two different datatypes. To reproduce the scenario I am populating some test data, inserting 9 rows and storing decimal values as strings:

CREATE OR REPLACE TEMPORARY VIEW vwTestDataType as
select * from values
('row1', '2.0'), ('row2', '1.5'), ('row3', '1.0'),
('row4', '0.8'), ('row5', '0.6'), ('row6', '0.4'),
('row7', '0.2'), ('row8', '0.0'), ('row9', null);

describe vwTestDataType;

col_name | data_type | comment
col1     | string    | null
col2     | string    | null

Now I'll create a query similar to the one where I observed the issue. The query below should return 7 rows, but instead it returns just 3:

select * from vwTestDataType where col2 > 0

Running the above query in SQL Server throws the error below for the same dataset:

Conversion failed when converting the varchar value '2.0' to data type int.

Let's check why the Spark SQL query didn't fail and why it behaves like this. I will use the EXPLAIN EXTENDED operator to see what happens to the query while the logical plan is created:

explain extended select * from vwTestDataType where col2 > 0

In the plan below you can see, under the Analyzed Logical Plan, that the column col2 is implicitly typecast to INT because the comparison value is of INT type. Thus it converts all 0.x values to 0 and filters them out:

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('col2 > 0)
   +- 'UnresolvedRelation `vwTestDataType`

== Analyzed Logical Plan ==
col1: string, col2: string
Project [col1#13284, col2#13285]
+- Filter (cast(col2#13285 as int) > 0)
   +- SubqueryAlias `vwtestdatatype`
      +- Project [col1#13284, col2#13285]
         +- LocalRelation [col1#13284, col2#13285]

== Optimized Logical Plan ==
LocalRelation [col1#13284, col2#13285]

== Physical Plan ==
LocalTableScan [col1#13284, col2#13285]

Now, to avoid this issue you must explicitly typecast the column and the value to the same datatype to get the expected result. Here we should convert the string column value to DOUBLE; this way the query returns all 7 rows as expected:

select * from vwTestDataType where double(col2) > double(0)
--OR--
select * from vwTestDataType where col2 > 0.0

Let's again check the logical plan of the modified query by using the EXPLAIN EXTENDED operator:

explain extended select * from vwTestDataType where double(col2) > double(0)
--OR--
explain extended select * from vwTestDataType where col2 > 0.0

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('double('col2) > double(0))
   +- 'UnresolvedRelation `vwTestDataType`

== Analyzed Logical Plan ==
col1: string, col2: string
Project [col1#13213, col2#13214]
+- Filter (cast(col2#13214 as double) > cast(0 as double))
   +- SubqueryAlias `vwtestdatatype`
      +- Project [col1#13213, col2#13214]
         +- LocalRelation [col1#13213, col2#13214]

== Optimized Logical Plan ==
LocalRelation [col1#13213, col2#13214]

== Physical Plan ==
LocalTableScan [col1#13213, col2#13214]

So while working with Spark SQL we should make sure there are no such datatype conflicts; moreover, these kinds of issues should be handled at the very beginning, while modelling the tables with the correct datatypes.
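For completeness, the same fix can also be written with ANSI CAST syntax, which Spark SQL accepts as well; this is a small sketch equivalent to the double() shorthand above:

-- Cast both sides to DOUBLE explicitly so no implicit INT cast kicks in;
-- this returns the same 7 rows as the double() variant.
select * from vwTestDataType where cast(col2 as double) > cast(0 as double);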

TAGS: Server, Queries, DB


Technical articles on SQL Server, TSQL, SQL, Database Development, Database Design, Data Warehousing, ETL, Unix, Linux and C++ (CPP)
