Tuesday 1 October 2013

How to handle source records containing delimiters in the delimited Flat File connection manager

Introduction

It's a common scenario in ETL to find the delimiter character itself inside the source records while exporting data to a flat file. The flat file will be exported without any error, but while reading it back the embedded delimiters will split that row into more columns than expected.
 
Without wasting more words, I will now show how to avoid such situations.
 

Solution

To avoid such problems you need to add a text qualifier in the Flat file connection manager.
 
In the following example the source data is present inside SQL Server.
We will use double quotes (") as the text qualifier. So while importing the data we will replace each double quote inside the data with two double quotes.
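For example, a query of the following shape can be used in the source (the table and column names here are illustrative, not from the original post):

SELECT Col1,
       REPLACE(Col2, '"', '""') AS Col2  -- double up any embedded text qualifiers
FROM SourceTable;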
 
 
The query is placed in the OLE DB source. The records pulled are then exported to a comma-delimited flat file, and the text qualifier of the Flat File connection manager is set to double quotes (").
 
 
The same connection manager should be used while reading the file back; the data viewer then shows the records with the correct number of columns and the embedded delimiters intact.
 
 

Saturday 28 September 2013

SQL Server, SSIS, SSRS, SSAS and DW interview questions...


61. How to remove Sort transformation? 

Ans. You can avoid the Sort transformation when using OLE DB sources: use an ORDER BY clause to sort the input records, then set the IsSorted property of the source output to True and set the SortKeyPosition of the ORDER BY columns, as in the sketch below.
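For example, a hedged sketch using the emp table from the FOR XML examples later in this post:

-- OLE DB source query; the sorting is done by the database engine
SELECT EmpName, DeptID, Sal
FROM emp
ORDER BY DeptID, EmpName;

-- Then, in the Advanced Editor of the OLE DB source:
--   set IsSorted = True on the source output, and
--   set SortKeyPosition = 1 on DeptID and 2 on EmpName.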

62. How to put if else in Data Flow and Control Flow? 

Ans. Control Flow - precedence constraints

Data Flow - Conditional Split transformation

63. How to set the Buffer in Data Flow? 

Ans. Each DFT has two properties, DefaultBufferMaxRows and DefaultBufferSize, which control respectively the maximum number of rows a single buffer can hold and the maximum size of a single buffer. DefaultBufferMaxRows has a default value of 10,000 rows and DefaultBufferSize has a default value of 10 MB. Data that cannot be put in the buffer, e.g. BLOB data, is spooled to the file system.

64. How to ignore the failure in Control Flow and Data Flow? 

Ans. Control Flow - set the precedence constraint to evaluate on Completion instead of Success.

Data Flow - configure the error output of the component to redirect (or ignore) rows on failure.

65. How to migrate DTS packages to SSIS packages? 

Ans. Open BIDS, go to the Project menu and click on Migrate DTS 2000 Package. Give the source location and destination location and complete the wizard steps.

66. How do you upgrade a package from SSIS 2005 to SSIS 2008 R2?

Ans. Open the SSIS 2005 project in BIDS 2008 R2; the SSIS Package Upgrade Wizard starts and upgrades the packages. It can also be launched by right-clicking the SSIS Packages node in Solution Explorer and choosing Upgrade All Packages.

67. How to create a SSAS Cube and deploy it?(Building a cube with data source) 

Ans. 1. First create an Analysis Services project.

2. Create a Data Source (a connection to the server and database, with authentication provided) by right-clicking on the Data Sources folder of the project, which starts a wizard. (Extension will be .ds)

3. Create a Data Source View (add tables from one or more databases, along with the tables related to them; filters can also be added) by right-clicking on the Data Source Views folder of the project, which starts a wizard. (Extension will be .dsv)

4. Right-click on Cubes and click on New Cube, which starts the Cube Wizard. The wizard asks us to select the build method; then we have to check the measure group tables, the relevant measures for those measure group tables, and the dimension tables. (Extension will be .cube)

5. To hide particular measure groups and dimensions from particular users, create perspectives. We can rename an already created perspective, and for a new perspective we can check and uncheck items based on the requirement.

6. In BIDS, we can build and then deploy the cube. After that we can right-click on the particular cube and browse it. The browse window has drag-and-drop row fields and column fields: into row fields we basically drag dimension attributes, and into column fields we drag fact measures.

68. How to create a SSAS Cube and deploy it?(Building a cube without data source) 

Ans. 1. Right-click on Cubes. The Cube Wizard asks for 'Select build method - with or without using data source'. Click on 'Build the cube without using data source'.

2. Then define the measures.

3. Define dimensions and dimension usage and complete the wizard.

4. Now create the data source view and create a data source in it. The wizard that pops up this time is called the Schema Generation Wizard.

5. Populate the data source with data so that the cube can be populated.

6. Process the cube and view the cube.

69. How to create an SSRS report using SSAS?   

Ans. 1. In the data source, set the source type to Analysis Services and give the Analysis Server name.

2. In the dataset, click on Query Builder. The query builder opens the SSAS query designer window.

3. Then we can drag and drop fields onto the results pane of the query designer as per our requirement.

4. There are several buttons in the toolbar of the query builder that shape the MDX, like Add Calculated Member, Show Empty Cells, Auto Execute, Show Aggregates.

5. The Auto Execute button turns the automatic execution of queries on and off.

6. Show Empty Cells works the same way as the MDX NON EMPTY clause: basically it shows/hides empty cell values.

7. In the upper half of the design area the query designer contains a filter grid, which is useful for filtering the results and serves as an easy way of creating parameters.

8. Clicking OK in the SSAS query builder generates the MDX query in the query string of the dataset.

9. Now we can make any type of report using the fields provided by the dataset.

70. How to create parameters/Multivalued parameters in SSRS report using SSAS? 

Ans. 1. We can put all the values in the filter expression and check the parameter checkbox.

2. By default a report parameter will be created, together with a hidden dataset for the parameter.

3. If we right-click on Datasets and click on 'Show Hidden Datasets', we will see that dataset.

4. But on previewing the report we will get two 'all' values: one 'Select All' and another 'All'. 'Select All' comes from the multivalue property of the SSRS parameter, while the second one is driven from the MDX query that SSRS runs to populate the parameter list.

5. We can remove the second one by making a change in the MDX query, i.e. changing Parameters.AllMembers to Parameters.Children, or by filtering on the condition ParameterLevel > 0 (because the 'All' member of the MDX query has ParameterLevel 0).

71. What is Cube Scheduling?

Ans. When we deploy an SSAS project, the data sources and data source views are put in the local instance of SQL Server Analysis Services. Whatever changes are made to the production database will not be reflected in the Analysis Services instance until the cube is processed again, so we have to process the cube regularly.

We can process the cube manually (using Management Studio), or:

1) In order to schedule the processing of the cubes at a particular time, we can make use of the Analysis Services Processing Task in SSIS and give the cube name. Then we can make a job scheduling this package.

2) Right-click on the cube and click on Process. Click on Script, then Copy to Clipboard, and cancel. Then make a new job, paste the script into the command, and schedule the job.

72. What are Dimensions?

Ans. A dimension is a context by which we slice and dice data in a DW. Types: 1) Conformed 2) Junk 3) Role-playing

73. What are Conformed Dimensions?

Ans. Two dimensions are said to be conformed if they are completely identical or strict mathematical subsets of each other. Conformed dimensions also have the same row headers and labels. E.g. the Time dimension used for Fact1 and the Time dimension used for Fact2.

74. What are Junk Dimensions?

Ans. After database modelling, if there are some attributes which are not linked to any of the dimensions or facts, are of less importance, and carry few distinct values (mostly flags or indicators), then we can go for one dimension which carries all these attributes. Such dimensions are known as junk dimensions. Benefit: one dimension holding all the attributes performs better than an individual dimension for each attribute. Disadvantage: if in future we think of making one of the attributes a fully fledged dimension, we have to restructure the cube.

75. What are Degenerate Dimensions?

Ans. When we have columns on the fact table that do not relate to any of the dimensions but are very much required for analysis, e.g. a transaction number or a ticket number, they are called degenerate dimensions. The attribute stays with the fact (the fact will have multiple records for one transaction id) and has no dimension table or further attributes of its own, yet while querying or analysing transaction numbers or ticket numbers we can slice by it like a dimension. That absence of its own table and attributes is the difference between a normal dimension and a degenerate dimension.

76. What are Role-Playing Dimensions?

Ans. Dimensions are often recycled for multiple purposes within the same database. A Date dimension can be used for Date_of_Hire, Date_of_Joining etc., so it is a role-playing dimension.

77. Explain Slowly Changing Dimensions? 

Ans. Slowly changing dimensions are dimensions whose data changes slowly, rather than on a regular time-based schedule.

1) Type 0 - No action is taken.

2) Type 1 - Overwrites old data with new data and therefore does not maintain history.

3) Type 2 - Maintains history by inserting multiple records for a natural key with different surrogate keys (a running number against the natural key of the table) and having version numbers or start date/end date columns (see the sketch after this list).

4) Type 3 - Contains a column in the table which represents the original state; it holds the first value of the attribute, whereas a current-state column holds the present value. This does not maintain history of the values in between the original and current states.

5) Type 4 - Keeps a table which holds the current record and a separate history table with a create date column.

6) Type 6 - Combines the approaches of types 1, 2 and 3. It maintains individual records having start date, end date and a flag; the current record's flag is set to Y, the rest are set to N.
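For illustration, a minimal T-SQL sketch of a Type 2 load (the DimCustomer and StgCustomer tables, the City attribute and the column names are assumptions, not from the original post):

-- Step 1: close out the current version of every changed natural key.
UPDATE d
SET EndDate = GETDATE(), CurrentFlag = 'N'
FROM DimCustomer d
JOIN StgCustomer s ON s.CustomerNK = d.CustomerNK
WHERE d.CurrentFlag = 'Y' AND d.City <> s.City;

-- Step 2: insert a new current version for changed and brand-new keys.
-- (After step 1, changed keys no longer have a current row, so the
-- LEFT JOIN picks them up together with keys never seen before.)
INSERT INTO DimCustomer (CustomerNK, City, StartDate, EndDate, CurrentFlag)
SELECT s.CustomerNK, s.City, GETDATE(), NULL, 'Y'
FROM StgCustomer s
LEFT JOIN DimCustomer d
       ON d.CustomerNK = s.CustomerNK AND d.CurrentFlag = 'Y'
WHERE d.CustomerNK IS NULL;

The surrogate key would be an IDENTITY column on DimCustomer, which gives the running number against the natural key mentioned above.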

78. What are Factless Fact tables or Bridge Tables?

Ans. If a fact table holds many-to-many relationships between dimensions with no figures, measures or values, it is known as a factless fact or bridge table. This table is needed to bridge the relationships between dimensions. For example, suppose we have a Product dimension which contains the product key and product-related information, and an Offer dimension which contains the offer key and discount-related information, its timing and pricing. We can make a bridge table ProductOffer which stores the products that have special offers. If the fact stored only Product, Offer and Sale as columns, we would surely miss the information about products which had special offers but no sale.

79. What are Snapshot and Transaction fact tables?

Ans. A transaction fact table records events, and for each event certain measurements or values are recorded; so when we record a sale we create a whole new row for the information relating to that sale. A snapshot fact table records the state of something at different points in time, for example the sales of each product every month. A snapshot fact table therefore keeps aggregated information, which results in fewer data rows, smaller cubes, faster cube processing and faster querying; but it has no information at the individual transaction level.

80. What is UDM (Unified Dimensional Modelling)?

Ans. It provides a bridge between the user and the data sources. A UDM is constructed over one or more physical data sources, and then the user issues queries against the UDM.

81. What is the two-tier architecture of SSAS?

Ans. SSAS uses both server and client components to supply OLAP and data mining functionality.

1) The server component is implemented as a Microsoft Windows service.

2) Clients interact with the server using the standard XMLA (XML for Analysis) protocol.

82. Difference between Derived measure and Calculated measure? 

Ans. 1) A derived measure is calculated before aggregations are created, and its values are stored in the cube.

2) A calculated measure is calculated after aggregations are created, and its values are not stored in the cube.

83. What is a Cube?

Ans. Cubes are a logical representation of multidimensional data. A cube stores large amounts of historical data, also shows point-in-time data, exposes member-level data, and its security can be very detailed.

84. Differences between OLTP and OLAP 

Ans. 1) OLTP contains the source information of a system; OLAP contains the data derived from OLTP.

2) OLTP controls and runs the fundamental business tasks like insertion, updation etc., but OLAP is meant for planning, analysis and decision support.

3) OLTP performs transactions like insert, update, delete etc.; OLAP performs querying and analysis of the aggregated data of OLTP.

4) OLTP basically uses smaller and less complex queries, but OLAP uses complex queries.

5) OLTP is normalized to 3NF, but OLAP is denormalized into a star or snowflake schema.

6) OLTP uses a relational schema, but OLAP uses a multidimensional schema.

7) Backup of OLTP is very important, but in the case of OLAP we can always rebuild it from OLTP.

8) Data in OLTP can be seen instantly, but in OLAP a batch or scheduler refreshes the data.

85. Star Schema vs Snowflake Schema 

Ans. Snowflake schema: no redundancy, normalized, more complex queries, more joins, more FKs; to be used when a dimension table is very big, since normalizing it saves space.

Star schema: redundant data, denormalized, less complex queries, fewer joins; the usual choice for large DWs where query performance matters more than storage.

86. What is Normalization?

Ans. Normalization is a mathematical model for evaluating the relational quality of a data model. The purpose is to reduce update anomalies.

87. What is First Normal Form (1NF)?

Ans. The 1st Normal Form states that:

1) Every unit of data should be represented by a scalar value.

2) Each and every attribute, i.e. column, should be unique in an entity.

3) Each and every tuple should be unique in an entity.

Consider a case where a table has columns basecamp, tour1, tour2, tour3. This table violates 1NF because several attributes hold the same kind of value (a repeating group). We can make two tables, one having basecampid and basecamp and the other having basecampid and tour, to bring it to 1NF, as sketched below.
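A hedged DDL sketch of that split (data types are assumptions):

CREATE TABLE BaseCamp (
    BaseCampId   INT PRIMARY KEY,
    BaseCampName VARCHAR(50) NOT NULL          -- one scalar value per column
);

CREATE TABLE Tour (
    BaseCampId INT NOT NULL REFERENCES BaseCamp(BaseCampId),
    TourName   VARCHAR(50) NOT NULL,
    PRIMARY KEY (BaseCampId, TourName)         -- one unique row per tour
);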

88. What is Second Normal Form (2NF)?

Ans. The 2nd Normal Form states that:

1) The entity should be in 1NF, and

2) There should be no partial dependency on the primary key. Suppose in the basecamp-tour table we have columns basecampid, tour and the phone number of the base camp, with a composite primary key of basecampid and tour. The phone number depends on the base camp alone, not on the tour, so this violates 2NF. We can make two tables: basecamp-tour having basecampid and tour, and a second table having basecampid and phonenumber.

89. What is Third Normal Form (3NF)?

Ans. The 3rd Normal Form states that:

1) The entity should be in 2NF, and

2) There should be no transitive dependency on the primary key. Suppose there is a table basecamp-tour with columns Basecamp_id, Tour, Guide and GuideDateofHire. Here GuideDateofHire depends on Guide, which in turn depends on Basecamp_id; that is why this violates 3NF. To resolve it, we can make another table Guide having Guide and GuideDateofHire, as sketched below.
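A hedged sketch of that resolution (data types are assumptions):

CREATE TABLE Guide (
    Guide           VARCHAR(50) PRIMARY KEY,
    GuideDateofHire DATE NOT NULL              -- depends only on Guide
);

CREATE TABLE BasecampTour (
    Basecamp_id INT NOT NULL,
    Tour        VARCHAR(50) NOT NULL,
    Guide       VARCHAR(50) NOT NULL REFERENCES Guide(Guide),
    PRIMARY KEY (Basecamp_id, Tour)
);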

90. What is Boyce Codd Normal Form? 

Ans. It sits between the third and fourth normal forms. It handles entities which have two sets of candidate keys; in such cases BCNF states that the entity should be split into two entities, one for each key.



91. What is Fourth Normal Form? 

Ans. It reduces problems created by complex composite primary keys. If there are three attributes in an entity in which two of the attributes combine to form the composite primary key, but without the third attribute the row is not unique, then the entity violates 4NF. In such a case we should split the table into two separate tables.

92. What is Fifth Normal Form? 

Ans. 5NF provides the method for designing relationships that involve three or more entities. A three-way or ternary relationship, if properly designed, is in fifth normal form.

93. What are the Recovery Models in SQL Server 2008?

Ans. Simple, Full and Bulk-Logged.

ALTER DATABASE <databasename> SET RECOVERY SIMPLE;  -- or FULL or BULK_LOGGED

94. What is the Simple Recovery Model?

Ans. 1) SQL Server stores only minimal information in the transaction log.

2) The changes made to the database are written to the data files and to the log.

3) But the log records are not kept permanently; the log is truncated regularly.

4) It should be used on a database meant for development or test, where there is a reduced need for frequent database backups.

Backup options for this model are Full and Differential.

95. What is the Full Recovery Model?

Ans. This allows us to recover everything; it can use the fullest set of backup options. Every insert, update and delete is logged in the transaction log, and committed changes are removed from the log only once the log has been backed up. It allows point-in-time recovery, meaning we can recover the database to a particular point in time. Backup options for this model are Full, Differential and Transaction Log.

96. What is the Bulk-Logged Recovery Model?

Ans. This is the same as the Full recovery model, but it is used in cases of bulk operations like CREATE INDEX, SELECT INTO and BULK INSERT. All the bulk operations are written minimally to the transaction log, thereby improving performance. Backup options for this model are Full, Differential and Transaction Log.

97. What are the Backup types?

Ans. Full, Differential and Transaction Log.

98. What is a Full Backup?

Ans. A full backup restores all our data, but it will not restore the data to a specific point in time.

99. What is a Differential Backup?

Ans. A differential backup copies all the data changes since the last full backup, so these backups are usually faster and smaller than full backups. A full backup creates a checkpoint, meaning that the differential backup starts from that point.

100. What is a Transaction Log Backup?

Ans. A transaction log backup copies all the data changes since the last full backup or the previous transaction log backup. They are fast and small, but the demerit comes at recovery time, when the whole chain of transaction log backups has to be restored. Example commands for all three backup types are shown below.
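Hedged T-SQL examples (the database name and file paths are illustrative):

BACKUP DATABASE SalesDW TO DISK = 'D:\Backup\SalesDW_full.bak';                   -- full
BACKUP DATABASE SalesDW TO DISK = 'D:\Backup\SalesDW_diff.bak' WITH DIFFERENTIAL; -- differential
BACKUP LOG SalesDW TO DISK = 'D:\Backup\SalesDW_log.trn';                         -- transaction log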

101. What is MSDTC? 

Ans. MSDTC is a transaction manager which allows a client application to include several different sources of data within one transaction.

For example, SSIS can include data from different data sources in one transaction to maintain the ACID properties.

102. What is ACID? 

Ans. Atomic, Consistent, Isolated and Durable.

103. What is a Report Subscription?

Ans. A subscription is a standing request to deliver a report at a specific time or in response to a specific event, in an application file format that you specify in the subscription.

Two types - Standard subscription and Data-driven subscription.

Standard subscription - consists of static values and cannot be modified during subscription processing.

Data-driven subscription - gets subscription information at run time by querying an external data source that provides the values used to specify recipient, parameters or application format.

104. How to get XML from a select query?

Ans. <Select query> FOR XML RAW/AUTO/PATH/EXPLICIT;

select * from emp for xml raw

 

<row EmpName="Shakti" DeptID="2" Sal="3000"/>
<row EmpName="Praveen" DeptID="2" Sal="4000"/>
<row EmpName="Rohit" DeptID="1" Sal="2000"/>
<row EmpName="Wamsi" DeptID="1" Sal="1500"/>
<row EmpName="Kapil" DeptID="2" Sal="1300"/>
<row EmpName="XXX" DeptID="10" Sal="1000"/>
<row EmpName="Rahul" DeptID="1" Sal="1000"/>

 

select * from emp for xml auto

<emp EmpName="Shakti" DeptID="2" Sal="3000"/>
<emp EmpName="Praveen" DeptID="2" Sal="4000"/>
<emp EmpName="Rohit" DeptID="1" Sal="2000"/>
<emp EmpName="Wamsi" DeptID="1" Sal="1500"/>
<emp EmpName="Kapil" DeptID="2" Sal="1300"/>
<emp EmpName="XXX" DeptID="10" Sal="1000"/>
<emp EmpName="Rahul" DeptID="1" Sal="1000"/>

 

select * from emp for xml path

 

<row><EmpName>Shakti</EmpName><DeptID>2</DeptID><Sal>3000</Sal></row>
<row><EmpName>Praveen</EmpName><DeptID>2</DeptID><Sal>4000</Sal></row>
<row><EmpName>Rohit</EmpName><DeptID>1</DeptID><Sal>2000</Sal></row>
<row><EmpName>Wamsi</EmpName><DeptID>1</DeptID><Sal>1500</Sal></row>
<row><EmpName>Kapil</EmpName><DeptID>2</DeptID><Sal>1300</Sal></row>
<row><EmpName>XXX</EmpName><DeptID>10</DeptID><Sal>1000</Sal></row>
<row><EmpName>Rahul</EmpName><DeptID>1</DeptID><Sal>1000</Sal></row>

 

select 1 as Tag,
       NULL as Parent,
       EmpName as [Employee!1!EmpName],
       NULL as [DeptID!2!DeptID],
       NULL as [Sal!3!SAL]
from emp
union
select 2 as Tag,
       1 as Parent,
       EmpName as [Employee!1!EmpName],
       DeptID as [DeptID!2!DeptID],
       null as [Sal!3!SAL]
from emp
union
select 3 as Tag,
       2 as Parent,
       EmpName as [Employee!1!EmpName],
       DeptID as [DeptID!2!DeptID],
       Sal as [Sal!3!SAL]
from emp
order by [Employee!1!EmpName], Tag
for xml explicit

 

<Employee EmpName="Kapil"><DeptID DeptID="2"><Sal SAL="1300"/></DeptID></Employee>
<Employee EmpName="Praveen"><DeptID DeptID="2"><Sal SAL="4000"/></DeptID></Employee>
<Employee EmpName="Rahul"><DeptID DeptID="1"><Sal SAL="1000"/></DeptID></Employee>
<Employee EmpName="Rohit"><DeptID DeptID="1"><Sal SAL="2000"/></DeptID></Employee>
<Employee EmpName="Shakti"><DeptID DeptID="2"><Sal SAL="3000"/></DeptID></Employee>
<Employee EmpName="Wamsi"><DeptID DeptID="1"><Sal SAL="1500"/></DeptID></Employee>
<Employee EmpName="XXX"><DeptID DeptID="10"><Sal SAL="1000"/></DeptID></Employee>

 

The values in the Tag and Parent meta columns, the information provided in the column names, and the correct ordering of the rows together produce the XML you want when you use EXPLICIT mode.

105. What is SET XACT_ABORT { ON | OFF }? 

Ans. Specifies whether SQL Server automatically rolls back the current transaction when a Transact-SQL statement raises a run-time error.
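A small example using the emp table from question 104:

SET XACT_ABORT ON;  -- a run-time error now rolls back the whole transaction
BEGIN TRANSACTION;
    INSERT INTO emp (EmpName, DeptID, Sal) VALUES ('Test', 1, 1000);
    -- If any statement in this batch raises a run-time error,
    -- the entire transaction is rolled back automatically.
COMMIT TRANSACTION;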

106. What is a Computed Column?

Ans. A computed column is computed from an expression that can use other columns in the same table. The expression can be a noncomputed column name, constant, function, and any combination of these connected by one or more operators. The expression cannot be a subquery.

 

For example, in the AdventureWorks2008R2 sample database, the TotalDue column of the Sales.SalesOrderHeader table has the definition: TotalDue AS Subtotal + TaxAmt + Freight.

 

Unless otherwise specified, computed columns are virtual columns that are not physically stored in the table; their values are recalculated every time they are referenced in a query. Specifying the PERSISTED keyword in the CREATE TABLE or ALTER TABLE statement makes the Database Engine physically store the computed column in the table, as in the sketch below.
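For example, a hedged sketch modelled on the TotalDue definition above (the data types are assumptions):

CREATE TABLE SalesOrderHeader (
    SalesOrderID INT IDENTITY PRIMARY KEY,
    SubTotal     MONEY NOT NULL,
    TaxAmt       MONEY NOT NULL,
    Freight      MONEY NOT NULL,
    -- computed from the other columns; PERSISTED stores it physically
    TotalDue AS (SubTotal + TaxAmt + Freight) PERSISTED
);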

Computed columns can be used in select lists, WHERE clauses, ORDER BY clauses, or any other locations in which regular expressions can be used, with the following exceptions:

 

    Computed columns used as CHECK, FOREIGN KEY, or NOT NULL constraints must be marked PERSISTED. A computed column can be used as a key column in an index or as part of any PRIMARY KEY or UNIQUE constraint if the computed column value is defined by a deterministic expression and the data type of the result is allowed in index columns.

 

    For example, if the table has integer columns a and b, the computed column a + b can be indexed, but computed column a + DATEPART(dd, GETDATE()) cannot be indexed because the value may change in subsequent invocations.

 

    A computed column cannot be the target of an INSERT or UPDATE statement.

Bulk Upsert operation to Salesforce from an SSIS package using Data Loader

Introduction:

In this post I will explain how to push data into Salesforce from an SSIS package using the Bulk API. I will demonstrate how to pull the data from SQL Server or any other database; the same tool can also load data from a CSV file.
 

Technology used:

SSIS, Salesforce Data Loader tool.
 

Pre-requisites:

The Salesforce Data Loader tool and SSIS must be installed on your system.
 

Overview:

Data Loader has a command line utility called process.bat. This utility can load data into Salesforce from different source systems like CSV files, Oracle, SQL Server etc.

To import data from a database source system, the corresponding JDBC driver is required to establish the connection.

Two XML files are required:
  1. process-conf.xml
  2. database-conf.xml
The database-conf.xml file holds the source-related information and process-conf.xml contains the Salesforce-related information.

The SSIS package calls process.bat, which in turn uses process-conf.xml and database-conf.xml to load the data into Salesforce.
 

Process

 
Step 1
Install the Apex Data Loader and go to the following path:
C:\Program Files (x86)\salesforce.com\Data Loader\bin
You will find the following bat files:
1.      encrypt.bat
2.      process.bat
Get the JDBC driver for your source system. For example, for SQL Server the driver class is com.microsoft.sqlserver.jdbc.SQLServerDriver.
Then modify process.bat in the following way:
@echo off
if not [%1]==[] goto run
echo.
echo Usage: process ^<configuration directory^> ^[process name^]
echo.
echo      configuration directory -- directory that contains configuration files,
echo          i.e. config.properties, process-conf.xml, database-conf.xml
echo.
echo      process name -- optional name of a batch process bean in process-conf.xml,
echo          for example:
echo.
echo              process ../myconfigdir AccountInsert
echo.
echo          If process name is not specified, the parameter values from config.properties
echo          will be used to run the process instead of process-conf.xml,
echo          for example:
echo.
echo              process ../myconfigdir
echo.
 
goto end
 
:run
set PROCESS_OPTION=
if not [%2]==[] set PROCESS_OPTION=process.name=%2
 
..\Java\bin\java.exe -cp ..\dataloader-27.0.1-uber.jar;"C:\Work Area\JDBC Driver\Microsoft JDBC Driver 4.0 for SQL Server\sqljdbc_4.0\enu\sqljdbc4.jar" -Dsalesforce.config.dir=%1 com.salesforce.dataloader.process.ProcessRunner %PROCESS_OPTION%
 
:end
 
The addition is the JDBC driver jar included in the classpath (-cp) of the java.exe invocation.
Step 2
Create an XML file named database-conf.xml. Put the following XML code inside it and provide the following information about your source:
1.      Connection URI
2.      User ID
3.      Password
4.      Source query
5.      Source columns
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
    <bean id="dbDataSource"
          class="org.apache.commons.dbcp.BasicDataSource"
          destroy-method="close">
        <property name="driverClassName" value="com.microsoft.sqlserver.jdbc.SQLServerDriver"/>
        <property name="url" value="YOUR CONNECTION URI"/>
        <property name="username" value="UID"/>
        <property name="password" value="PASSWORD"/>
    </bean>
    <bean id="queryAccount"
          class="com.salesforce.dataloader.dao.database.DatabaseConfig"
          singleton="true">
        <property name="sqlConfig" ref="queryAccountSql"/>
        <property name="dataSource" ref="dbDataSource"/>
    </bean>
    <bean id="queryAccountSql"
          class="com.salesforce.dataloader.dao.database.SqlConfig" singleton="true">
        <property name="sqlString">
            <value>YOUR QUERY</value>
        </property>
        <property name="columnNames">
            <list>
                <value>Column0</value>
                <value>Column1</value>
            </list>
        </property>
    </bean>
</beans>
Step 3
Create an encrypted Salesforce password file by using the following command:
encrypt.bat -e <password> "<filepath>\key.txt"
Step 4
Create a mapping file between the source and target columns. The file should have the .sdl extension.
Please find a sample mapping file below:
#Mapping values
#Mon Jan 23 17:07:24 PST 2006
SLA__c=SLA__C
BillingCity=BILLINGCITY
=SYSTEMMODSTAMP
OwnerId=OWNERID
CustomerPriority__c=CUSTOMERPRIORITY__C
AnnualRevenue=ANNUALREVENUE
Description=DESCRIPTION
BillingStreet=BILLINGSTREET
ShippingState=SHIPPINGSTATE
=CREATEDBYID
=CREATEDDATE
Website=WEBSITE
BillingState=BILLINGSTATE
=LASTMODIFIEDBYID
=LASTMODIFIEDDATE
Phone=PHONE
NumberOfEmployees=NUMBEROFEMPLOYEES
Type=TYPE
ShippingCity=SHIPPINGCITY
Rating=RATING
Industry=INDUSTRY
ShippingStreet=SHIPPINGSTREET
UpsellOpportunity__c=UPSELLOPPORTUNITY__C
AccountNumber=ACCOUNTNUMBER
TickerSymbol=TICKERSYMBOL
Id=ID
BillingPostalCode=BILLINGPOSTALCODE
Site=SITE
ParentId=PARENTID
Name=NAME
BillingCountry=BILLINGCOUNTRY
SLAExpirationDate__c=SLAEXPIRATIONDATE__C
ShippingCountry=SHIPPINGCOUNTRY
Fax=FAX
Sic=SIC
NumberofLocations__c=NUMBEROFLOCATIONS__C
ShippingPostalCode=SHIPPINGPOSTALCODE
Active__c=ACTIVE__C
Ownership=OWNERSHIP
SLASerialNumber__c=SLASERIALNUMBER__C
Step 5
Create another XML file named process-conf.xml. This file contains all the Salesforce-related information. Following is the content of the process-conf.xml file:
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
    <bean id="PanelBookingUpsertfromDB"
          class="com.salesforce.dataloader.process.ProcessRunner"
          singleton="false">
        <description>Sample 'upsert'.</description>
        <property name="name" value="PanelBookingUpsertfromDB"/>
        <property name="configOverrideMap">
            <map>
                <entry key="sfdc.debugMessages" value="true"/>
                <entry key="sfdc.debugMessagesFile" value="LOG PATH"/>
                <entry key="sfdc.endpoint" value="Salesforce site address"/>
                <entry key="sfdc.username" value="Salesforce ID"/>
                <!-- The password below has been encrypted using the key file, therefore it will
                     not work without the key setting process.encryptionKeyFile. Generate the
                     real value using the encrypt.bat utility. -->
                <entry key="sfdc.password" value="Place your encrypted value"/>
                <entry key="process.encryptionKeyFile" value="Encryption key path"/>
                <entry key="sfdc.timeoutSecs" value="600"/>
                <entry key="sfdc.loadBatchSize" value="10000"/>
                <entry key="dataAccess.writeBatchSize" value="500"/>
                <entry key="sfdc.bulkApiCheckStatusInterval" value="5000"/>
                <entry key="sfdc.useBulkApi" value="true"/>
                <entry key="sfdc.externalIdField" value="Salesforce external ID field"/>
                <entry key="sfdc.entity" value="Salesforce entity name"/>
                <entry key="process.operation" value="upsert"/>
                <entry key="process.mappingFile" value="Mapping file path"/>
                <entry key="dataAccess.name" value="queryAccount"/>
                <entry key="dataAccess.type" value="databaseRead"/>
                <entry key="process.outputSuccess" value="Success CSV path"/>
                <entry key="process.outputError" value="Error CSV path"/>
                <entry key="process.initialLastRunDate" value="2005-12-01T00:00:00.000-0800"/>
            </map>
        </property>
    </bean>
</beans>
Step 6
Put the following command inside a batch file and execute that batch file from the SSIS package using an Execute Process Task, which in turn loads the data into Salesforce.
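A hedged sketch of such a batch file, based on the usage text built into process.bat shown in Step 1 (the configuration directory C:\DataLoaderConfig is an assumption; the process name is the bean id defined in process-conf.xml):

@echo off
rem First argument: the folder containing process-conf.xml and database-conf.xml.
rem Second argument: the process bean name from process-conf.xml.
cd /d "C:\Program Files (x86)\salesforce.com\Data Loader\bin"
call process.bat "C:\DataLoaderConfig" PanelBookingUpsertfromDB

In the Execute Process Task, point the Executable property at this batch file.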