Saturday, 28 September 2013

Splitting a Flat files to small files using SSIS Package

Introduction:

Its a common scenario when you have a huge flat file in GB containing millions of records and you want to split it into small files. There are different ways of doing the same. I will demonstrate you how you can spit the file by using SSIS.

Method-1: By using For each loop, Conditional split and Row Number

Control Flow

 
 As you can see there are two Data Flow Tasks, one for loop container and two flat file connection managers are present.
 

Data Flow Task 1: DFT To get Source Count

 
This data flow is just to get the record count of the Big flat file. It has a Flat file source that reads data from the Big flat file and counts the row by using a row count transformation.
 

Variables

 
 
These are 4 variables:
  1. Counter: Used as the Counter in for loop
  2. endCounter: Upper bound of the counter in for loop
  3. reCount: Store the record count of the Big flat file
  4. splitCount: Count of record to be present in each split files

For Loop Container

 
This the condition placed inside for loop container.
 

Data Flow Task 2: Data File Splitter

 
As you can see it fetches the data from the Big flat file and passes it through a Script Component which attach a row number to each of the records.
 
This is the code present inside the Script component:
 
 
Once the row number is attached a conditional split transformation spits the file according to the row number. Following is the condition present inside the conditional split transformation:
 
 
 
After the file get spitted for the first cut it is being saved by a flat file destination. For each cut the flat file connection string changes dynamically and the flat file is saved with a different name. Following is the expression written inside the flat file destination for the Connection string property.

 

Pro

Simple easy to develop.
 

Con

Takes a lot of time.

Method-2: By using Script Component

 Control Flow

 
 Here as you can see there is only one data flow and inside that one flat file source is present. This flat file source fetches all the records from the file and passes it to a script component. The Script component splits the file into multiple flat files. Following code is present inside that:
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO;
#endregion
 
 [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    string copiedAddressFile;
    private StreamWriter textWriter;
    private string columnDelimiter = ",";
    private string filepath = @"C:\Work Area\Innovation\Flatfile spit on a fly\test.txt";
    private string[] columns;
 
 
    public override void PreExecute()
    {
       
        IDTSInput100 input = ComponentMetaData.InputCollection[0];
        columns = new string[input.InputColumnCollection.Count];
        for (int i = 0; i < input.InputColumnCollection.Count; i++)
        columns[i] = input.InputColumnCollection[i].Name;
    }
 
    public override void PostExecute()
    {
       
    }
 
   
    public override void AcquireConnections(object Transaction)
    {
       
        
    }
 
    public override void ProcessInput(int InputID, Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer buffer)
    {
        while (buffer.NextRow())
        {
            if (buffer[columns.Length-1].ToString() == "1")
        {
            filepath = @"C:\Work Area\Innovation\Flatfile spit on a fly\test" + Convert.ToString(Convert.ToInt32(buffer[columns.Length - 1]) - 1) + ".txt";
            textWriter = new StreamWriter(filepath, false);
            for (int i = 0; i < columns.Length; i++)
            {
                textWriter.Write(columns[i]);
                if (i != columns.Length - 1)
                    textWriter.Write(columnDelimiter);
            }
            textWriter.WriteLine();
        }
            if (Convert.ToInt32(buffer[columns.Length - 1]) % 2000000 == 1)
        {
            textWriter.Close();
            filepath = @"C:\Work Area\Innovation\Flatfile spit on a fly\test" + Convert.ToString(Convert.ToInt32(buffer[columns.Length - 1]) - 1) + ".txt";
            textWriter = new StreamWriter(filepath, false);
            for (int i = 0; i < columns.Length; i++)
            {
                textWriter.Write(columns[i]);
                if (i != columns.Length - 1)
                    textWriter.Write(columnDelimiter);
            }
            textWriter.WriteLine();
        }
 
            for (int j = 0; j < columns.Length; j++)
            {
                textWriter.Write(buffer[j]);
                if (j != columns.Length-1)
                    textWriter.Write(columnDelimiter);
                else
                    textWriter.WriteLine();
            }
 
            if (buffer[columns.Length - 1].ToString() == "10000000000000")
            textWriter.Close();
        }
    }
}

Pros

Faster to split the file.

16 comments:

  1. Hi you are doing an excellent job. I was looking for this information. I discovered it on your web page it’s really awesome. I like that information. I am sure that these are your own views. They are really awesome.

    Property in Nehru Nagar Bhopal

    ReplyDelete
  2. Good POC Shakti. It is really helpful.

    ReplyDelete
  3. do you have this sample package available for learning

    ReplyDelete
    Replies
    1. Please refer : http://radhajyotsna0.blogspot.in/2014/09/execute-sql-task.html

      Delete
    2. Hi Shakti

      Please send me the sample of above solutions. It would be a great help to me

      Delete
    3. Here you go..
      https://drive.google.com/file/d/0BzfpMhz6kh76cXI4NVpGdUVLMzg/view?usp=sharing

      Delete
    4. Hi Shakti,

      Thanks for the reply.But package given by you is not opening.

      Thanks,Prabhat

      Delete
    5. Are you using SSIS 2012 version as it is created in SSIS 2012? I can able to open it using SSDT 2012.

      Delete
    6. In that case change the C# code present above to .Net and use it inside a script component of SSIS 2008 R2 package. That should work.

      Delete
  4. Thank you so much for providing information about SSIS and other utilities.I really feel that their is a need to utilise more of its aspects.

    SSIS Upsert

    ReplyDelete
  5. That's just what I need. I like to understand Method 2. What does it do ? How many files will it split into ? Is it based on size of the file of the number of rows ?

    ReplyDelete
  6. Hello, Thanks for the post. We should add 1 to the endCounter variable expression so that the extraction will have batch file for the remaining files. For example, if we have total records 105,000 & split count 50,000 we will have only 2 batch files with the expression. If we add 1 in the expression, the 3rd batch file will be created with 5000 records.

    ReplyDelete