Tuesday, August 2, 2016

ETL with Entity Framework 6.0

In this post I’m going to walk step-by-step through a number of performance improvements to a recent ETL process I worked on. The scenario here is loading folder metadata into a reporting database. Note that the use of EF or any ORM will add some overhead when compared to pure ADO.NET, although its ease of use is clearly beneficial.

Process overview

The ETL process works as follows.

1. Extract

  • The source data is read into a strongly-typed in-memory collection of type SourceFolderData
  • The collection is called folders
  • The extract phase is not included in this post, so consider the folders object to be prepopulated with folder metadata for around 10,000 folders, enough to see the impact of some performance tweaks

2. Transform

  • The wider context of this application used EF Code First to build the reporting database, so there is an Entities project containing a class for each entity in the application
  • The source type is SourceFolderData
  • The destination type is Folder

3. Load

  • The DbContext used by EF is called context (a sketch of the assumed entity and context classes follows this list)
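
    The entity classes and DbContext aren't shown in the post, so here is a minimal sketch of what they might look like, with names taken from the code below and properties inferred from the mapping in iteration 1 (the real Entities project may well differ):

    using System;
    using System.Data.Entity;   // EF 6

    // Hypothetical source record produced by the extract phase
    public class SourceFolderData
    {
        public string FolderPath { get; set; }
        public string Owner { get; set; }
        public DateTime LastChangeDate { get; set; }
        public DateTime LastAccessDate { get; set; }
        public DateTime CreationDate { get; set; }
        public long SizeIncSubfolders { get; set; }
        public int FileCount { get; set; }
        public int SubFolderCount { get; set; }
        public string Permissions { get; set; }
    }

    // Hypothetical destination entity mapped by EF Code First
    public class Folder
    {
        public int Id { get; set; }
        public string FolderPath { get; set; }
        public string Owner { get; set; }
        public DateTime LastChangeDate { get; set; }
        public DateTime LastAccessDate { get; set; }
        public DateTime CreationDate { get; set; }
        public long SizeIncSubfolders { get; set; }
        public int FileCount { get; set; }
        public int SubFolderCount { get; set; }
        public string PermissionsBlob { get; set; }
        public DateTime CreatedOn { get; set; }
    }

    // The context used in the load code below
    public class FolderContext : DbContext
    {
        public DbSet<Folder> Folders { get; set; }
    }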

    Iteration 1: the basic approach

    Here the code simply attempts to load all the data in one transaction.

    foreach (var f in folders)
    {
        var folder = new Folder
        {
            FolderPath = f.FolderPath.Substring(0, f.FolderPath.Length - 1),
            Owner = f.Owner,
            LastChangeDate = f.LastChangeDate,
            LastAccessDate = f.LastAccessDate,
            CreationDate = f.CreationDate,
            SizeIncSubfolders = f.SizeIncSubfolders,
            FileCount = f.FileCount,
            SubFolderCount = f.SubFolderCount,
            PermissionsBlob = f.Permissions,
            CreatedOn = DateTime.Now
        };

        context.Folders.Add(folder);
    }

    context.SaveChanges();

    As you'd expect, this works fine for a handful of items but quickly breaks down when a significant number of items are loaded. For the 10,000 test items, the performance was as follows.

    Time (ms)    Max memory (MB)    Memory trend
    610,602      166.8              Slowly increasing

    This equates to over 10 minutes, which is not acceptable.
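
    The post doesn't show how these figures were captured; one way to reproduce them is a small harness that times the load and reads the process's peak working set. This is an assumption on my part rather than the author's method, and LoadTimer below is a hypothetical helper:

    using System;
    using System.Diagnostics;

    public static class LoadTimer
    {
        // Hypothetical measurement harness; pass in the load code for the iteration being tested.
        public static void Measure(Action loadAction)
        {
            var stopwatch = Stopwatch.StartNew();

            loadAction();   // e.g. the foreach/SaveChanges block above

            stopwatch.Stop();
            var peakMb = Process.GetCurrentProcess().PeakWorkingSet64 / (1024.0 * 1024.0);
            Console.WriteLine("Time (ms): {0:N0}, peak memory (MB): {1:N1}",
                              stopwatch.ElapsedMilliseconds, peakMb);
        }
    }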

    Iteration 2: turning off AutoDetectChanges

    The next improvement is to set the context's Configuration.AutoDetectChangesEnabled property to false. From MSDN:

    “Detect Changes works by detecting differences between the current property values of the entity and the original property values that are stored in a snapshot when the entity was queried or attached.”

    Significant performance improvements can be seen by turning this off.

    context.Configuration.AutoDetectChangesEnabled = false;
    foreach (var f in folders)
    {
        ...
    }

    context.SaveChanges();

    The performance improvement is immediately obvious.

    Time (ms)    Max memory (MB)    Memory trend
    17,124       92.7               Slowly increasing

    Down from 10 minutes to 17 seconds, and a reduction in memory usage by over 44%.

    Iteration 3: batching inserts

    Another improvement is to batch the inserts and call SaveChanges() several times instead of just once.

    context.Configuration.AutoDetectChangesEnabled = false;
    int folderCount = 0;

    foreach (var f in folders)
    {
        ...
       
        context.Folders.Add(folder);
       
        if (folderCount % 500 == 0 && folderCount > 0)
        {
            context.SaveChanges();
        }
        folderCount++;
    }

    context.SaveChanges();

    Execution time improves again, but memory usage worsens, and the trend suggests usage would continue rising as the number of items being processed increases.

    Time (ms)    Max memory (MB)    Memory trend
    9,144        117.2              Increasing

    Iteration 4: recreating the DbContext

    The next improvement to show in this post is disposing of and recreating the DbContext instance after each SaveChanges().

    context.Configuration.AutoDetectChangesEnabled = false;
    int folderCount = 0;

    foreach (var f in folders)
    {
        ...
       
        context.Folders.Add(folder);
       
        if (folderCount % 500 == 0 && folderCount > 0)
        {
            context.SaveChanges();
            context.Dispose();
            context = new FolderContext();
            // Remember to set AutoDetectChangesEnabled to false again on the new context!
            context.Configuration.AutoDetectChangesEnabled = false;
        }
        folderCount++;
    }

    context.SaveChanges();

    An improvement on the overall time is seen along with a mild memory improvement.

    Time (ms)    Max memory (MB)    Memory trend
    6,514        112.4              Increasing

    Memory is seen to be increasing, however, so the performance would degrade as more items are processed.

    Iteration 5: adjusting the batch size

    Finally, by adjusting the batch size, an optimal level can be found where time, memory usage and the memory trend are all stable. The following table shows the performance as the batch size increases.

    Batch size    Time (ms)    Max memory (MB)    Memory trend
    100           6,489        86.5               Stable
    200           6,227        87.2               Stable
    300           6,485        93.1               Stable
    400           6,248        101.0              Stable
    500           6,514        112.4              Increasing
    600           6,587        107.2              Increasing
    700           6,717        117.8              Increasing
    800           6,780        114.9              Increasing
    900           6,503        114.6              Increasing
    1000          6,847        124.9              Increasing

    A batch size of 200 gave the best result for this particular process (on my dev laptop). The interesting thing to see was the stability in memory usage with small batch sizes. This gives me confidence to throw larger data sets at the process and not see memory usage get out of control.
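
    Putting the pieces together, a consolidated sketch of the final approach (AutoDetectChanges off, batched saves, a fresh context per batch, and the 200-item batch size that worked best here) looks roughly like this:

    const int batchSize = 200;   // tune this for your process and hardware

    var context = new FolderContext();
    context.Configuration.AutoDetectChangesEnabled = false;
    int folderCount = 0;

    foreach (var f in folders)
    {
        var folder = new Folder
        {
            FolderPath = f.FolderPath.Substring(0, f.FolderPath.Length - 1),
            // ... remaining property mapping as in iteration 1 ...
            CreatedOn = DateTime.Now
        };

        context.Folders.Add(folder);
        folderCount++;

        if (folderCount % batchSize == 0)
        {
            // Flush the batch and start again with a clean change tracker
            context.SaveChanges();
            context.Dispose();
            context = new FolderContext();
            context.Configuration.AutoDetectChangesEnabled = false;
        }
    }

    // Save any items remaining in the final partial batch
    context.SaveChanges();
    context.Dispose();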

    Conclusion

    Hopefully the real numbers in this post illustrate the difference that small adjustments to how EF is used can make to overall performance. These results will vary on different hardware, so it is worth adjusting the batch size to find the optimal level for your process in your environment. Finally, I'd recommend reading Tips on writing an EF based ETL by Isaac Abraham for some other good practices.

    Tuesday, June 30, 2015

    Upload files to Exchange Public Folders using PowerShell

    The following script can be used to upload files from the file system to a named folder in Exchange Public Folders. Each file will be uploaded in turn, and if successful, the source file deleted. Any files that fail to be uploaded are listed when the script completes, and remain in the file system.

    The Script

    The Public Folders store name (in the $pfStore line of the script) must be changed to the value you see in Outlook when browsing Public Folders. The other parameters to change are $source and $targetFolder.
    # Get Start Time
    Write-Host "-----------------------------"
    $start = (Get-Date)
    Write-Host ("Start time: " + $start)
    Write-Host "-----------------------------"
    # 0 Setup source and target (leave no trailing \)
    $source = "C:\Apps\Upload to PF\Batch 1"
    $targetFolder = "All Public Folders\Test\Batch 1"
    $failArray = @()
    # 1 Initiate connection to Outlook
    $outlook = new-object -comobject Outlook.Application
    $mapi = $outlook.GetNamespace("MAPI")
    # 2 Connect to PF
    $pfStore = $mapi.Session.Stores.Item("Public Folders - Jonny.Trees@youremailaddress.com")
    # 3 Connect to folder in PF
    $currentFolder = $pfStore.GetRootFolder()
    $folderPathSplit = $targetFolder.Split("\")
    for ($x = 0; $x -lt $folderPathSplit.Length; $x++)
    {
        $currentFolder = $currentFolder.Folders.Item($folderPathSplit[$x])
    }
    if ($currentFolder -eq $null)
    {
        ("Could not find folder: " + $targetFolder)
        return
    }
    Write-Host ""
    ("Found folder: " + $currentFolder.FolderPath)
    Write-Host ""
    # 4 Get the source files and volume
    $volume = ("{0:N2}" -f ((Get-ChildItem -path $source -recurse | Measure-Object -property length -sum ).sum /1GB) + " GB")
    $volumeMB = ("{0:N2}" -f ((Get-ChildItem -path $source -recurse | Measure-Object -property length -sum ).sum /1MB) + " MB")
    ("Volume to be moved: " + $volume + ", " + $volumeMB)
    Write-Host ""
    $source = $source + "\*"
    $sourceFiles = Get-ChildItem -path $source -include *.msg
    # 5 Copy the files
    $counter = 1
    $failCounter = 0
    foreach($file in $sourceFiles)
    {
        try
        {
            $item = $mapi.OpenSharedItem($file)
            Write-Host -nonewline ("Moving file " + $counter + " of " + $sourceFiles.Count + " ")
            [void]$item.Move($currentFolder)
            Remove-Item -literalpath $file
            Write-Host "Success"
        }
        catch
        {
            $failArray += $file
            $failCounter = $failCounter + 1
            Write-Host "Fail"
        }
        finally
        {
            $counter = $counter + 1
        }
    }
    [System.Media.SystemSounds]::Exclamation.Play();
    # Get End Time
    $end = (Get-Date)
    $scriptExecutionTime = $end - $start
    # Print results
    Write-Host ""
    Write-Host "-----------------------------"
    Write-Host ("End time: " + $end)
    Write-Host "-----------------------------"
    Write-Host ""
    Write-Host ("Fail count: " + $failCounter)
    Foreach ($file in $failArray)
    {
        Write-Host $file
    }
    Write-Host ""
    Write-Host ("" + ($counter - $failArray.Count -1 ) + " of " + ($counter - 1) + " files totalling " + $volume + " were uploaded successfully")
    Write-Host ("Script execution time: " + $scriptExecutionTime.hours + " hours, " + $scriptExecutionTime.minutes + " minutes and " + $scriptExecutionTime.seconds + " seconds")
    Write-Host ("Average time per file: " + [math]::Round($scriptExecutionTime.totalseconds/$sourceFiles.Count,1) + " seconds")

     

    Output

    The output of the script is as follows.
    -----------------------------
    Start time: 06/30/2015 11:32:32
    -----------------------------
    Found folder: \\Public Folders\All Public Folders\Test\Batch 1
    Volume to be moved: 0.02 GB, 20.03 MB
    Moving file 1 of 3 Success
    Moving file 2 of 3 Fail
    Moving file 3 of 3 Success

    -----------------------------
    End time: 06/30/2015 11:32:45
    -----------------------------
    Fail count: 1
    C:\Apps\Upload to PF\Batch1\Large file.msg
    2 of 3 files totalling 0.02 GB were uploaded successfully
    Script execution time: 0 hours, 0 minutes and 13 seconds
    Average time per file: 4.3 seconds
    In this case the file failed to upload because its size exceeded the limit allowed by Exchange. Note that the script's warm-up time inflates the 4.3-second average shown here; uploading batches of 5,000 items totalling 1-2 GB typically takes around 0.5 seconds per file in my production environment.

    Tuesday, February 18, 2014

    Very high opening/viewing numbers in audit log

    I recently had an issue where I was seeing crazy numbers of downloads from a document library in the site collection audit logs - 20k downloads from one user across a two-day period, when the library only contained 300 documents. The culprit? Outlook. Any client that allows a user to sync the contents of a document library will write entries into the audit log every time a sync occurs. Outlook is one application most users have that allows syncing offline, and it is initiated via the "Connect to Outlook" button in the Ribbon of the document library.

    Fortunately there's a workaround. In the Advanced Settings of the document library, the "Offline Client Availability" option can be set to prevent offline clients from syncing content from the library.

    If you make this change to a library:

    • New connections to the library can still be created by clicking "Connect to Outlook", but the sync will never succeed
    • Existing connections will still exist in Outlook, but attempts to sync will fail with a warning message
      • This message only appears briefly on-screen when you run a manual "Send / Receive"
      • When "Send / Receive" runs in the background, the user is not warned with the above message
      • There is also no visual indication that the library is no longer syncing, which may or may not be a good thing

    I also tested this with Colligo Email Manager, and unfortunately changing the "Offline Client Availability" setting has no effect: Colligo could still sync the contents of the library offline.

    Monday, June 17, 2013

    Posting multiple parameters to a Web API project using jQuery

    Ok this one had me stumped.  I was trying to pass multiple parameters to a Web API method using jQuery but no data would bind to my data transfer object when the method executed.

    *Note: once the issue discussed below is taken into account, this is also a good example of how to pass multiple parameters to a Web API method.

    The Controller

    public class UserDTO
    {
       public int userID;
       public string username;
    }
    
    [HttpPost]
    public string Post([FromBody]UserDTO userDTO)
    {
       return userDTO.userID.ToString() + " " + userDTO.username;
    }

    The Web API route config

    config.Routes.MapHttpRoute(
       name: "DefaultApi",
       routeTemplate: "api/{controller}/{id}",
       defaults: new { id = RouteParameter.Optional }
    );

    The jQuery

    var apiUrl = "http://localhost:55051/api/User";
    var userDTO = {
       userID: 5,
       username: "testuser"
    };
    
    $.ajax({
       type: "POST",
       url: apiUrl,
       data: JSON.stringify(userDTO),
       datatype: "json",
       contenttype: "application/json; charset=utf-8"
    });

    Fiddler Output

    Fiddler shows correctly passed JSON variables and in the Raw view I can see:
    {"userID":5,"username":"testuser"}

    On Execution

    userID = 0
    username = null

    The Issue?

    Case sensitivity on the .ajax jQuery call:
    • datatype should be dataType
    • contenttype should be contentType
    Thanks to those at StackOverflow for assistance with this one!

    Wednesday, December 15, 2010

    Moving a Records Center: Bad Idea

    Moving a SharePoint 2010 records center will cause several issues of varying significance.

    Document Management Implications:

    • If you have a custom Send To location allowing users to submit documents to the record center, this URL will need to be updated
      • A simple change
    • If your Send To location leaves a reference to the file in the original location, these references will be broken
      • When clicking on the file reference you will see an error message “File not found”
      • This is irreversible

    Records Center Implications:

    • All content organizer rules will have incorrect URLs pointing to the destination library, so these all need to be updated
      • A time consuming change if you have a lot of rules
    • Any document library/records library with retention schedules will be irreversibly broken in that you can no longer manage the schedule.  All libraries will need to be re-created, content migrated, and retention schedules added
      • The symptoms of this are detailed below in section “Broken Records Library Symptoms”
      • I have found no way to get around this other than recreating all libraries!!!
      • A very time consuming change if you have a lot of libraries, content, or complex retention schedules
      • Note: to delete record libraries you need to remove all records, then run the "Hold Processing and Reporting" timer job in Central Administration.  When this is complete, you can delete the records library.  Thanks to Michal Pisarek on MSDN for this one.

    Broken Records Library Symptoms

    When trying to manage the retention settings on the records library I see the error:

    1. Click on "Change source or configure library schedule" to manage the retention schedule

    2. An error occurs

    3. And in the logs (nothing very helpful):

    Exception Thrown: StartIndex cannot be less than zero.
    Parameter name: startIndex
    Inner Exception: StartIndex cannot be less than zero.
    Parameter name: startIndex

    Recommendation

    If you are using a single records center for managing retention, putting the records center in its own site collection is a good idea.  This also isolates the content from a security perspective.  You can always manually add links to the records center where necessary for people that need to access it from the main application.

    Plan the Information Architecture up front: this one needs a good home from the start.

    Tuesday, December 14, 2010

    Delete document library option missing on settings page

    You can’t delete a document library if it contains records.  This is a sensible feature considering that documents declared as records would logically have some retention requirement…

    To delete:

    1. Undeclare your records or move them to another location
      1. If you're working with a large document library with many folders, you could create a "Records View" showing only documents where the "Declared Record" field is not blank
      2. When all records are deleted or moved, the delete option appears again on the settings page
    2. Delete the document library

    Monday, November 15, 2010

    PowerShell: file renaming en-masse

    You may have come across the following error message when uploading files in SharePoint:
    The file name is invalid or the file is empty. A file name cannot contain any of the following characters: \ / : * ? " < > | # { } % ~ &
    But what if you have hundreds or thousands of files that DO include one or more symbols, and you need to move them onto SharePoint?
    The following PowerShell script will replace a string in a filename with another string you provide.  It doesn't alter file extensions or folders.  The script can be executed with command line parameters; if they are not provided, it will prompt you for input.  The following parameters are used:
    • The directory, e.g. "C:\test"
    • The text to find, e.g. "&"
    • The replacement text, e.g. "and"
    • An optional switch that specifies whether to recurse through subfolders: -recurse
    An example of how to use this script would be:
    • .\fileRenamer.ps1 "C:\test" "&" "and" -recurse
    If you execute the script with no parameters, it will ask you to enter the directory, find, and replacement values in turn, but it won't ask whether you'd like to recurse; it will default to non-recursive.

    Error Handling

    The only issue this script handles is when the provisional new file name already exists, e.g. you are changing Q&A.doc to QandA.doc and that filename is already in use.  In this instance, the script notifies the user that the rename operation has been cancelled, and the file is listed in the summary information at the end of the script's output.

    Script Output

    The script prints its progress to the console and also provides some summary information at the end.

    Disclaimer

    Use at your own risk. I will not be held responsible for any actions and/or results this script may cause.  If you choose to use this script, test it first, and only run it on directories/files with backups, as the implications of renaming large numbers of files are serious and the operation cannot be undone.

    The Code

    Paste this into a new file and save it as “fileRenamer.ps1” or similar.
    param(
        [switch]$recurse,
        [string]$directory = $(read-host "Please enter a directory"),
        [string]$find = $(read-host "Please enter a string to find"),
        [string]$replace = $(read-host "Please enter a replacement string")
    )

    # Startup text
    echo "This script will replace text in filenames of any file type.  It will not modify file extensions."
    echo ""
    echo "WARNING: This action cannot be undone; use at your own risk!"
    echo ""

    # Setup variables
    $modifiedcounter = 0
    $counter = 0
    $exceptioncounter = 0
    $exceptionfilenames = ""
    $files = ""

    echo "Replacing text '$find' with '$replace' in all filenames..."
    echo ""

    if ($recurse)
    {
        # Grab all files recursively
        $files = get-childitem $directory -include *.* -recurse
    }
    else
    {
        $files = get-childitem $directory *.*
    }

    foreach ($file in $files) {
        $filename = $file.Name
       
        # Only run if this is a file
        if ($filename.IndexOf('.') -gt 0)
        {
            $name_noextension = $filename.SubString(0, $filename.LastIndexOf('.'))
            $extension = $filename.SubString($filename.LastIndexOf('.'), $filename.Length-$filename.LastIndexOf('.'))

            echo ("Filename:          " + $file.FullName)
            #echo "Extension removed: $name_noextension"
            #echo "Extension:         $extension"
           
            # If there is a match then attempt to rename the file
            if ([regex]::IsMatch($name_noextension, ".*" + [regex]::Escape($find) + ".*")) {
                # Change the filename ([regex]::Escape above keeps characters such as * or ? in $find literal)
                $name_noextension = $name_noextension.Replace($find, $replace);
                $newname = $name_noextension + $extension
               
                #echo "New name:          $newname"

                # Test to see whether a file already exists
                if (Test-Path ($file.DirectoryName + "\" + $newname))
                {
                    # A file already exists with that filename
                    echo ("Action:            A file exists with the name " + ($file.DirectoryName + "\" + $newname) + " so this action has been cancelled.")
                    $exceptioncounter++
                    $exceptionfilenames += ($file.DirectoryName + "\" + $newname + "`n")
                }
                else
                {
                    # Rename the file                
                    rename-item $file.FullName -newname $newname
                    echo "Action:            Filename modified from $filename to $newname."
                    $modifiedcounter++
                }
            }
            else
            {
                echo "Action:            None";
            }   

            $counter++
        }
    }

    # Output the results
    echo ""
    echo "$modifiedcounter of $counter filenames modified."
    echo "Exception count: $exceptioncounter"

    # If there were exceptions then output that as well
    if ($exceptioncounter -gt 0)
    {
        echo ""
        echo "The following files were not modified as files already existed with their provisional new names:"
        echo $exceptionfilenames
    }

    Read-host "Press enter to quit"

    Feedback

    Please let me know if you spot any errors in this script or ways in which it could be improved!