Download the DataSetSamples.exe sample code associated with the article.
Contents
Introduction
Raw Performance
The DataTable – More Independent Than Before
Stream to Cache, Cache to Stream
Conclusion
Introduction
In the upcoming release of ADO.NET, ADO.NET 2.0, there are many new and improved features that affect many different .NET Framework classes and application development scenarios. This article discusses on the changes and enhancement to the core disconnected mode ADO.NET Framework classes—the DataSet and associated classes such as DataSet, DataTable, and DataView.
This article is actually the first of two articles on the DataSet and associated classes in ADO.NET 2.0. Here we will focus on the classes in the .NET Framework. In the subsequent article, we will focus on developing with these and related classes from within the Visual Studio 2005 development environment. Visual Studio 2005 offers several designers and tools that offer tremendous flexibility and productivity for developing the data-centric aspects of your application. As a result, each article will have a different "feel". This article is mainly an overview of new functionality, accompanied by explanations and code samples. In the next article, the focus is more on the development process, as we see how to develop a working application.
As I mentioned above, this article only covers a small slice of the new features of ADO.NET 2.0. An overview of some of the other features can be found in ADO.NET 2.0 Feature Matrix. More in depth information on some of the topics mentioned there can be found these articles:
- Asynchronous Command Execution in ADO.NET 2.0
- Generic Coding with the ADO.NET 2.0 Base Classes and Factories
- Schemas in ADO.NET 2.0
Unless noted otherwise, the contents of this article are based on the Beta 1 release of Visual Studio 2005. The code samples use the Northwind database that comes as a sample database with SQL Server 2000.
Raw Performance
Software developers are always concerned with performance. Sometimes they get over-concerned and make their code jump through hoops to just trim a little execution time, in places where it ultimately isn't significant—but that is a subject for another article. When it comes to ADO.NET 1.x DataSets, particularly those containing a large amount of data, the performance concerns expressed by developers are indeed justified. Large DataSets are slow—in two different contexts. The first time the sluggish performance is felt is when loading a DataSet (actually, a DataTable) with a large number of rows. As the number of rows in a DataTable increases, the time to load a new row increases almost proportionally to the number of rows in the DataTable. The other time the performance hit is felt is when serializing and remoting a large DataSet. A key feature of the DataSet is the fact that it automatically knows how to serialize itself, especially when we want to pass it between application tiers. However, a close look reveals that this serialization is quite verbose, consuming much memory and network bandwidth. Both of these performance bottlenecks are addressed in ADO.NET 2.0.
New Indexing Engine
The indexing engine for the DataTable has been completely rewritten in ADO.NET 2.0 and scales much better for large datasets. This results in faster basic inserts, updates, and deletes, and therefore faster Fill and Merge operations. While benchmarks and quantifying performance gains is always an application-specific and often risky affair, these improvements clearly provide more than an order of magnitude improvement in loading a DataTable with a million rows. But don't take my word for it, check it out yourself, with the following simple example. Add the following code as the click event handler for a button on a Windows form:
Private Sub LoadButton_Click(ByVal sender As System.Object,
ByVal e As System.EventArgs) Handles LoadButton.Click
Dim ds As New DataSet
Dim time1 As New Date
Dim i As Integer
Dim dr As DataRow
ds.Tables.Add("BigTable")
ds.Tables(0).Columns.Add("ID", Type.GetType("System.Int32"))
ds.Tables(0).Columns("ID").Unique = True
ds.Tables(0).Columns.Add("Value", Type.GetType("System.Int32"))
' Show status label
WaitLabel.Visible = True
Me.Cursor = Cursors.WaitCursor
Me.Refresh()
' catch start time
time1 = DateTime.Now()
' Yes, we are loading a million rows to a DataTable!
'
' If you compile/run this with ADO.NET 1.1, you have time
' to make and enjoy a fresh pot of coffee...
Dim rand As New Random
Dim value As Integer
For i = 1 To 1000000
Try
value = rand.Next
dr = ds.Tables(0).NewRow()
dr("ID") = value
dr("Value") = value
ds.Tables(0).Rows.Add(dr)
Catch ex As Exception
' if there are any duplicate values, an exception
' will be thrown since the ID column was specified
' to be unique
End Try
Next
' reset cursor and label
WaitLabel.Visible = False
Me.Cursor = Me.DefaultCursor
' Show elapsed time, in seconds
MessageBox.Show("Elapsed Time: " & _
DateDiff(DateInterval.Second, time1, DateTime.Now))
' verify number of rows in the table
' This number will probably be less that the number
' of loop iterations, since if the same random number
' comes up, it will/can not be added to the table
MessageBox.Show("count = " & ds.Tables(0).Rows.Count)
End Sub
When I ran this code in my environment with ADO.NET 1.1 and Visual Studio 2003, the execution time was about 30 minutes. With ADO.NET 2.0 and Visual Studio 2005, I had an execution time of approximately 40-50 seconds! When I lowered the number of rows to only half a million, the 1.1 version took about 45 seconds and the 2.0 version took about 20 seconds. Your numbers will vary, but I think the point is clear.
In fact, this example is a very simple one, since it contains only one index, for the unique column. However, as the number of indices on the specified DataTable increases, such as by adding additional DataViews, UniqueKeys and ForeignKeys, the performance difference will be that much greater.
Note The reason the ID value in the sample code is being generated by a random number generator rather than just using the loop counter as the ID, is in order to better represent the real-world scenario. In real applications, accessing the elements of a DataTable for Inserts, Updates, and Deletes is rarely done sequentially. For each operation, the row specified by the unique key must first be located. When inserting and deleting rows, the table's indices must be updated. If we were to just load a million rows with sequentially key values into an empty table, the results would be extremely fast, but misleading.
Binary Serialization Option
The major performance improvement in loading a DataTable with a lot of data did not require us to make any change at all to our existing ADO.NET 1.x code. In order to benefit from improved performance when serializing the DataSet, we need to work a bit harder—we need to add a single line of code to set the new RemotingFormat property.
In ADO.NET 1.x, the DataSet serializes as XML, even when using the binary formatter. In ADO.NET 2.0, in addition to this behavior, we can also specify true binary serialization, by setting the RemotingFormat property to SerializationFormat.Binary rather than (the default) SerializationFormat.XML. Let us take a look at the different outputs resulting from these two different options.
In order to maintain backwards compatibility (about which the ADO.NET team was always concerned), the default value of XML serialization will give us the same behavior as in ADO.NET 1.x. The results of this serialization can be seen by running this code:
Private Sub XMLButton_Click(ByVal sender As System.Object,
ByVal e As System.EventArgs) Handles XMLButton.Click
Dim ds As New DataSet
Dim da As New SqlDataAdapter("select * from [order details]", _
GetConnectionString())
da.Fill(ds)
Dim bf As New BinaryFormatter
Dim fs As New FileStream("..\xml.txt", FileMode.CreateNew)
bf.Serialize(fs, ds)
End Sub
Private Function GetConnectionString() As String
' To avoid hard-coding the connection string in your code,
' use the application settings
Return MySettings.Value.NorthwindConnection
End Function
Note that this code is explicitly using the BinaryFormatter class, yet the output in file xml.txt, shown in Figure 1, is clearly XML. Also, in this case, the size of the file is 388 KB.
Let us now change the serialization format to binary by adding the line
ds.RemotingFormat = SerializationFormat.Binary
and save the data to a different file by modifying the filename in the FileStream constructor so that the code now looks like this:
Private Sub BinaryButton_Click(ByVal sender As System.Object,
ByVal e As System.EventArgs) Handles BinaryButton.Click
Dim ds As New DataSet
Dim da As New SqlDataAdapter("select * from [order details]", _
GetConnectionString())
da.Fill(ds)
Dim bf As New BinaryFormatter
Dim fs As New FileStream("..\binary.txt", FileMode.CreateNew)
ds.RemotingFormat = SerializationFormat.Binary
bf.Serialize(fs, ds)
End Sub
The output in file binary.txt is shown in Figure 2. Here we see that it is now in fact binary data, pretty unintelligible to the human reader. Moreover, the size of this file is only 59 KB—again, an order of magnitude reduction in the amount of data that needs to be transferred and the CPU, memory, and bandwidth resources required to process it. It should be pointed out that this improvement is relevant when using remoting and not when using Web Services, since Web Services by definition must be passing XML. This means that you will only be able to take advantage of this enhancement when both sides of the communication are .NET-based and not when communicating to non-.NET platforms.
More in-depth details about DataSet serialization process can be found in Binary Serialization of DataSets.
The DataTable – More Independent Than Before
When discussing ADO.NET 1.x and its object model for disconnected data access, the central object was the DataSet. Sure it contained other objects, such the DataTable, DataRelation, DataRow, etc., but the attention generally started and revolved around the DataSet. It is true that most .NET developers were aware and leveraged the fact that the DataTable was quite useful on its own, without being encapsulated inside a DataSet. However, there were some scenarios where we couldn't do what we wanted to do with a DataTable unless we first took it and forced it into a DataSet. The most glaring and often painful example of this is to read and write (load and save) XML data in to and out of the DataTable. In ADO.NET 1.x, we must first add the DataTable to DataSet, just so we could read or write XML, since the methods to do so are only available on the DataSet!
One of the objectives of ADO.NET 2.0 was to make the standalone DataTable class far more functional and useful than it is in ADO.NET 1.x. The DataTable now supports the basic methods for XML, just as the DataSet does. This includes the following methods:
- ReadXML
- ReadXMLSchema
- WriteXML
- WriteXMLSchema
The DataTable is independently serializable and can be used in both web service and remoting scenarios. In addition to now supporting the Merge method, the stand-alone DataTable also supports new ADO.NET 2.0 features added to the DataSet:
- RemotingFormat property (discussed previously)
- Load method (discussed later in this article)
- GetDataReader method (discussed later in this article)
Note On the topic of XML, it is worth noting that in ADO.NET 2.0 there is much enhanced XML support—what Microsoft likes to call greater "XML Fidelity". This takes the form of support for the SQL Server 2005 XML data type, extended XSD schema support, an improved XSD schema inference engine, and the elimination of two often troublesome limitations: (i) The DataSet and DataTable classes can now handle multiple in-line schemas and (ii) The DataSet now fully supports namespaces, so that a DataSet can contain multiple DataTables with the same name, but from different namespaces, i.e., tables with the same unqualified names, but with different qualified names. Also, a child table with the same name and namespace that is included in multiple relations can be nested in multiple parent tables.
Stream to Cache, Cache to Stream
Another one of the main enhancements for the DataSet and DataTable classes in ADO.NET 2.0 is the availability of mechanisms to consume a DataReader (loading data into DataTables) and to expose a DataReader over the contents of DataTables.
Sometimes we have/receive our data in the form of a DataReader, but really want to have it in the form of a cached DataTable. The new Load method allows us to take an existing DataReader and use it to fill a DataTable with its contents.
Sometimes we have/receive our data in a cached form (DataTable) and need to access it via a DataReader type interface. The new GetTableReader method allows us to take an existing DataTable and access it with a DataReader interface and semantics.
In the following sections, we'll take a look at these new methods.
The Load Method – Basic Use
The Load method is a new method that has been added to the DataSet and the DataTable in ADO.NET 2.0. It loads a DataTable with the contents of a DataReader object. It can actually load multiple tables at one time, if the DataReader contains multiple resultsets.
The basic use of the Load method is quite straightforward:
MyDataTable.Load (MyDataReader)
A more complete illustration of its use is shown in this sample code:
Private Sub LoadButton_Click(ByVal sender As System.Object,
ByVal e As System.EventArgs) Handles LoadButton.Click
Try
Using connection As New SqlConnection(GetConnectionString())
Using command As New SqlCommand("SELECT * from customers", connection)
connection.Open()
Using dr As SqlDataReader = command.ExecuteReader()
'Fill table with data from DataReader
Dim dt As New DataTable
dt.Load(dr, LoadOption.OverwriteRow)
' Display the data
DataGridView1.DataSource = dt
End Using
End Using
End Using
Catch ex As SqlException
MessageBox.Show(ex.Message)
Catch ex As InvalidOperationException
MessageBox.Show(ex.Message)
Catch ex As Exception
' You might want to pass these errors
' back out to the caller.
MessageBox.Show(ex.Message)
End Try
End Sub
The code above initializes connection and command objects and then executes the ExecuteReader method to fetch the data from the database. The results of the query are provided as a DataReader, which is then passed to the Load method of the DataTable to fill it with the returned data. Once the DataTable is filled with the data, it can be bound and displayed in the DataGridView. The significance of the OverwriteRow load option for the (optional) LoadOption parameter will be explained in the next section.
The Load Method – Why am I loading this data?
If all you are doing with your DataSet/DataTable and DataAdapter is filling the DataSet with data from the data source, modifying that data, and then at some later point pushing it back into the data source, that things general move pretty smoothly. A first complication occurs if you are utilizing optimistic concurrency and a concurrency violation is detected (someone else already changed one of the rows you are trying to change). In this case what you normally need to do to resolve the conflict is to resynchronize the DataSet with the data source, so that the original values for the rows match the current database values. This can be accomplished by merging a DataTable with the new values into the original table (in ADO.NET 1.x, the merge method is only available on the DataSet):
OriginalTable.Merge(NewTable, True)
By matching rows with the same primary key, records in the new table are merged with the records in the original table. Of key significance here is the second parameter, PreserveChanges. This specifies that the merge operation should only update the original values for each row, and not affect the current values for the row. This allows the developer to subsequently execute a DataAdapter.Update that will now succeed in updating the data source with the changes (current values), since the original values now match the current data source values. If PreserveChanges is left at its default value of false, the merge would override both the original and current values of the rows in the original DataTable and all of the changes that were made would be lost.
However, sometimes we want to update data in the data source, where the new values don't come from programmatically modifying the values. Perhaps we obtain updated values from another database or from an XML source. In this scenario, we want to update the current values of the rows in the DataTable, but not affect the original values for those rows. There is no easy way to do this in ADO.NET 1.x. It is for this reason that the ADO.NET 2.0 Load method accepts a parameter LoadOption that indicates how to combine the new incoming rows with the same (primary key) rows already in the DataTable.
The LoadOption allows us to explicitly specify what our intention is when loading the data (synchronization or aggregation) and how we therefore want to merge the new and existing rows. Figure 3 outlines the various scenarios:
Where:
- Primary Data Source—DataTable/DataSet synchronizes/updates with only one Primary Data Source. It will track changes to allow for synchronization with primary data sources.
- Secondary Data Source—DataTable/DataSet accepts incremental data feeds from one or more Secondary Data Sources. It is not responsible for tracking changes for the purpose of synchronization with secondary data sources.
The three cases shown in Figure 3 can be summarized as follows:
- Case 1—Initialize DataTable(s) from Primary Data Source. The user wants to initialize an empty DataTable (original values and current values) with values from primary data source and then later, after changes have been made to this data, propagate the changes back to the primary data source.
- Case 2—Preserve Changes and Re-Sync from Primary Data Source. The user wants to take the modified DataTable and re-synchronize its contents (original values only) with the primary data source while maintaining the changes made (current values)
- Case 3—Aggregate incremental data feeds from one or more Secondary Data Sources. The user wants to accept changes (current values) from one or more secondary data sources and then propagate these changes back to the primary data source.
The LoadOption enumeration has three values that respectively represent these three scenarios:
- OverwriteRow—Update the current and original versions of the row with the value of the incoming row.
- PreserveCurrentValues (default)—Update original version of the row with the value of the incoming row.
- UpdateCurrentValues—Update the current version of the row with the value of the incoming row.
Note These names will probably change post-Beta 1.
Table 1 below summarizes the load semantics. If the incoming row and existing row agree on primary key values, then the row is processed using its existing DataRowState, else use 'Not Present' section (the last row in the table).
Table 1. Summary of Load Semantics
Existing DataRow State | UpdateCurrentValues | OverwriteRow | PreserveCurrentValues (Default) |
Added | Current = Original = - -- State = | Current = Original = State = | Current = Original = State = |
Modified | Current = Original = State = | Current = Original = State = | Current = Original = State = |
Deleted | (Undo Delete) and Current = Original = State = <> | (Undo Delete) and Current = Original = State = | Current = Original = State = |
Unchanged | Current = Original = If new value same as existing value then State = Else State = | Current = Original = State = | Current = Original = State = |
Not Present | Current = Original = --- State = <> | Current = Original = State = | Current = Original = State = |
Example
In order to illustrate the behavior specified in Table 1, I offer a simple example.
Assume that both the existing DataRow and incoming row have 2 columns with matching names. The first column is the primary key and the second column contains a numeric value. The tables below show the contents of the second column in the data rows.
Table 2 represents the contents of a row in all 4 states before invoking Load. The incoming row's second column value is 3. Table 3 shows its contents after load.
Table 2. Row State Before Load
Existing Row State | Version | Added | Modified | Deleted | Unchanged |
| Current | 2 | 2 | - | 4 |
| Original | - | 4 | 4 | 4 |
Incoming Row
Incoming Row |
3 |
Table 3. Row State After Load
| UpdateCurrentValues | OverwriteRow | PreserveCurrentValues |
Added | Current = <3> Original = --- State = | Current = <3> Original = <3> State = | Current = <2> Original = <3> State = |
Modified | Current = <3> Original = <4> State = | Current = <3> Original = <3> State = | Current = <2> Original = <3> State = |
Deleted | Current = <3> Original = <4> State = | Current = <3> Original = <3> State = | Current = <2> Original = <3> State = |
Unchanged | Current = <3> Original = <4> State = | Current = <3> Original = <3> State = | Current = <3> Original = <3> State = |
Not Present | Current = <3> Original = --- State = | Current = <3> Original = <3> State = | Current = <3> Original = <3> State = |
Note You can see the beginnings of this concept already in ADO.NET 1.x. The default behavior of the DataAdapter's Fill method when loading data into a DataTable is to mark all the rows as Unchanged (This can be overridden by setting the AcceptChangesOnFill property to False). However, when using ReadXML to load data into a DataSet, the rows are marked as Added. The rationale for this (which was implemented based on customer feedback) is that this would allow loading new data from an XML source into a DataSet and then using the associated DataAdapter to update the primary data source. If the rows were marked as Unchanged when loaded from ReadXML, the DataAdapter.Update would not detect and changes and would not execute any commands against the data source.
In order to provide similar functionality, the FillLoadOptions property has been added to the DataAdapter in order to offer the same semantics and behavior as the Load method described here, while still preserving the same (by default) existing behavior of the Fill method.
Another feature (which doesn't exist) that developers always ask about in ADO.NET 1.x, is the ability to manually modify the state of DataRow. While the options offered by the Load method may address most scenarios, you may still want to have finer-grained control over the row state—you may have a need to modify the state of individual rows. To that end, ADO.NET 2.0 introduces two new methods on the DataRow class: SetAdded and SetModified. Before you ask about setting the state to Deleted, or Unchanged, let me remind you that with version 1.x we already have the Delete and AcceptChanges/RejectChanges methods to accomplish this.
The GetTableReader Method
The GetTableReader method is a new method that has been added to the DataSet and the DataTable in ADO.NET 2.0. It returns the contents of a DataTable as a DataTableReader (derived from DBDataReader) object. If it is invoked on a DataSet that contains multiple tables, the DataReader will contain multiple resultsets.
The use of the GetTableReader method is quite straight-forward:
Dim dtr As DataTableReader = ds.Tables(0).GetDataReader
The DataTableReader works pretty much like the other data readers you have worked with, such as the SqlDataReader or OleDbDataReader. The difference is, however, that rather than streaming data from a live database connection, the DataTableReader provides iteration over the rows of a disconnected DataTable.
The DataTableReader provides a smart, stable iterator. The cached data may be modified while the DataTableReader is active and the reader will automatically maintain its position appropriately even if one or more rows are deleted or inserted while iterating.
A DataTableReader that is created by calling GetDataReader on a DataTable contains one result set with the same data as the DataTable from which it was created. The result set contains only the current column values for each DataRow and rows that are marked for deletion are skipped. A DataTableReader that is created by calling GetDataReader on a DataSet that contains more than one table will contain multiple result sets. The result sets will be in the same sequence as the DataTable objects in the DataSet object's DataTableCollection.
In addition to the features outlined above, another great use of the GetDataReader method is to quickly copy data from one DataTable to another:
Dim dt2 as new DataTable
dt2.Load(ds.Tables(0).GetDataReader)
The DataView.ToTable Method
Another new method that is somewhat related to the previous ones (in that it provides a new DataTable cache of existing data) and is worth mentioning is the ToTable method of the DataView class. As a reminder, the DataView class provides a logical view of the rows in a DataTable. This view may be filtered by row, row state, and sorted. However, in ADO.NET 1.1, there is no easy way to save or pass on the rows of the view, since the DataView does not have its own copy of the rows—it simply accesses the rows of the underlying DataTable as prescribed by the filter and sort parameters. The DataView's ToTable method returns an actual DataTable object that is populated with rows of the exposed by the current view.
Overloaded versions of the ToTable method offer the option of specifying the list of columns to be included in the created table. The generated table will contain the listed columns in the specified sequence, which may differ from the original table/view. This ability to limit the number of columns in a view is a feature that is missing in ADO.NET 1.x and has frustrated many a .NET programmer. You can also specify the name of the created table and whether it should contain all or only distinct rows.
Here is some sample code that shows how to use the ToTable method:
Private Sub ToTableButton_Click(ByVal sender As System.Object,
ByVal e As System.EventArgs) Handles ToTableButton.Click
' Show only 2 columns in second grid
Dim columns As String() = {"CustomerID", "ContactName"}
Dim dt As DataTable = _
ds.Tables("customers").DefaultView.ToTable( _
"SmallCustomers", False, columns)
DataGridView2.DataSource = dt
End Sub
Assuming that the contents of the "customers" table in the DataSet ds are displayed in a first grid, this routine displays the newly created DataTable that contains only those rows exposed by the DefaultView (as specified by its filter parameters). The rows in the new table contain only two of the columns of the original DataTable and DataView. An example of this can be seen in Figure 4.
Conclusion
The ADO.Net 2.0 version of the DataSet (and DataTable) introduces numerous new features and enhancements to existing features. The main features, discussed in the article, include significantly improved performance due to a new index engine and the binary serialization format option, extensive capabilities available to a stand-alone DataTable, and mechanisms for exposing cached data as a stream (DataReader) and loading stream data into a DataTable cache. ADO.NET 2.0 also offers greater control over the state of rows in a DataTable, in order to better address more real-world scenarios.
Thanks to Mayank nagar, Kawarjit S. Bedi, Pablo Castro, Alan Griver, Steve Lasker, and Paul Yuknewicz of Microsoft for their help in preparing this article.