C# – Parsing CSV data when a field has commas

When you have commas in your CSV fields, it creates a conflict with the field delimiting commas. In other words, you can’t tell which data belongs to which field. How you deal with this will depend on one question: is the field with the comma enclosed in quotes?

Comma is enclosed in quotes

Spreadsheet programs and good CSV generators will automatically detect commas in fields and enclose the field with quotes. CSV with quote-enclosed commas look like this:

Title,Year
"I, Tonya",2017
Inception,2010
Code language: plaintext (plaintext)

If you’re running into this situation, don’t try to parse the CSV manually. Instead, use a CSV parser library, such as CsvHelper. This automatically handles quote-enclosed commas.

To use CsvHelper, first install the package:

Install-Package CsvHelper
Code language: PowerShell (powershell)

Note: This is using Package Manager Console (View > Other Windows).

And here’s an example of using CsvHelper to parse the CSV data shown above (contained in a file):

using CsvHelper;
using System.Globalization;

using (var reader = new StreamReader(@"C:\movies.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
	foreach (var movie in csv.GetRecords<Movie>())
	{
		Console.WriteLine($"{movie.Title} was released in {movie.Year}");
	}
}
Code language: C# (cs)

This outputs the following, showing that it correctly handled the field with the comma in it:

I, Tonya was released in 2017
Inception was released in 2010
Code language: plaintext (plaintext)

Comma is not enclosed in quotes

Here’s what it looks like when you have commas in your fields and they aren’t enclosed in quotes:

Title,Year
I, Tonya,2017
Inception,2010
Code language: plaintext (plaintext)

This is a harder problem to deal with, but it’s not impossible.

This is malformed CSV data, because it doesn’t follow the standard of quote-enclosing commas. If you are in control of generating the CSV data in the first place, then you can fix it by quote-enclosing the commas (or using a different delimiter, such as a tab). Then you can use a parser to deal with it (as shown in the previous section).

If you can’t fix the data, and really must deal with it as is, then one solution is to merge the slots that belong to the string field. Here’s the algorithm for doing this:

  • Split each line by a comma into an array, then convert the array into a list (necessary for easy removal/insertion in the merge step).
  • If the line has more fields than expected, then merge the string fields:
    • Figure out how many “parts” should be merged together into one field.
    • Merge the “parts” together by joining the strings with a comma.
    • Remove the “parts” that you just removed and then insert the merged string into the list.

The following code implements this approach:

using System.IO;
using System.Linq;

var csvFilePath = @"C:\movies.csv";
var expectedFieldCount = 2;

foreach (var line in File.ReadLines(csvFilePath).Skip(1)) //skip the header row
{
    var fields = line.Split(',').ToList();

    //Check if the string field had commas and then merge its parts together
    if (fields.Count > expectedFieldCount)
    {
        int mergeCount = 1 + (fields.Count - expectedFieldCount);
        var mergedString = string.Join(",", fields.Take(mergeCount));
        fields.RemoveRange(0, mergeCount);
        fields.Insert(0, mergedString);
    }

    Console.WriteLine($"{fields[0]} was released in {fields[1]}");
}
Code language: C# (cs)

Here’s what this code does step by step on the problematic CSV example containing the “I,Tonya,2017” line:

  • Split “I, Tonya,2017” with a comma, giving [I][Tonya][2017].
  • This has 3 fields, while except 2 (Title, Year). Therefore, we know the string field (the Title) has a comma in it.
  • Join [I][Tonya] with a comma, giving “I,Tonya”.
  • Remove [I][Tonya] from the list, leaving just [2017].
  • Insert the fixed string “I,Tonya” into the start of the list, resulting in [I,Tonya][2017].

This outputs the following, showing that it correctly handled the field with the comma in it:

I, Tonya was released in 2017
Inception was released in 2010Code language: plaintext (plaintext)

To repeat: this approach requires you to know that a string field *might* have commas in it. If you are dealing with problematic CSV data like this, this is just a hoop you have to jump through to parse it.

Leave a Comment