NickName:sashoalm Ask DateTime:2013-03-20T17:48:12

Parsing CSV file with multiline fields and escaped double quotes

What is the best way to parse a CSV file with multiline fields and escaped quotes?

For example, this CSV

First field of first row,"This field is multiline

but that's OK because it's enclosed in double qoutes, and this
is an escaped "" double qoute" but this one "" is not
   "This is second field of second row, but it is not multiline
because it doesn't start 
with an immediate double quote"

looks in Excel like this:

enter image description here

Do I just preserve the state somehow, have a flag saying if the field I'm reading now has started with a qoute, etc.? Also, what happens to cases where there are double quotes but not at the start of the field, i.e. , "ABC", or ,"item "" item" "" are those considered invalid? Also, are "" quotes escaped if not inside a quoted field? Excel doesn't seem to.

Are there any other corner cases I might have missed?

Copyright Notice：Content Author:「sashoalm」，Reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article：https://stackoverflow.com/questions/15520113/parsing-csv-file-with-multiline-fields-and-escaped-double-quotes

Answers

Jonathan Leffler 2013-03-20T14:38:15

Your sample data has a single double quote just after 'double qoute' (sic) which terminates the double quoted start of field, but you have to continue reading until the next comma or the end of line. This is a malformed multi-line field, but you're seeing what Excel does with it. You can find a description of this in the (excellent) book The Practice of Programming (dead link — Internet Archive copy; live link at Princeton University) which includes a CSV parsing library in C and a reimplementation in C++, and which discusses this level of detail. There's also a standard for CSV in RFC 4180 \"Common Format and MIME Type for Comma-Separated Values\" and you can also study Wikipedia on the subject.\n\n\n\nIn the other answer is some sample code, still under test. It is pretty resilient within its limitations. Here it is modified into an SSCCE (Short, Self-Contained, Correct Example).\n\n#include <stdbool.h>\n#include <wchar.h>\n#include <wctype.h>\n\nextern const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline);\n\n// Returns a pointer to the start of the next field,\n// or zero if this is the last field in the CSV\n// p is the start position of the field\n// sep is the separator used, i.e. comma or semicolon\n// newline says whether the field ends with a newline or with a comma\nconst wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline)\n{\n // Parse quoted sequences\n if ('\"' == p[0]) {\n p++;\n while (1) {\n // Find next double-quote\n p = wcschr(p, L'\"');\n // If we don't find it or it's the last symbol\n // then this is the last field\n if (!p || !p[1])\n return 0;\n // Check for \"\", it is an escaped double-quote\n if (p[1] != '\"')\n break;\n // Skip the escaped double-quote\n p += 2;\n }\n }\n\n // Find next newline or comma.\n wchar_t newline_or_sep[4] = L\"\\n\\r \";\n newline_or_sep[2] = sep;\n p = wcspbrk(p, newline_or_sep);\n\n // If no newline or separator, this is the last field.\n if (!p)\n return 0;\n\n // Check if we had newline.\n *newline = (p[0] == '\\r' || p[0] == '\\n');\n\n // Handle \"\\r\\n\", otherwise just increment\n if (p[0] == '\\r' && p[1] == '\\n')\n p += 2;\n else\n p++;\n\n return p;\n}\n\nstatic void dissect(const wchar_t *line)\n{\n const wchar_t *start = line;\n const wchar_t *next;\n bool eol;\n wprintf(L\"Input: %d [%.*ls]\\n\", wcslen(line), wcslen(line)-1, line);\n while ((next = nextCsvField(start, L',', &eol)) != 0)\n {\n wprintf(L\"Field: [%.*ls] (eol = %d)\\n\", (next - start - eol), start, eol);\n start = next;\n }\n}\n\nstatic const wchar_t multiline[] =\n L\"First field of first row,\\\"This field is multiline\\n\"\n \"\\n\"\n \"but that's OK because it's enclosed in double quotes, and this\\n\"\n \"is an escaped \\\"\\\" double quote\\\" but this one \\\"\\\" is not\\n\"\n \" \\\"This is second field of second row, but it is not multiline\\n\"\n \" because it doesn't start \\n\"\n \" with an immediate double quote\\\"\\n\"\n ;\n\nint main(void)\n{\n wchar_t line[1024];\n\n while (fgetws(line, sizeof(line)/sizeof(line[0]), stdin))\n dissect(line);\n dissect(multiline);\n\n return 0;\n}\n\n\nExample output\n\n$ cat csv.data\na,bb, c ,d\"\"e,f\n1,\"2\",\"\",,\"\"\"\",4\n$ ./wcsv < csv.data\nInput: 16 [a,bb, c ,d\"\"e,f]\nField: [a,] (eol = 0)\nField: [bb,] (eol = 0)\nField: [ c ,] (eol = 0)\nField: [d\"\"e,] (eol = 0)\nField: [f] (eol = 1)\nInput: 17 [1,\"2\",\"\",,\"\"\"\",4]\nField: [1,] (eol = 0)\nField: [\"2\",] (eol = 0)\nField: [\"\",] (eol = 0)\nField: [,] (eol = 0)\nField: [\"\"\"\",] (eol = 0)\nField: [4] (eol = 1)\nInput: 296 [First field of first row,\"This field is multiline\n\nbut that's OK because it's enclosed in double quotes, and this\nis an escaped \"\" double quote\" but this one \"\" is not\n \"This is second field of second row, but it is not multiline\n because it doesn't start \n with an immediate double quote\"]\nField: [First field of first row,] (eol = 0)\nField: [\"This field is multiline\n\nbut that's OK because it's enclosed in double quotes, and this\nis an escaped \"\" double quote\" but this one \"\" is not] (eol = 1)\nField: [ \"This is second field of second row,] (eol = 0)\nField: [ but it is not multiline] (eol = 1)\nField: [ because it doesn't start ] (eol = 1)\nField: [ with an immediate double quote\"] (eol = 1)\n$\n\n\nI said 'within its limitations'; what are its limitations?\n\nPrimarily, it isolates the raw field, rather than the converted field. Thus, the field it isolates has to be modified to produce the 'real' value, with the enclosing double quotes stripped off, and internal doubled double quotes replaced by single quotes. Converting a raw field to the real value mimics quite a lot of the code in the nextCsvField() function. The inputs are the start of the field and the end of the field (the separator character). Here's a second SSCCE with an extra function csvFieldData(), and the dissect() function shown above revised to call it. The format of the dissected output is slightly different so it looks better:\n\n#include <stdbool.h>\n#include <wchar.h>\n#include <wctype.h>\n\nextern const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline);\n\n// Returns a pointer to the start of the next field,\n// or zero if this is the last field in the CSV\n// p is the start position of the field\n// sep is the separator used, i.e. comma or semicolon\n// newline says whether the field ends with a newline or with a comma\nconst wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline)\n{\n // Parse quoted sequences\n if ('\"' == p[0]) {\n p++;\n while (1) {\n // Find next double-quote\n p = wcschr(p, L'\"');\n // If we don't find it or it's the last symbol\n // then this is the last field\n if (!p || !p[1])\n return 0;\n // Check for \"\", it is an escaped double-quote\n if (p[1] != '\"')\n break;\n // Skip the escaped double-quote\n p += 2;\n }\n }\n\n // Find next newline or comma.\n wchar_t newline_or_sep[4] = L\"\\n\\r \";\n newline_or_sep[2] = sep;\n p = wcspbrk(p, newline_or_sep);\n\n // If no newline or separator, this is the last field.\n if (!p)\n return 0;\n\n // Check if we had newline.\n *newline = (p[0] == '\\r' || p[0] == '\\n');\n\n // Handle \"\\r\\n\", otherwise just increment\n if (p[0] == '\\r' && p[1] == '\\n')\n p += 2;\n else\n p++;\n\n return p;\n}\n\nstatic wchar_t *csvFieldData(const wchar_t *fld_s, const wchar_t *fld_e, wchar_t *buffer, size_t buflen)\n{\n wchar_t *dst = buffer;\n wchar_t *end = buffer + buflen - 1;\n const wchar_t *src = fld_s;\n\n if (*src == L'\"')\n {\n const wchar_t *p = src + 1;\n while (p < fld_e && dst < end)\n {\n if (p[0] == L'\"' && p+1 < fld_s && p[1] == L'\"')\n {\n *dst++ = p[0];\n p += 2;\n }\n else if (p[0] == L'\"')\n {\n p++;\n break;\n }\n else\n *dst++ = *p++;\n }\n src = p;\n }\n while (src < fld_e && dst < end)\n *dst++ = *src++;\n if (dst >= end)\n return 0;\n *dst = L'\\0';\n return(buffer);\n}\n\nstatic void dissect(const wchar_t *line)\n{\n const wchar_t *start = line;\n const wchar_t *next;\n bool eol;\n wprintf(L\"Input %3zd: [%.*ls]\\n\", wcslen(line), wcslen(line)-1, line);\n while ((next = nextCsvField(start, L',', &eol)) != 0)\n {\n wchar_t buffer[1024];\n wprintf(L\"Raw Field: [%.*ls] (eol = %d)\\n\", (next - start - eol), start, eol);\n if (csvFieldData(start, next-1, buffer, sizeof(buffer)/sizeof(buffer[0])) != 0)\n wprintf(L\"Field %3zd: [%ls]\\n\", wcslen(buffer), buffer);\n start = next;\n }\n}\n\nstatic const wchar_t multiline[] =\n L\"First field of first row,\\\"This field is multiline\\n\"\n \"\\n\"\n \"but that's OK because it's enclosed in double quotes, and this\\n\"\n \"is an escaped \\\"\\\" double quote\\\" but this one \\\"\\\" is not\\n\"\n \" \\\"This is second field of second row, but it is not multiline\\n\"\n \" because it doesn't start \\n\"\n \" with an immediate double quote\\\"\\n\"\n ;\n\nint main(void)\n{\n wchar_t line[1024];\n\n while (fgetws(line, sizeof(line)/sizeof(line[0]), stdin))\n dissect(line);\n dissect(multiline);\n\n return 0;\n}\n\n\nExample output\n\n$ ./wcsv < csv.data\nInput 16: [a,bb, c ,d\"\"e,f]\nRaw Field: [a,] (eol = 0)\nField 1: [a]\nRaw Field: [bb,] (eol = 0)\nField 2: [bb]\nRaw Field: [ c ,] (eol = 0)\nField 3: [ c ]\nRaw Field: [d\"\"e,] (eol = 0)\nField 4: [d\"\"e]\nRaw Field: [f] (eol = 1)\nField 1: [f]\nInput 17: [1,\"2\",\"\",,\"\"\"\",4]\nRaw Field: [1,] (eol = 0)\nField 1: [1]\nRaw Field: [\"2\",] (eol = 0)\nField 1: [2]\nRaw Field: [\"\",] (eol = 0)\nField 0: []\nRaw Field: [,] (eol = 0)\nField 0: []\nRaw Field: [\"\"\"\",] (eol = 0)\nField 2: [\"\"]\nRaw Field: [4] (eol = 1)\nField 1: [4]\nInput 296: [First field of first row,\"This field is multiline\n\nbut that's OK because it's enclosed in double quotes, and this\nis an escaped \"\" double quote\" but this one \"\" is not\n \"This is second field of second row, but it is not multiline\n because it doesn't start \n with an immediate double quote\"]\nRaw Field: [First field of first row,] (eol = 0)\nField 24: [First field of first row]\nRaw Field: [\"This field is multiline\n\nbut that's OK because it's enclosed in double quotes, and this\nis an escaped \"\" double quote\" but this one \"\" is not] (eol = 1)\nField 140: [This field is multiline\n\nbut that's OK because it's enclosed in double quotes, and this\nis an escaped \" double quote\" but this one \"\" is not]\nRaw Field: [ \"This is second field of second row,] (eol = 0)\nField 38: [ \"This is second field of second row]\nRaw Field: [ but it is not multiline] (eol = 1)\nField 24: [ but it is not multiline]\nRaw Field: [ because it doesn't start ] (eol = 1)\nField 28: [ because it doesn't start ]\nRaw Field: [ with an immediate double quote\"] (eol = 1)\nField 34: [ with an immediate double quote\"]\n$ \n\n\nI've not tested with \\r\\n (or plain \\r) line endings.",