Parsing CSV file with multiline fields and escaped double quotes
NickName:sashoalm Ask DateTime:2013-03-20T17:48:12

Parsing CSV file with multiline fields and escaped double quotes

What is the best way to parse a CSV file with multiline fields and escaped quotes?

For example, this CSV

First field of first row,"This field is multiline

but that's OK because it's enclosed in double qoutes, and this
is an escaped "" double qoute" but this one "" is not
   "This is second field of second row, but it is not multiline
because it doesn't start 
with an immediate double quote"

looks in Excel like this:

enter image description here

Do I just preserve the state somehow, have a flag saying if the field I'm reading now has started with a qoute, etc.? Also, what happens to cases where there are double quotes but not at the start of the field, i.e. , "ABC", or ,"item "" item" "" are those considered invalid? Also, are "" quotes escaped if not inside a quoted field? Excel doesn't seem to.

Are there any other corner cases I might have missed?

Copyright Notice:Content Author:「sashoalm」,Reproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/15520113/parsing-csv-file-with-multiline-fields-and-escaped-double-quotes

Answers
Jonathan Leffler 2013-03-20T14:38:15

Your sample data has a single double quote just after 'double qoute' (sic) which terminates the double quoted start of field, but you have to continue reading until the next comma or the end of line. This is a malformed multi-line field, but you're seeing what Excel does with it. You can find a description of this in the (excellent) book The Practice of Programming (dead link — Internet Archive copy; live link at Princeton University) which includes a CSV parsing library in C and a reimplementation in C++, and which discusses this level of detail. There's also a standard for CSV in RFC 4180 \"Common Format and MIME Type for Comma-Separated Values\" and you can also study Wikipedia on the subject.\n\n\n\nIn the other answer is some sample code, still under test. It is pretty resilient within its limitations. Here it is modified into an SSCCE (Short, Self-Contained, Correct Example).\n\n#include <stdbool.h>\n#include <wchar.h>\n#include <wctype.h>\n\nextern const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline);\n\n// Returns a pointer to the start of the next field,\n// or zero if this is the last field in the CSV\n// p is the start position of the field\n// sep is the separator used, i.e. comma or semicolon\n// newline says whether the field ends with a newline or with a comma\nconst wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline)\n{\n // Parse quoted sequences\n if ('\"' == p[0]) {\n p++;\n while (1) {\n // Find next double-quote\n p = wcschr(p, L'\"');\n // If we don't find it or it's the last symbol\n // then this is the last field\n if (!p || !p[1])\n return 0;\n // Check for \"\", it is an escaped double-quote\n if (p[1] != '\"')\n break;\n // Skip the escaped double-quote\n p += 2;\n }\n }\n\n // Find next newline or comma.\n wchar_t newline_or_sep[4] = L\"\\n\\r \";\n newline_or_sep[2] = sep;\n p = wcspbrk(p, newline_or_sep);\n\n // If no newline or separator, this is the last field.\n if (!p)\n return 0;\n\n // Check if we had newline.\n *newline = (p[0] == '\\r' || p[0] == '\\n');\n\n // Handle \"\\r\\n\", otherwise just increment\n if (p[0] == '\\r' && p[1] == '\\n')\n p += 2;\n else\n p++;\n\n return p;\n}\n\nstatic void dissect(const wchar_t *line)\n{\n const wchar_t *start = line;\n const wchar_t *next;\n bool eol;\n wprintf(L\"Input: %d [%.*ls]\\n\", wcslen(line), wcslen(line)-1, line);\n while ((next = nextCsvField(start, L',', &eol)) != 0)\n {\n wprintf(L\"Field: [%.*ls] (eol = %d)\\n\", (next - start - eol), start, eol);\n start = next;\n }\n}\n\nstatic const wchar_t multiline[] =\n L\"First field of first row,\\\"This field is multiline\\n\"\n \"\\n\"\n \"but that's OK because it's enclosed in double quotes, and this\\n\"\n \"is an escaped \\\"\\\" double quote\\\" but this one \\\"\\\" is not\\n\"\n \" \\\"This is second field of second row, but it is not multiline\\n\"\n \" because it doesn't start \\n\"\n \" with an immediate double quote\\\"\\n\"\n ;\n\nint main(void)\n{\n wchar_t line[1024];\n\n while (fgetws(line, sizeof(line)/sizeof(line[0]), stdin))\n dissect(line);\n dissect(multiline);\n\n return 0;\n}\n\n\nExample output\n\n$ cat csv.data\na,bb, c ,d\"\"e,f\n1,\"2\",\"\",,\"\"\"\",4\n$ ./wcsv < csv.data\nInput: 16 [a,bb, c ,d\"\"e,f]\nField: [a,] (eol = 0)\nField: [bb,] (eol = 0)\nField: [ c ,] (eol = 0)\nField: [d\"\"e,] (eol = 0)\nField: [f] (eol = 1)\nInput: 17 [1,\"2\",\"\",,\"\"\"\",4]\nField: [1,] (eol = 0)\nField: [\"2\",] (eol = 0)\nField: [\"\",] (eol = 0)\nField: [,] (eol = 0)\nField: [\"\"\"\",] (eol = 0)\nField: [4] (eol = 1)\nInput: 296 [First field of first row,\"This field is multiline\n\nbut that's OK because it's enclosed in double quotes, and this\nis an escaped \"\" double quote\" but this one \"\" is not\n \"This is second field of second row, but it is not multiline\n because it doesn't start \n with an immediate double quote\"]\nField: [First field of first row,] (eol = 0)\nField: [\"This field is multiline\n\nbut that's OK because it's enclosed in double quotes, and this\nis an escaped \"\" double quote\" but this one \"\" is not] (eol = 1)\nField: [ \"This is second field of second row,] (eol = 0)\nField: [ but it is not multiline] (eol = 1)\nField: [ because it doesn't start ] (eol = 1)\nField: [ with an immediate double quote\"] (eol = 1)\n$\n\n\nI said 'within its limitations'; what are its limitations?\n\nPrimarily, it isolates the raw field, rather than the converted field. Thus, the field it isolates has to be modified to produce the 'real' value, with the enclosing double quotes stripped off, and internal doubled double quotes replaced by single quotes. Converting a raw field to the real value mimics quite a lot of the code in the nextCsvField() function. The inputs are the start of the field and the end of the field (the separator character). Here's a second SSCCE with an extra function csvFieldData(), and the dissect() function shown above revised to call it. The format of the dissected output is slightly different so it looks better:\n\n#include <stdbool.h>\n#include <wchar.h>\n#include <wctype.h>\n\nextern const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline);\n\n// Returns a pointer to the start of the next field,\n// or zero if this is the last field in the CSV\n// p is the start position of the field\n// sep is the separator used, i.e. comma or semicolon\n// newline says whether the field ends with a newline or with a comma\nconst wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline)\n{\n // Parse quoted sequences\n if ('\"' == p[0]) {\n p++;\n while (1) {\n // Find next double-quote\n p = wcschr(p, L'\"');\n // If we don't find it or it's the last symbol\n // then this is the last field\n if (!p || !p[1])\n return 0;\n // Check for \"\", it is an escaped double-quote\n if (p[1] != '\"')\n break;\n // Skip the escaped double-quote\n p += 2;\n }\n }\n\n // Find next newline or comma.\n wchar_t newline_or_sep[4] = L\"\\n\\r \";\n newline_or_sep[2] = sep;\n p = wcspbrk(p, newline_or_sep);\n\n // If no newline or separator, this is the last field.\n if (!p)\n return 0;\n\n // Check if we had newline.\n *newline = (p[0] == '\\r' || p[0] == '\\n');\n\n // Handle \"\\r\\n\", otherwise just increment\n if (p[0] == '\\r' && p[1] == '\\n')\n p += 2;\n else\n p++;\n\n return p;\n}\n\nstatic wchar_t *csvFieldData(const wchar_t *fld_s, const wchar_t *fld_e, wchar_t *buffer, size_t buflen)\n{\n wchar_t *dst = buffer;\n wchar_t *end = buffer + buflen - 1;\n const wchar_t *src = fld_s;\n\n if (*src == L'\"')\n {\n const wchar_t *p = src + 1;\n while (p < fld_e && dst < end)\n {\n if (p[0] == L'\"' && p+1 < fld_s && p[1] == L'\"')\n {\n *dst++ = p[0];\n p += 2;\n }\n else if (p[0] == L'\"')\n {\n p++;\n break;\n }\n else\n *dst++ = *p++;\n }\n src = p;\n }\n while (src < fld_e && dst < end)\n *dst++ = *src++;\n if (dst >= end)\n return 0;\n *dst = L'\\0';\n return(buffer);\n}\n\nstatic void dissect(const wchar_t *line)\n{\n const wchar_t *start = line;\n const wchar_t *next;\n bool eol;\n wprintf(L\"Input %3zd: [%.*ls]\\n\", wcslen(line), wcslen(line)-1, line);\n while ((next = nextCsvField(start, L',', &eol)) != 0)\n {\n wchar_t buffer[1024];\n wprintf(L\"Raw Field: [%.*ls] (eol = %d)\\n\", (next - start - eol), start, eol);\n if (csvFieldData(start, next-1, buffer, sizeof(buffer)/sizeof(buffer[0])) != 0)\n wprintf(L\"Field %3zd: [%ls]\\n\", wcslen(buffer), buffer);\n start = next;\n }\n}\n\nstatic const wchar_t multiline[] =\n L\"First field of first row,\\\"This field is multiline\\n\"\n \"\\n\"\n \"but that's OK because it's enclosed in double quotes, and this\\n\"\n \"is an escaped \\\"\\\" double quote\\\" but this one \\\"\\\" is not\\n\"\n \" \\\"This is second field of second row, but it is not multiline\\n\"\n \" because it doesn't start \\n\"\n \" with an immediate double quote\\\"\\n\"\n ;\n\nint main(void)\n{\n wchar_t line[1024];\n\n while (fgetws(line, sizeof(line)/sizeof(line[0]), stdin))\n dissect(line);\n dissect(multiline);\n\n return 0;\n}\n\n\nExample output\n\n$ ./wcsv < csv.data\nInput 16: [a,bb, c ,d\"\"e,f]\nRaw Field: [a,] (eol = 0)\nField 1: [a]\nRaw Field: [bb,] (eol = 0)\nField 2: [bb]\nRaw Field: [ c ,] (eol = 0)\nField 3: [ c ]\nRaw Field: [d\"\"e,] (eol = 0)\nField 4: [d\"\"e]\nRaw Field: [f] (eol = 1)\nField 1: [f]\nInput 17: [1,\"2\",\"\",,\"\"\"\",4]\nRaw Field: [1,] (eol = 0)\nField 1: [1]\nRaw Field: [\"2\",] (eol = 0)\nField 1: [2]\nRaw Field: [\"\",] (eol = 0)\nField 0: []\nRaw Field: [,] (eol = 0)\nField 0: []\nRaw Field: [\"\"\"\",] (eol = 0)\nField 2: [\"\"]\nRaw Field: [4] (eol = 1)\nField 1: [4]\nInput 296: [First field of first row,\"This field is multiline\n\nbut that's OK because it's enclosed in double quotes, and this\nis an escaped \"\" double quote\" but this one \"\" is not\n \"This is second field of second row, but it is not multiline\n because it doesn't start \n with an immediate double quote\"]\nRaw Field: [First field of first row,] (eol = 0)\nField 24: [First field of first row]\nRaw Field: [\"This field is multiline\n\nbut that's OK because it's enclosed in double quotes, and this\nis an escaped \"\" double quote\" but this one \"\" is not] (eol = 1)\nField 140: [This field is multiline\n\nbut that's OK because it's enclosed in double quotes, and this\nis an escaped \" double quote\" but this one \"\" is not]\nRaw Field: [ \"This is second field of second row,] (eol = 0)\nField 38: [ \"This is second field of second row]\nRaw Field: [ but it is not multiline] (eol = 1)\nField 24: [ but it is not multiline]\nRaw Field: [ because it doesn't start ] (eol = 1)\nField 28: [ because it doesn't start ]\nRaw Field: [ with an immediate double quote\"] (eol = 1)\nField 34: [ with an immediate double quote\"]\n$ \n\n\nI've not tested with \\r\\n (or plain \\r) line endings.",


More about “Parsing CSV file with multiline fields and escaped double quotes” related questions

Parsing CSV file with multiline fields and escaped double quotes

What is the best way to parse a CSV file with multiline fields and escaped quotes? For example, this CSV First field of first row,"This field is multiline but that's OK because it's enclosed in ...

Show Detail

Replace escaped double quotes in CSV

I am reading content from a csv. I want to replace the single and double quotes from the data. I have tried the following code but is not working. $resource = fopen($this-&gt;csv_file, 'r');

Show Detail

Parsing csv file with fields enclosed in double quotation marks and a field contains double quotes

I am parsing a CSV file with opencsv and all the fields are separated by commas and are enclosed in double quotation marks. If a field contains double quotes, open CSV doesn't parse the file proper...

Show Detail

rust-csv parse a string field wrapped in double quotes that contains newlines and uses double-quotes as escape character

I have a csv similar to this (original file is proprietary, cannot share). Separator is Tab. It contains a description column, whose text is wrapped in double quotes, can contain quoted strings, wh...

Show Detail

Quoted fields escaped in smarter_csv

I have a CSV file that I'm trying to process with smarter_csv in Ruby. Every field has double quuotes and there are some fields that have double quotes nested within them that are not escaped. I'm ...

Show Detail

CSV parsing with univocity-parsers and backslash-escaped quotes

I'm having some trouble parsing CSV with backslash escaped qoutes \&quot;. Most of lines in source CSV don't include escaped quotes but where there are I can't seem to find appropriate settings for

Show Detail

Serializing string with escaped double quotes to XML

I have a problem trying to serialize a Java Object to XML using jackson XmlMapper. One of the fields contains a string with escaped double quotes, like this: String example = &quot;This is a string...

Show Detail

Regex for matching Python multiline string with escaped characters

I'm parsing Python source code, and I've got regular expressions for single and double quoted strings (obtained by reading ridgerunner's answer to this thread). single_quote_re = "'([^'\\\\]*(?:\\...

Show Detail

How do I replace double quotes inside double quotes with empty character in CSV fields?

I have a CSV file with each fields quoted in double quotes. But some of the fields/strings itself has double quotes inside it and I want to remove them from that particular string. For example- On...

Show Detail

SPSS Syntax to import RFC 4180 CSV file with escaped double quotes

How do I read an RFC4180-standard CSV file into SPSS? Specifically, how to handle string values that have embedded double quotes which are (properly) escaped with a second double quote? Here's ...

Show Detail