Wednesday, May 16, 2012

AWK: Recursive Descent CSV Parser

As a follow-up to my Recursive Descent CSV parser in BASH, I have translated it into an AWK script so that the data processing speed of the two scripting languages can be compared. The translation is not 1:1, for several reasons, but for those who are interested: this implementation is faster at string processing than the BASH version.



Originally I had a few questions, all of which have been resolved thanks to Jonathan Leffler.



This code is now ready for a showdown.



Basic Features




  • Empty Fields

  • Literal Quoted Fields via double quote "

  • Backslash Escaped: Commas, Quotes, Backslashes [1]

  • ANSI C Escape Sequences: Tab, Newline [1]

  • No imposed limitations on input length, field length, or field count



[1] Quoted fields have literal content: neither backslash escapes nor ANSI C escape sequence expansions are performed on quoted content. One can, however, concatenate quotes, plain text, and interpreted sequences in a single field to achieve the desired effect. For example:



one,two,three:\t"Little Endians," and one Big Endian Chief


is a three-field line of CSV in which the third field is equivalent to:



three:        Little Endians, and one Big Endian Chief
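
The concatenation rule can be checked with a condensed sketch. The helper below, split_csv, is hypothetical and is not part of csv.awk: it is an iterative restatement of the same field rules (literal quoted content, backslash escapes, \t and \n expansion outside quotes), written only so the example runs on its own. It assumes a POSIX shell and any common awk.

```shell
# split_csv DELIM: hypothetical helper, NOT the csv.awk listing.
# Reads one CSV line from stdin and prints its fields joined by DELIM.
split_csv() {
    awk -v delim="$1" '
    BEGIN {
        getline line
        n = 0; value = ""; inquote = 0
        len = length(line)
        for (i = 1; i <= len; i++) {
            c = substr(line, i, 1)
            if (inquote) {                      # quoted content is literal
                if (c == "\"") inquote = 0
                else value = value c
            } else if (c == "\"") {             # opening quote
                inquote = 1
            } else if (c == "\\") {             # escape outside quotes
                c = substr(line, ++i, 1)
                if (c == "n") value = value "\n"
                else if (c == "t") value = value "\t"
                else if (c == "," || c == "\"" || c == "\\") value = value c
                else value = value "\\" c
            } else if (c == ",") {              # field boundary
                item[n++] = value; value = ""
            } else {
                value = value c
            }
        }
        if (value != "") item[n++] = value      # last field, if any
        for (i = 0; i < n; i++)
            printf "%s%s", item[i], (i < n - 1 ? delim : "\n")
    }'
}

# The footnote example, with "|" as the output separator:
printf '%s\n' 'one,two,three:\t"Little Endians," and one Big Endian Chief' | split_csv '|'
```

The comma inside the quotes stays in the third field, while the \t outside the quotes becomes a real tab.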





Special thanks to all members of the Stack Overflow (SO) community whose experience, time, and input led me to create such a wonderfully useful tool for information handling.



Code Listing: csv.awk



# This script accepts and parses a single line of CSV input
# from STDIN. The output is separated by the command line
# variable 'delim'.

# Special thanks to Jonathan Leffler, whose wisdom and
# knowledge defined the output logic of this script.

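# Advance to the next character of input. Once strIndex passes the end
# of input, substr() returns "" and symbol becomes the empty string.
function NextSymbol() {

    strIndex++;
    symbol = substr(input, strIndex, 1);

    return (strIndex < parseExtent);

}

# Consume the current symbol if it matches query; otherwise leave it alone.
function Accept(query) {

    # print "query: " query " symbol: " symbol
    if ( symbol == query ) {
        # print "matched!"
        return NextSymbol();
    }

    return 0;

}

# Like Accept(), but report a parse error on stderr when query is missing.
function Expect(query) {

    # case: empty string...
    if ( query == nothing && symbol == nothing ) return 1;

    # case: else
    if ( Accept(query) ) return 1;

    msg = "csv parse error: expected '" query "': found '" symbol "'";
    print msg > "/dev/stderr";

    return 0;

}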

# Store the finished field and reset the accumulator.
function PushValue() {

    item[itmIndex++] = value;
    value = nothing;

}

# Copy quoted content literally, up to the closing quote.
function Quote() {

    while ( symbol != quote && symbol != nothing ) {
        value = value symbol;
        NextSymbol();
    }

    Expect(quote);

}

# Interpret the character that follows a backslash.
function BackSlash() {

    if ( symbol == quote || symbol == comma || symbol == backslash ) {
        value = value symbol;
    } else if ( symbol == "n" ) { # newline
        value = sprintf("%s\n", value);
    } else if ( symbol == "t" ) { # tab
        value = sprintf("%s\t", value);
    } else { # unknown escape: keep the backslash and the character
        value = value backslash symbol;
    }

}

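# Parse the rest of the line, one symbol per recursive call. An empty
# value at the end of input is not pushed, so a trailing empty field
# is dropped.
function Line() {

    if ( Accept(quote) ) {
        Quote();
        Line();
    }

    if ( Accept(backslash) ) {
        BackSlash();
        NextSymbol();
        Line();
    }

    if ( Accept(comma) ) {
        PushValue();
        Line();
    }

    if ( symbol != nothing ) {
        value = value symbol;
        NextSymbol();
        Line();
    } else if ( value != nothing ) PushValue();

}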

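BEGIN {

    # State Variables
    symbol = ""; value = ""; strIndex = 0; itmIndex = 0;

    # Control Variables
    parseExtent = 0;

    # Symbol Classes
    nothing = "";
    comma = ",";
    quote = "\"";
    backslash = "\\";

    getline input;
    # +2 so NextSymbol() still reports success when it steps one
    # position past the last character of input.
    parseExtent = (length(input) + 2);
    NextSymbol();
    Line();

}

END {

    if (itmIndex) {

        itmIndex--;

        # Every item except the last is followed by delim.
        for (i = 0; i < itmIndex; i++)
        {
            printf("%s", item[i] delim);
        }

        # The last item is followed by a newline instead.
        print item[i];

    }

}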





How to Run The Script "Like a Pro"



# Spit out some CSV "newline" delimited:
echo 'one,two,three,AWK,CSV!' | awk -v delim=$'\n' -f csv.awk

# Spit out some CSV "tab" delimited:
echo 'one,two,three,AWK,CSV!' | awk -v delim=$'\t' -f csv.awk

# Spit out some CSV "ASCII Group Separator" delimited
# (decimal 29 is octal 035; $'\29' would yield octal 2 followed by a literal 9):
echo 'one,two,three,AWK,CSV!' | awk -v delim=$'\035' -f csv.awk


If you need some custom control separators but aren't sure what to use, consult this chart.
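
As a rough aid for picking a separator, the sketch below prints a character code in decimal, octal, and hex, then splits a Group Separator delimited record back into fields. The helper name sep_codes and the sample record are made up for illustration. Escapes in shell printf formats and in bash $'...' strings are octal, so decimal 29 is written \035.

```shell
# sep_codes N: hypothetical helper, prints code N as decimal, octal, hex.
sep_codes() {
    printf '%d %03o %02x\n' "$1" "$1" "$1"
}

sep_codes 29    # Group Separator:  29 035 1d
sep_codes 30    # Record Separator: 30 036 1e
sep_codes 31    # Unit Separator:   31 037 1f

# Split a Group Separator delimited record back into fields.
# The sample record is illustrative; its fields contain a comma and a tab,
# which survive because neither is the separator.
gs=$(printf '\035')
record=$(printf 'one,1\035two\tB\035three')

old_ifs=$IFS
IFS=$gs
set -f              # no filename globbing while splitting
set -- $record      # unquoted on purpose: split on IFS
set +f
IFS=$old_ifs

count=$#; first=$1; second=$2; third=$3
printf '%s fields, first is "%s"\n' "$count" "$first"
```

A control character such as the Group Separator is unlikely to occur inside field data, which is why it can be a safer output separator than a comma or a tab.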




