Simple parsing with iostreams

Richard Smith
Posted: Thu Jun 19, 2008 1:28 am
Post subject: Simple parsing with iostreams
I use C++ standard IOStreams fairly extensively, but, while I use quite a lot of the functionality for formatted output, when it comes to input, I usually stick to unformatted input (e.g. with get / peek / ignore and friends) or std::getline. I rarely find myself using or writing custom operator>> on user-defined types. And I'm wondering whether I'm missing out on something here.
To make it concrete, let me give a fairly simple example where I recently found myself wanting to write a custom operator>>. Consider a simple class representing an HTTP request line. (I've left out all of the stuff that is not germane to this discussion: I probably wouldn't have given it public data members in reality.)
struct request_line {
  std::string method, uri, version;
};

std::ostream& operator<<( std::ostream& os, request_line const& rl ) {
  os << rl.method << ' ' << rl.uri;
  if ( not rl.version.empty() )
    os << ' ' << rl.version;
  return os << "\r\n";
}
Each of the strings method, uri and version is guaranteed not to contain any white-space, and method and uri are guaranteed not to be empty. Some examples of serialised output are:
"POST /form.cgi HTTP/1.1\r\n"  // all three components present
"GET /index.html\r\n"          // version omitted
So writing an operator>> really ought to be a simple exercise using operator>>( std::istream&, std::string& ), as that already reads in one white-space delimited token. And the solution is nearly, but not quite, to do:
std::istream& operator>>( std::istream& in, request_line& rl ) {
  return in >> rl.method >> rl.uri >> rl.version >> std::ws;
}
The complication arises from three facts:
1. All three components must be on the same line. "GET\r\n/index.html \r\n" should produce an error.
2. The version component is optional: if I reach the end of the line without having scanned in rl.version, that's fine. (The same is not true of the uri.)
3. I want to be strict about white-space checking: I want there to be precisely one space character (not a tab, or any other sort of white-space) between tokens, and none at the beginning or end of the line.
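As a quick illustration of these three points, here is a minimal sketch that runs the naive extractor over a few inputs. The helper name naive_parses is made up for this example and is not part of the original code.

```cpp
#include <istream>
#include <sstream>
#include <string>

struct request_line { std::string method, uri, version; };

// The "nearly, but not quite" extractor discussed above.
std::istream& operator>>(std::istream& in, request_line& rl) {
    return in >> rl.method >> rl.uri >> rl.version >> std::ws;
}

// Hypothetical helper: true if the naive extractor accepts `text`.
bool naive_parses(const std::string& text) {
    std::istringstream in(text);
    request_line rl;
    return !(in >> rl).fail();
}
```

With this, naive_parses("GET\r\n/index.html\tHTTP/1.0\r\n") returns true even though the tokens are split across lines and separated by a tab, while the perfectly valid naive_parses("GET /index.html\r\n") returns false, because the extraction of the missing version runs into end-of-input and sets failbit.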
If I were writing Perl, this would be easy: I would write something like:
($method, $uri, $version) = /^([^\s]+) ([^\s]+)(?: ([^\s]+))?\r\n$/;
But in C++ these all add complexity. And ordinarily it is at this point that I stop using formatted I/O and call std::getline and the std::string::find_first* functions. (Or perhaps Boost.Spirit or Boost.Regex in hairier examples.) But I'm hoping someone can suggest how to do this elegantly with std::istream.
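For comparison, the std::getline plus std::string::find* fallback mentioned above might look roughly like the following. This is only a sketch under the stated requirements: parse_request_line is a hypothetical helper name, and the choice to treat a missing final "\n" as an error is an assumption.

```cpp
#include <istream>
#include <sstream>  // only needed for the std::istringstream usage below
#include <string>

struct request_line { std::string method, uri, version; };

// Hypothetical helper: strictly parse one line whose trailing '\n' has
// already been removed by std::getline. Returns false on any violation.
bool parse_request_line(std::string line, request_line& rl) {
    // The line must still end in '\r' (the rest of the "\r\n" terminator).
    if (line.empty() || line[line.size() - 1] != '\r') return false;
    line.erase(line.size() - 1);

    // Only single spaces may separate tokens: reject other white-space.
    if (line.find_first_of("\t\r\n\v\f") != std::string::npos) return false;

    std::string::size_type const first = line.find(' ');
    if (first == std::string::npos) return false;  // no uri at all
    std::string::size_type const second = line.find(' ', first + 1);

    request_line tmp;
    tmp.method = line.substr(0, first);
    if (second == std::string::npos) {
        tmp.uri = line.substr(first + 1);  // version omitted
    } else {
        tmp.uri = line.substr(first + 1, second - first - 1);
        tmp.version = line.substr(second + 1);
        // A version that is present must be non-empty and space-free.
        if (tmp.version.empty() ||
            tmp.version.find(' ') != std::string::npos)
            return false;
    }
    if (tmp.method.empty() || tmp.uri.empty()) return false;
    rl = tmp;
    return true;
}

std::istream& operator>>(std::istream& in, request_line& rl) {
    std::string line;
    // eof() after a successful getline means the final '\n' was missing.
    if (std::getline(in, line) && (in.eof() || !parse_request_line(line, rl)))
        in.setstate(std::ios_base::failbit);
    return in;
}
```

Reading from a std::istringstream holding "POST /form.cgi HTTP/1.1\r\nGET /index.html\r\n" should yield two request lines, the second with an empty version, whereas "GET\r\n/index.html \r\n" sets failbit on the first extraction.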
-- Richard Smith
[ See LINK for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ] |

Vidar Hasfjord
Posted: Thu Jun 19, 2008 8:22 am
Post subject: Re: Simple parsing with iostreams
On Jun 19, 2:28 am, Richard Smith <rich...@ex-parrot.com> wrote:
| Quote: | [...] Consider a simple class representing an HTTP request line. [...] Some examples of serialised output are:
"POST /form.cgi HTTP/1.1\r\n"  // all three components present
"GET /index.html\r\n"          // version omitted
[...]
1. All three components must be on the same line. "GET\r\n/index.html \r\n" should produce an error.
2. The version component is optional: if I reach the end of the line without having scanned in rl.version, that's fine. (The same is not true of the uri.)
3. I want to be strict about white-space checking: I want there to be precisely one space character (not a tab, or any other sort of white-space) between tokens, and none at the beginning or end of the line.
If I were writing Perl, this would be easy: I would write something like:
($method, $uri, $version) = /^([^\s]+) ([^\s]+)(?: ([^\s]+))?\r\n$/;
But in C++ these all add complexity. And ordinarily it is at this point that I stop using formatted I/O and call std::getline and the std::string::find_first* functions. (Or perhaps Boost.Spirit or Boost.Regex in hairier examples.)
|
The tr1::regex (Boost.Regex) library is ideally suited for this problem. To parse the strings in this example you already need many features of a parsing library.
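To make the regex suggestion concrete, here is a minimal sketch. It assumes a C++11 compiler, where tr1::regex became std::regex, and it uses the default ECMAScript grammar. The function name match_request_line is hypothetical.

```cpp
#include <regex>
#include <string>

struct request_line { std::string method, uri, version; };

// Hypothetical helper: match one complete serialised request line,
// including its "\r\n" terminator. std::regex_match must match the whole
// string, so the ^ and $ anchors of the Perl version are not needed.
bool match_request_line(const std::string& text, request_line& rl) {
    static const std::regex pattern("(\\S+) (\\S+)(?: (\\S+))?\\r\\n");
    std::smatch m;
    if (!std::regex_match(text, m, pattern)) return false;
    rl.method = m[1].str();
    rl.uri = m[2].str();
    // An unmatched optional group yields an empty string.
    rl.version = m[3].matched ? m[3].str() : std::string();
    return true;
}
```

The literal space characters in the pattern are what enforce the "precisely one space" rule; a tab, a doubled space, or a line break between tokens makes the match fail.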
| Quote: | But I'm hoping someone can suggest how to do this elegantly with std::istream.
|
While I recommend tr1::regex for this problem, here's an istream solution expressing the parsing logic using ordinary C++ control flow:
istream& operator >> (istream& is, request_line& rl)
{
  is >> resetiosflags (ios::skipws)
     >> rl.method >> accept (' ') >> rl.uri;
  if (is.peek () == ' ')
    is >> accept (' ') >> rl.version;
  return is >> accept ('\r') >> accept ('\n');
}
The "accept" manipulator used here should only allow and consume the given character and otherwise set the failbit of the stream. I'll leave the implementation as an exercise.
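One possible way to do that exercise is sketched below. It is not the poster's implementation: the helper type accept_t is invented here, and clearing rl.version and restoring the stream's flags are added assumptions rather than part of the code above.

```cpp
#include <iomanip>
#include <ios>
#include <istream>
#include <sstream>  // only needed for the std::istringstream usage below
#include <string>

struct request_line { std::string method, uri, version; };

// Hypothetical helper type carrying the character that must come next.
struct accept_t { char expected; };

// Factory so that call sites read like the post: is >> accept(' ')
inline accept_t accept(char c) {
    accept_t a = { c };
    return a;
}

// Consume the expected character if it is next; otherwise set failbit
// and leave the offending character in the stream.
std::istream& operator>>(std::istream& is, accept_t a) {
    if (is.peek() == std::char_traits<char>::to_int_type(a.expected))
        is.ignore();
    else
        is.setstate(std::ios_base::failbit);
    return is;
}

std::istream& operator>>(std::istream& is, request_line& rl) {
    std::ios_base::fmtflags const saved = is.flags();
    is >> std::resetiosflags(std::ios_base::skipws)
       >> rl.method >> accept(' ') >> rl.uri;
    if (is.peek() == ' ')
        is >> accept(' ') >> rl.version;
    else
        rl.version.clear();  // assumption: an omitted version reads as empty
    is >> accept('\r') >> accept('\n');
    is.flags(saved);         // assumption: callers expect skipws restored
    return is;
}
```

Because skipws is off, extracting a std::string fails when the next character is white-space, which is what rejects leading spaces, doubled spaces and a trailing space before "\r\n". Extracting two request_line objects from a std::istringstream holding "POST /form.cgi HTTP/1.1\r\nGET /index.html\r\n" should succeed, while "GET\r\n/index.html \r\n" sets failbit.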
Regards, Vidar Hasfjord

Alberto Ganesh Barbati
Posted: Thu Jun 19, 2008 10:06 pm
Post subject: Re: Simple parsing with iostreams
Richard Smith wrote:
| Quote: |
"POST /form.cgi HTTP/1.1\r\n"  // all three components present
"GET /index.html\r\n"          // version omitted
So writing an operator>> really ought to be a simple exercise using operator>>( std::istream&, std::string& ), as that already reads in one white-space delimited token. And the solution is nearly, but not quite, to do:
std::istream& operator>>( std::istream& in, request_line& rl ) {
  return in >> rl.method >> rl.uri >> rl.version >> std::ws;
}
The complication arises from three facts:
1. All three components must be on the same line. "GET\r\n/index.html \r\n" should produce an error.
2. The version component is optional: if I reach the end of the line without having scanned in rl.version, that's fine. (The same is not true of the uri.)
3. I want to be strict about white-space checking: I want there to be precisely one space character (not a tab, or any other sort of white-space) between tokens, and none at the beginning or end of the line.
If I were writing Perl, this would be easy: I would write something like:
($method, $uri, $version) = /^([^\s]+) ([^\s]+)(?: ([^\s]+))?\r\n$/;
But in C++ these all add complexity. And ordinarily it is at this point that I stop using formatted I/O and call std::getline and the std::string::find_first* functions. (Or perhaps Boost.Spirit or Boost.Regex in hairier examples.) But I'm hoping someone can suggest how to do this elegantly with std::istream.
|
I would go with the regex solution. It's much more powerful, and you are less likely to miss some corner case you might want to check.
Anyway, since you asked, here's a solution with iostreams only:
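----------------
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

std::istream& extract_space(std::istream& is)
{
  // Consume exactly one space character, or fail.
  if (is.peek() != ' ')
    is.setstate(std::ios_base::failbit);
  else
    is.ignore();

  return is;
}

std::istream& assert_non_ws(std::istream& is)
{
  // Fail if the next character is white-space; consume nothing.
  if (std::isspace(is.peek()))
    is.setstate(std::ios_base::failbit);
  return is;
}

std::istream& operator>>(std::istream& is, request_line& rl)
{
  std::string s;
  if (std::getline(is, s))
  {
    // getline removes only the '\n'; require and drop the preceding '\r'.
    if (s.empty() || s[s.size() - 1] != '\r')
    {
      is.setstate(std::ios_base::failbit);
      return is;
    }
    s.erase(s.size() - 1);

    std::istringstream line(s);
    std::string method, uri, version;
    line >> assert_non_ws >> method
         >> extract_space >> assert_non_ws >> uri
         >> extract_space >> assert_non_ws >> version;

    // Reaching end-of-line is how a complete parse is recognised.
    if (!method.empty() && !uri.empty() && line.eof())
    {
      rl.method = method;
      rl.uri = uri;
      rl.version = version;
    }
    else
    {
      is.setstate(std::ios_base::failbit);
    }
  }
  return is;
}
----------------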
As you can see, it's a lot more work than one might expect, even for a parse as simple as this. But maybe that's because it's not actually that simple, is it? ;)
With a little more effort you may also get rid of the intermediate istringstream. This is left as an exercise for the reader (hint: you need an extractor that may fail in two different ways).
HTH,
Ganesh