|  | converting a sed / grep / awk / . . . bash pipe line into py |  | |
| | | hofer |  |
| Posted: Tue Sep 02, 2008 5:36 pm Post subject: converting a sed / grep / awk / . . . bash pipe line into py |  |
| |  | |
Hi,
Something I have to do very often is filtering / transforming line-based file contents and storing the result in an array or a dictionary.
Very often the functionality already exists in the form of a shell script with sed / awk / grep, ... and I would like to have the same implementation in my script.
What's a compact, efficient (no intermediate arrays generated / regexps compiled only once) way in Python to write this kind of 'pipeline'?
Example 1 (in bash): (annotated with comments, thus not working if copied / pasted)
#-------------------------------------------------------------------------------------------
cat file \                            ### read from file
| sed 's/\.\..*//' \                  ### remove '//' comments
| sed 's/#.*//' \                     ### remove '#' comments
| grep -v '^\s*$' \                   ### get rid of empty lines
| awk '{ print $1 + $2 " " $2 }' \    ### knowing, that all remaining lines contain always at least \
                                      ### two integers calculate sum and 'keep' second number
| grep '^42 '                         ### keep lines for which sum is 42
| awk '{ print $2 }'                  ### print number
Same example in perl:
# I guess (but didn't try) that the perl example will create more intermediate
# data structures than necessary.
# Ideally the python implementation shouldn't do this, but just 'chain' iterators.
#-------------------------------------------------------------------------------------------
my $filename= "file";
open(my $fh,$filename) or die "failed opening file $filename";

# order of 'pipeline' is syntactically reversed (if compared to shell script)
my @numbers =
    map  { $_->[1] }                        # extract num 2
    grep { $_->[0] == 42 }                  # keep lines with result 42
    map  { [ $_->[0]+$_->[1],$_->[1] ] }    # calculate sum of first two nums and keep second num
    map  { [ split(' ',$_,3) ] }            # split by white space
    grep { ! ($_ =~ /^\s*$/) }              # remove empty lines
    map  { $_ =~ s/#.*// ; $_}              # strip '#' comments
    map  { $_ =~ s/\/\/.*// ; $_}           # strip '//' comments
    <$fh>;
print "Numbers are:\n",join("\n",@numbers),"\n";
thanks in advance for any suggestions of how to code this (keeping the comments)
H |
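Editor's sketch: the "chain iterators, compile regexps once" idea asked about above can be written with plain generators. This is a minimal Python 3 sketch, not code from the thread. The names `strip_comments`, `drop_blank`, `sum_and_second`, `numbers_summing_to` and the `sample` data are hypothetical, and the comment stripping follows the stated intent (remove '//' comments) rather than the `\.\.` pattern in the sed command.

```python
import re

# Compiled once at import time, not once per line.
SLASH_COMMENT = re.compile(r"//.*")
HASH_COMMENT = re.compile(r"#.*")


def strip_comments(lines):
    """Remove '//' and '#' comments (the two sed stages)."""
    for line in lines:
        yield HASH_COMMENT.sub("", SLASH_COMMENT.sub("", line))


def drop_blank(lines):
    """Skip empty or whitespace-only lines (the grep -v stage)."""
    return (line for line in lines if line.strip())


def sum_and_second(lines):
    """Yield (first + second, second) per line (the first awk stage)."""
    for line in lines:
        first, second = (int(field) for field in line.split()[:2])
        yield first + second, second


def numbers_summing_to(lines, target=42):
    """Chain the stages; no line is processed until the caller iterates."""
    pairs = sum_and_second(drop_blank(strip_comments(lines)))
    return (second for total, second in pairs if total == target)


# Hypothetical sample input; an open file object would work the same way.
sample = [
    "40 2   // sums to 42\n",
    "# only a comment\n",
    "\n",
    "1 2 3\n",
    "19 23 # also 42\n",
]
print(list(numbers_summing_to(sample)))  # [2, 23]
```

Each stage takes an iterable of lines and returns a lazy iterable, so the stages read in the same order as the shell pipeline and no intermediate list is built. Like the awk stage, `sum_and_second` assumes every remaining line starts with two integers and raises `ValueError` otherwise.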
| |
| | | Marc 'BlackJack' Rintsch |  |
| Posted: Tue Sep 02, 2008 5:36 pm Post subject: Re: converting a sed / grep / awk / . . . bash pipe line int |  |
On Tue, 02 Sep 2008 10:36:50 -0700, hofer wrote:
| Quote: | sed 's/\.\..*//' \ ### remove '//' comments | sed 's/#.*//'
|
Comment does not match the code. Or vice versa. :-)
Untested:
from __future__ import with_statement
from itertools import ifilter, ifilterfalse, imap

def is_junk(line):
    line = line.rstrip()
    return not line or line.startswith('//') or line.startswith('#')

def extract_numbers(line):
    result = map(int, line.split()[:2])
    assert len(result) == 2
    return result

def main():
    with open('test.txt') as lines:
        clean_lines = ifilterfalse(is_junk, lines)
        pairs = imap(extract_numbers, clean_lines)
        # str() is needed here: join() accepts only strings, and b is an int
        print '\n'.join(str(b) for a, b in pairs if a + b == 42)

if __name__ == '__main__':
    main()
Ciao, Marc 'BlackJack' Rintsch |
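Editor's sketch: the code above targets Python 2. Under Python 3 the same approach would probably look like the following, because `itertools.ifilterfalse` became `itertools.filterfalse` and the built-in `map` is already lazy. The function name `matching_numbers` and the `lines` sample are hypothetical; an in-memory list stands in for the opened file.

```python
from itertools import filterfalse


def is_junk(line):
    """True for blank lines and for lines starting with a comment marker."""
    line = line.rstrip()
    return not line or line.startswith("//") or line.startswith("#")


def extract_numbers(line):
    """Return the first two whitespace-separated fields as integers."""
    first, second = map(int, line.split()[:2])
    return first, second


def matching_numbers(lines, target=42):
    """Lazily yield the second number of each pair whose sum is `target`."""
    clean_lines = filterfalse(is_junk, lines)   # lazy iterator
    pairs = map(extract_numbers, clean_lines)   # lazy in Python 3
    return (b for a, b in pairs if a + b == target)


# Hypothetical sample data standing in for open('test.txt').
lines = ["// header\n", "\n", "20 22\n", "5 6\n", "41 1 trailing\n"]
print("\n".join(str(b) for b in matching_numbers(lines)))
```

As in the original, only whole-line comments are treated as junk; a trailing comment is harmless because just the first two fields are converted.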
| |
| | | Peter Otten |  |
| Posted: Wed Sep 03, 2008 5:15 am Post subject: Re: converting a sed / grep / awk / . . . bash pipe line int |  |
| |  | |
hofer wrote:
| Quote: | Something I have to do very often is filtering / transforming line based file contents and storing the result in an array or a dictionary.
Very often the functionality already exists in the form of a shell script with sed / awk / grep, ... and I would like to have the same implementation in my script.
What's a compact, efficient (no intermediate arrays generated / regexps compiled only once) way in Python to write this kind of 'pipeline'?
Example 1 (in bash): (annotated with comments, thus not working if copied / pasted)
cat file \                            ### read from file
| sed 's/\.\..*//' \                  ### remove '//' comments
| sed 's/#.*//' \                     ### remove '#' comments
| grep -v '^\s*$' \                   ### get rid of empty lines
| awk '{ print $1 + $2 " " $2 }' \    ### knowing, that all remaining lines contain always at least \
                                      ### two integers calculate sum and 'keep' second number
| grep '^42 '                         ### keep lines for which sum is 42
| awk '{ print $2 }'                  ### print number
thanks in advance for any suggestions of how to code this (keeping the comments)
|
for line in open("file"):  # read from file
    try:
        a, b = map(int, line.split(None, 2)[:2])  # remove extra columns,
                                                  # convert to integer
    except ValueError:
        pass  # remove comments, get rid of empty lines,
              # skip lines with less than two integers
    else:
        # line did start with two integers
        if a + b == 42:  # keep lines for which the sum is 42
            print b  # print number
The hard part was keeping the comments ;)
Without them it looks better:
import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b
Peter |
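Editor's sketch: the try/except approach above can also be wrapped in a generator so the result can be stored in a list or dictionary, as the original question asked. This is a Python 3 sketch; the name `second_numbers` and the `sample` data are hypothetical.

```python
def second_numbers(lines, target=42):
    """Yield the second integer of every line whose first two integers sum to `target`."""
    for line in lines:
        try:
            a, b = map(int, line.split()[:2])
        except ValueError:
            # Raised for non-numeric fields (comments) and for lines with
            # fewer than two fields (blank lines, single numbers).
            continue
        if a + b == target:
            yield b


# Hypothetical sample data; sys.stdin or an open file would work the same way.
sample = ["# comment\n", "\n", "7\n", "40 2 // note\n", "10 x\n", "21 21 21\n"]
print(list(second_numbers(sample)))  # [2, 21]
```

One `ValueError` handler covers both failure modes because tuple unpacking with too few values and `int()` on a non-numeric string raise the same exception type.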
| |
| | | Paul McGuire |  |
| Posted: Wed Sep 03, 2008 5:43 am Post subject: Re: converting a sed / grep / awk / . . . bash pipe line int |  |
| |  | |
On Sep 2, 12:36 pm, hofer <bla...@dungeon.de> wrote:
| Quote: | Hi,
Something I have to do very often is filtering / transforming line based file contents and storing the result in an array or a dictionary.
Very often the functionallity exists already in form of a shell script with sed / awk / grep , . . . and I would like to have the same implementation in my script
|
All that sed'ing, grep'ing and awk'ing, you might want to take a look at pyparsing. Here is a pyparsing take on your posted problem:
from pyparsing import LineEnd, Word, nums, LineStart, OneOrMore, restOfLine
test = """
1 2 3 47 23 // this will never match # blank lines are not of any interest 91 26
23 19
41 1 97 26 // extra numbers don't matter """
# define pyparsing expressions to match a line of integers
EOL = LineEnd()
integer = Word(nums)

# by default, pyparsing will implicitly skip over whitespace and
# newlines, so EOL is skipped over by default - this would mix together
# integers on consecutive lines - we only want OneOrMore integers as long
# as they are on the same line, that is, integers with no intervening
# EOL's
line_of_integers = (LineStart() + integer + OneOrMore(~EOL + integer))

# use a parse action to identify the target lines
def select_significant_values(t):
    v1, v2 = map(int, t[:2])
    if v1+v2 == 42:
        print v2
line_of_integers.setParseAction(select_significant_values)

# skip over comments, wherever they are
line_of_integers.ignore( '//' + restOfLine )
line_of_integers.ignore( '#' + restOfLine )

# use the line_of_integers expression to search through the test text
# the parse action will print the matching values
line_of_integers.searchString(test)
-- Paul |
| |
| | | Roy Smith |  |
| Posted: Wed Sep 03, 2008 9:54 am Post subject: Re: converting a sed / grep / awk / . . . bash pipe line int |  |
| |  | |
In article <g9ldi5$2ea$03$1@news.t-online.com>, Peter Otten <__peter__@web.de> wrote:
| Quote: | Without them it looks better:
import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b
|
I'm philosophically opposed to one-liners like:
| Quote: | a, b = map(int, line.split(None, 2)[:2])
|
because they're difficult to understand at a glance. You need to visually parse it and work your way out from the inside to figure out what's going on. Better to keep it longer and simpler.
Now that I've got my head around it, I realized there's no reason to make the split part so complicated. No reason to limit how many splits get done if you're explicitly going to slice the first two. And since you don't need to supply the second argument, the first one can be defaulted as well. So, you immediately get down to:
| Quote: | a, b = map(int, line.split()[:2])
|
which isn't too bad. I might take it one step further, however, and do:
| Quote: |
fields = line.split()[:2]
a, b = map(int, fields)
|
in fact, I might even get rid of the very generic, but conceptually overkill, use of map() and just write:
| Quote: |
a, b = line.split()[:2]
a = int(a)
b = int(b)
|
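Editor's sketch: the claim that the `maxsplit` argument is unnecessary when the result is sliced anyway can be checked directly. This is a small Python 3 sketch; the sample `line` is borrowed from the test data posted earlier in the thread, and the variable names are hypothetical.

```python
line = "41 1 97 26 // extra numbers don't matter\n"

limited = line.split(None, 2)[:2]   # at most two splits, then slice
plain = line.split()[:2]            # split on all whitespace, then slice

print(limited, plain)               # both are ['41', '1']

# The three spellings discussed above all produce the same integers.
a, b = map(int, plain)
x, y = plain
x = int(x)
y = int(y)
print((a, b) == (x, y))             # True
```

The two forms differ only in how much work `split()` does before the slice throws the extra fields away, not in the result.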
| |
| | | Peter Otten |  |
| Posted: Wed Sep 03, 2008 10:19 am Post subject: Re: converting a sed / grep / awk / . . . bash pipe line int |  |
| |  | |
Roy Smith wrote:
| Quote: | In article <g9ldi5$2ea$03$1@news.t-online.com>, Peter Otten <__peter__@web.de> wrote:
Without them it looks better:
import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b
I'm philosophically opposed to one-liners
|
I'm not, as long as you don't /force/ the code into one line.
| Quote: | like:
a, b = map(int, line.split(None, 2)[:2])
because they're difficult to understand at a glance. You need to visually parse it and work your way out from the inside to figure out what's going on. Better to keep it longer and simpler.
Now that I've got my head around it, I realized there's no reason to make the split part so complicated. No reason to limit how many splits get done if you're explicitly going to slice the first two. And since you don't need to supply the second argument, the first one can be defaulted as well. So, you immediately get down to:
a, b = map(int, line.split()[:2])
|
I agree that the above is an improvement.
| Quote: | which isn't too bad. I might take it one step further, however, and do:
fields = line.split()[:2]
a, b = map(int, fields)
in fact, I might even get rid of the very generic, but conceptually overkill, use of map() and just write:
a, b = line.split()[:2]
a = int(a)
b = int(b)
|
If you go that route your next step is to introduce another try...except, one for the unpacking and another for the integer conversion...
Peter |
| |
| | | Roy Smith |  |
| Posted: Wed Sep 03, 2008 11:35 am Post subject: Re: converting a sed / grep / awk / . . . bash pipe line int |  |
In article <g9lvc5$8qq$03$1@news.t-online.com>, Peter Otten <__peter__@web.de> wrote:
| Quote: | I might take it one step further, however, and do:
fields = line.split()[:2]
a, b = map(int, fields)
in fact, I might even get rid of the very generic, but conceptually overkill, use of map() and just write:
a, b = line.split()[:2]
a = int(a)
b = int(b)
If you go that route your next step is to introduce another try...except, one for the unpacking and another for the integer conversion...
|
Why another try/except? The potential unpack and conversion errors exist in both versions, and the existing try block catches them all. Splitting the one line up into three with some intermediate variables doesn't change that. |
| |
| | | Roy Smith |  |
| Posted: Wed Sep 03, 2008 11:41 am Post subject: Re: converting a sed / grep / awk / . . . bash pipe line int |  |
| |  | |
In article <7f2d4b4a-bc97-4b46-a31e-63f98e9fee73@34g2000hsh.googlegroups.com>, bearophileHUGS@lycos.com wrote:
| Quote: | Roy Smith: No reason to limit how many splits get done if you're explicitly going to slice the first two.
You are probably right for this problem, because most lines are 2 items long, but in scripts that have to process lines potentially composed of many parts, setting a max number of parts speeds up your script and reduces memory used, because you have less parts at the end.
Bye, bearophile
|
Sounds like premature optimization to me. Make it work and be easy to understand first. Then worry about how fast it is.
But, along those lines, I've often thought that split() needed a way to not just limit the number of splits, but to also throw away the extra stuff. Getting the first N fields of a string is something I've done often enough that refactoring the slicing operation right into the split() code seems worthwhile. And, it would be even faster  |
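Editor's sketch: `str.split()` has no option to discard the remainder, but a small helper can combine the split limit with the slice. This is a Python 3 sketch of that idea, not an existing library function; the name `first_fields` is hypothetical.

```python
def first_fields(line, count):
    """Return up to `count` leading whitespace-separated fields of `line`.

    split(None, count) stops after `count` splits, so it returns at most
    count + 1 items; the slice drops the unsplit remainder.
    """
    return line.split(None, count)[:count]


line = "a b c d e f g"
print(len(line.split()))          # 7 parts when everything is split
print(len(line.split(None, 2)))   # 3 parts: two fields plus the remainder
print(first_fields(line, 2))      # ['a', 'b']
```

Whether the split limit is measurably faster depends on how long the lines are; the sketch shows only that fewer parts are created, it does not measure speed.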
| |
| | | Peter Otten |  |
| Posted: Wed Sep 03, 2008 12:18 pm Post subject: Re: converting a sed / grep / awk / . . . bash pipe line int |  |
| |  | |
Roy Smith wrote:
| Quote: | In article <g9lvc5$8qq$03$1@news.t-online.com>, Peter Otten <__peter__@web.de> wrote:
I might take it one step further, however, and do:
fields = line.split()[:2]
a, b = map(int, fields)
in fact, I might even get rid of the very generic, but conceptually overkill, use of map() and just write:
a, b = line.split()[:2]
a = int(a)
b = int(b)
If you go that route your next step is to introduce another try...except, one for the unpacking and another for the integer conversion...
Why another try/except? The potential unpack and conversion errors exist in both versions, and the existing try block catches them all. Splitting the one line up into three with some intermediate variables doesn't change that.
|
As I understood it you didn't just split a line of code into three, but wanted two processing steps. These logical steps are then somewhat remixed by the shared error handling. You lose the information which step failed. In the general case you may even mask a bug.
Peter |
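Editor's sketch: the point about losing the information which step failed can be illustrated by giving each step its own handler. This is a Python 3 sketch under the assumption that the caller wants a reason string; the name `parse_pair` and the reason texts are hypothetical.

```python
def parse_pair(line):
    """Return ((a, b), None) on success or (None, reason) on failure."""
    try:
        a, b = line.split()[:2]           # step 1: unpacking
    except ValueError:
        return None, "fewer than two fields"
    try:
        return (int(a), int(b)), None     # step 2: integer conversion
    except ValueError:
        return None, "non-integer field"


for text in ["40 2 extra", "7", "a b", ""]:
    print(repr(text), parse_pair(text))
```

With a single shared `except ValueError`, the second and third inputs would be indistinguishable, which is the loss of information described above.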
| |
| | | Guest |  |
| Posted: Wed Sep 03, 2008 12:23 pm Post subject: Re: converting a sed / grep / awk / . . . bash pipe line int |  |
Roy Smith:
| Quote: | No reason to limit how many splits get done if you're explicitly going to slice the first two.
|
You are probably right for this problem, because most lines are 2 items long, but in scripts that have to process lines potentially composed of many parts, setting a max number of parts speeds up your script and reduces memory used, because you have less parts at the end.
Bye, bearophile |
| |