python mechanize/libxml2dom question
bruce
Posted: Tue Sep 02, 2008 1:35 am Post subject: python mechanize/libxml2dom question
hi...
i've got the following situation, with the following test url: "http://schedule.psu.edu/soc/fall/Alloz/a-c/acctg.html#".
i can generate a list of the tables i want for the courses on the page. however, when i try to create the xpath query, and plug it into the xpath within python, i'm missing something. if i have a parent xpath query, that generates a list of results/nodes... how can i then use the individual parent node, and trigger off of it, to get further information.
i tried using the following chunk of code with no luck.
# s is the html from the course file
d = libxml2dom.parseString(s, html=1)

# at this point, we should have a valid "d" representation
print "sdddd=",s
aa=libxml2dom.toString(d)
print "hereeeeee \n\n\n"
print "aa",aa
#sys.exit()

# **** course names
cpath='//table[position()>0]/descendant::td[position()=2][@width="85%"]/../td[1]/font/a[2]/text()'
cpath_=[]
cpath_=d.xpath(cpath)
print "len=",len(cpath_)
if len(cpath_)>0:
    for cpath in cpath_:
        # get the coursename info
        cname=cpath.toString()
        print "cpath=",cpath
        print "cname=",cname
        rr="./../../../../../../following-sibling::table//tr[position()>1]"
        rr=cpath.xpath()
        print "rrlen=",len(rr)
        print rr[0].toString()
        sys.exit()
i'm assuming that there's a libxml2node method that will do what i need that i'm missing...
pointers/comments would be helpful here...
thanks!

Stefan Behnel
Posted: Tue Sep 02, 2008 4:06 am Post subject: Re: python mechanize/libxml2dom question
bruce wrote:
| Quote: | i've got the following situation, with the following test url: "http://schedule.psu.edu/soc/fall/Alloz/a-c/acctg.html#".
i can generate a list of the tables i want for the courses on the page. however, when i try to create the xpath query, and plug it into the xpath within python, i'm missing something. if i have a parent xpath query, that generates a list of results/nodes... how can i then use the individual parent node, and trigger off of it, to get further information. [code example stripped] |
You should really use lxml. It has callable XPath objects that feel like Python functions, and its Element objects have a getparent() method that gets you to the parent of the node. Plus, text strings that you get back from an XPath evaluation also have a getparent() method that returns the Element object that holds the text. I think that's what you were looking for.
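A minimal sketch of this approach, assuming lxml is installed. The HTML fragment, the course label and the name `find_course_names` are made up for illustration and stand in for the real schedule page:

```python
from lxml import etree, html

# Made-up fragment standing in for the real schedule page (an assumption).
SAMPLE = """
<html><body>
<table>
  <tr>
    <td width="15%"><font><a name="a101"></a><a href="#">ACCTG 101</a></font></td>
    <td width="85%">Intro Accounting</td>
  </tr>
</table>
<table>
  <tr><th>Section</th><th>Time</th></tr>
  <tr><td>001</td><td>MWF 9:05</td></tr>
  <tr><td>002</td><td>TR 1:00</td></tr>
</table>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# A compiled XPath object is called like a function on an element.
find_course_names = etree.XPath("//td[@width='85%']/../td[1]/font/a[2]/text()")
names = find_course_names(doc)

for name in names:
    link = name.getparent()        # the <a> element that holds the text
    table = link
    while table.tag != "table":    # climb with getparent() to the enclosing table
        table = table.getparent()
    # Query relative to that table for the rows of the table that follows it.
    rows = table.xpath("following-sibling::table[1]//tr[position()>1]")
    print(str(name), link.tag, len(rows))
```

Climbing with getparent() avoids hard-coding a chain of `..` steps in the relative query.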
Stefan

Paul Boddie
Posted: Tue Sep 02, 2008 8:52 am Post subject: Re: python mechanize/libxml2dom question
On 2 Sep, 05:35, "bruce" <bedoug...@earthlink.net> wrote:
| Quote: | i've got the following situation, with the following test url: "http://schedule.psu.edu/soc/fall/Alloz/a-c/acctg.html#".
i can generate a list of the tables i want for the courses on the page. however, when i try to create the xpath query, and plug it into the xpath within python, i'm missing something. if i have a parent xpath query, that generates a list of results/nodes... how can i then use the individual parent node, and trigger off of it, to get further information.
|
You can always use the parentNode property on the nodes you get as results from the XPath query, but I guess what you want to do is to "rewind" and issue queries relative to some ancestor of the result nodes.
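As a rough illustration of walking upwards with parentNode, here is a sketch that uses the standard library's xml.dom.minidom as a stand-in, on the assumption that libxml2dom nodes expose the same DOM-style property. The helper name `nearest_ancestor` and the sample markup are hypothetical:

```python
from xml.dom import minidom

def nearest_ancestor(node, tag_name):
    """Follow parentNode links upward until an element named tag_name is found."""
    current = node.parentNode
    while current is not None:
        if current.nodeType == current.ELEMENT_NODE and current.tagName == tag_name:
            return current
        current = current.parentNode
    return None  # ran past the document node without finding a match

# Made-up markup: a text node nested several levels inside a table.
doc = minidom.parseString(
    "<table id='course'><tr><td><font><a>ACCTG 101</a></font></td></tr></table>"
)
link = doc.getElementsByTagName("a")[0]
text_node = link.firstChild

table = nearest_ancestor(text_node, "table")
print(table.getAttribute("id"))
```

Once the ancestor is in hand, further queries can be issued relative to it instead of to the text node.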
[...]
| Quote: | # **** course names
cpath='//table[position()>0]/descendant::td[position()=2][@width="85%"]/../td[1]/font/a[2]/text()'
|
This obviously gets you right down to the hyperlink text within a part of the table. However, it may be easier to break this query up in order to get a more manageable overview of the process. My understanding of the above query is that it can first be rewritten as the following:
cpath = "//table//td[position()=2 and @width='85%']/../td[1]/font/a[2]/text()"
Or even this:
cpath = "//table[.//td[position()=2 and @width='85%']]//td[1]/font/a[2]/text()"
But what you could do is to obtain the important tables first:
tables = d.xpath("//table[.//td[position()=2 and @width='85%']]")
Here, we use the bracketed term to ensure that the table is the right one, but we don't actually descend inside the table.
You could, from this, get the name by doing a query from each of these tables:
for table in tables:
    cnames = table.xpath(".//td[1]/font/a[2]/text()") # list of text nodes
You might want to consider a slightly safer approach when getting the text:
cnames = table.xpath(".//td[1]/font/a[2]") # list of nodes, should be one
name = cnames[0].textContent # all the text from the link
When looking for the details, you can then write your query relative to these tables, rather than having to figure out the location of the details from the text nodes you've just extracted.
details = table.xpath("following-sibling::table[1]") # list of max 1 node
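The steps above can be put together roughly as follows. This sketch swaps in lxml's xpath() and text_content() for the libxml2dom calls, so it shows the same table-first pattern rather than the libxml2dom API itself. The HTML fragment and the `courses` dictionary are assumptions made for illustration:

```python
from lxml import html

# Made-up fragment: a course header table followed by its details table.
SAMPLE = """
<html><body>
<table>
  <tr>
    <td width="15%"><font><a name="c1"></a><a href="#">ACCTG 211</a></font></td>
    <td width="85%">Financial and Managerial Accounting</td>
  </tr>
</table>
<table>
  <tr><th>Section</th><th>Time</th></tr>
  <tr><td>001</td><td>MWF 9:05</td></tr>
</table>
</body></html>
"""

doc = html.fromstring(SAMPLE)
courses = {}

# Step 1: select the header tables; the predicate tests but does not descend.
for table in doc.xpath("//table[.//td[position()=2 and @width='85%']]"):
    # Step 2: query relative to the table for the course link.
    links = table.xpath(".//td[1]/font/a[2]")
    if not links:
        continue
    name = links[0].text_content().strip()
    # Step 3: the details are in the next sibling table, if there is one.
    details = table.xpath("following-sibling::table[1]")
    rows = details[0].xpath(".//tr[position()>1]") if details else []
    courses[name] = [[cell.text_content() for cell in row.xpath("./td")]
                     for row in rows]

print(courses)
```

Keeping each query relative to the table means no step has to count how many levels separate a text node from its table.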
| Quote: | i'm assuming that there's a libxml2node method that will do what i need that i'm missing...
|
You should be able to issue XPath queries from any node. There have been issues with libxml2dom and attribute nodes obtained from XPath, but these were fixed in recent changesets.
Paul