XPath
Last Post 21 Apr 2010 05:03 PM by Extractor. 8 Replies.
P. Reiss
 |
| 13 Apr 2010 04:13 AM |
|
I use Djuggler for a lot of web extraction. Once I am able to locate unique tags etc., the extraction is fast and it works! BUT locating starting and ending tags is a hassle! I don't do just 1 or 2 text extractions, I do a lot, so pinpointing the correct tags and testing that the extraction works correctly is always very time consuming... I have read that XPath is a much better way to locate extraction content. Is there any way to use a site's XPath in order to create web extraction scripts? Thanks |
|
|
|
|
Extractor
 |
| 13 Apr 2010 06:45 AM |
|
Second that. Supporting XPath is necessary for extracting data out of dynamic web pages. I use Djuggler and several other web extractors. Djuggler runs fast and stable, but it takes time to make the right script for the job. The Web Inspector can already identify the XPath of an element, so why not support XPath in the script? The messy Actions List would become much more user-friendly. Keep going, Djuggler. |
|
|
|
|
P.Reiss
 |
| 13 Apr 2010 12:34 PM |
|
Thanks for backing up the opinion. I do a lot of web extraction and have experimented with several different software packages as well. I think Djuggler is great but, like we both mentioned, creating those web extraction scripts can be very trial-and-error and time consuming. Djuggler does a lot more than extraction, so it's my preferred platform compared to what else is out there. Would love to hear what other programs you use; send me an email: ilcaa@yahoo.com
|
|
|
|
|
Support Team
 |
| 13 Apr 2010 01:32 PM |
|
Using XPath alongside a Copy Text Between action is an interesting idea. But XPath statements can become very complex. Could you give a simple example of how you would determine the XPath? DS |
|
|
|
|
Extractor
 |
| 14 Apr 2010 08:04 AM |
|
I don't think Djuggler can determine the XPath with the actions it provides. But in its Web Inspector, the XPath (sort of) is shown after you find an element with the Inspect button. You can Get Table Content by giving a Table Number (this Table Number is optional). Maybe that's the closest thing to XPath. Web Inspector also shows the number of a tag, e.g., IMG(5), DIV(12), but you can't use that number in an action to locate the tag. |
|
|
|
|
P.Reiss
 |
| 14 Apr 2010 09:12 PM |
|
Yes. Via the Inspect window, clicking the first element (of, say, 15 title links you would like to extract) would show the XPath and the number of that element, [1]. Then use a variable to replace this number, and also have a count-element function to automate the variable count process.
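The idea in the post above (take the path shown for element [1], swap the number for a variable, and count the elements to know when to stop) can be sketched roughly as follows. This is only an illustration in Python using the standard library's xml.etree.ElementTree, not a Djuggler feature. The sample page, the `links` id, and the `extract_titles` helper are all made up for the example, and the snippet assumes well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for illustration).
PAGE = """
<html><body>
  <div id="links">
    <a href="/a">First title</a>
    <a href="/b">Second title</a>
    <a href="/c">Third title</a>
  </div>
</body></html>
"""

def extract_titles(root):
    """Walk a[1], a[2], ... by substituting a variable for the index."""
    # Path as an inspector might report it, minus the trailing [1].
    base = ".//div[@id='links']/a"
    # "Count element" step: how many nodes match the path.
    count = len(root.findall(base))
    titles = []
    for i in range(1, count + 1):  # XPath positions start at 1
        node = root.find(f"{base}[{i}]")
        titles.append(node.text)
    return titles

root = ET.fromstring(PAGE)
titles = extract_titles(root)
print(titles)
```

In practice `root.findall(base)` already returns every match in one call, so the explicit index loop is only there to mirror the variable-and-counter approach described in the post.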
|
|
|
|
|
p.reiss
 |
| 14 Apr 2010 09:50 PM |
|
This software doesn't work that well, it has a lot of buffer problems, but I like what they did with XPath, especially the cross-hatch that identifies the XPath and records it for you... http://www.iewatch.com/wrprofessional.aspx |
|
|
|
|
Tijn
 |
| 20 Apr 2010 09:36 AM |
|
I think XPath can be useful in web extraction, however it can also be tricky. XPath points to numbered nodes in the DOM tree. When something changes in the page (this could be a simple ad on the page) the numbers change and your script will break. The Djuggler method is more reliable, I think. Most web sites today have nice CSS classes that can be used in web scraping. In Djuggler you use these classes in the Copy Text Between actions. In an XPath they can also be very helpful. But XPath does not have a wild-card, which makes it hard to harvest sites with classes like <TD class="item even"> and <TD class="item odd">. In Djuggler you would use <TD class="item*"> and I don't see how you would do that with an XPath. My conclusion: XPath would be a nice addition to Djuggler for the 'easy' web site scraping, but would most likely fail in the more complex situations. Tijn |
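On the wild-card point: XPath 1.0 does define string functions such as starts-with() and contains(), so an expression like //td[starts-with(@class, 'item')] would match both the "item even" and "item odd" cells in an engine that implements the full specification. The sketch below shows the same prefix match using only Python's standard xml.etree.ElementTree. That module supports just a subset of XPath and has no starts-with(), so here the path selects every td that has a class attribute and the prefix test is done in Python. The sample table and the `cells_with_class_prefix` helper are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample rows (an assumption for illustration).
ROWS = """
<table>
  <tr><td class="item even">Apples</td></tr>
  <tr><td class="item odd">Pears</td></tr>
  <tr><td class="header">Fruit</td></tr>
  <tr><td class="item even">Plums</td></tr>
</table>
"""

def cells_with_class_prefix(root, prefix):
    """Return the text of every td whose class attribute starts with prefix."""
    return [
        td.text
        # ElementTree's XPath subset: any descendant td that has a class attribute.
        for td in root.iterfind(".//td[@class]")
        # Stand-in for XPath 1.0 starts-with(@class, prefix).
        if td.get("class").startswith(prefix)
    ]

root = ET.fromstring(ROWS)
items = cells_with_class_prefix(root, "item")
print(items)
```

The header cell is skipped because its class does not begin with "item", which is the same effect as the <TD class="item*"> pattern described above.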
|
|
|
|
Extractor
 |
| 21 Apr 2010 05:03 PM |
|
Unfortunately yes, we can't get flexibility and automation at the same time. (1) XPath does not have to find a node by an exact number. How about getting the text in .*.td[2].*.a[1] located in .*.div.div.div.table.*.td? (2) In choosing a node, we can add more restrictions by giving extra attribute names and values. After Get Web Page, I treat the page content as a big bank of text. So there are many Find Text from Position and Copy Text from Position actions, then Get Web Page again, and hopefully at this stage I can Read Next Image and Save Image. That brings back memories of the good old days learning data structures in C. |
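Points (1) and (2) can be illustrated with a path that uses descendant steps instead of a full chain of numbered nodes, plus one attribute restriction. This is a rough sketch in Python with the standard xml.etree.ElementTree module, not something Djuggler provides. The sample page, the `results` class, and the `find_links` helper are assumptions invented for the example, and the dotted notation from the post is rendered here as `//` descendant steps.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with an ad block ahead of the data
# (an assumption for illustration).
PAGE = """
<html><body>
  <div class="ad">Banner</div>
  <div><div><div>
    <table class="results">
      <tr><td>1</td><td><span><a href="/item/1">Widget</a></span></td></tr>
      <tr><td>2</td><td><span><a href="/item/2">Gadget</a></span></td></tr>
    </table>
  </div></div></div>
</body></html>
"""

# Three nested divs, a table restricted by attribute, then the first link
# anywhere inside the second cell of each row. Only td[2] and a[1] are
# positional; nothing depends on how many divs precede the table.
LINK_PATH = ".//div/div/div/table[@class='results']//td[2]//a[1]"

def find_links(root):
    """Return (text, href) pairs for the links matched by LINK_PATH."""
    return [(a.text, a.get("href")) for a in root.iterfind(LINK_PATH)]

root = ET.fromstring(PAGE)
links = find_links(root)
print(links)
```

Because the path never counts the top-level divs, the extra ad block at the top of the page does not shift the match.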