Most of the time spent running a script goes to waiting for the "Get Web Page" line to finish. You can run multiple instances of the same script to speed things up, at the expense of system resources and a messy desktop (virtual desktop software may help a little with the latter). A good solution for fast extraction would be a multi-threaded "Get Web Page". However, I can't find a way to do that with the provided actions.

Several days ago in another post, the support team mentioned the "Run shell command" action and using file reads/writes to communicate between scripts. That reminded me that I had never used this action or this method in any djuggler script, and that I had done multi-threaded downloading in Matlab, which is also not multi-threaded, by calling wget from a shell command. So maybe I could do the same thing in djuggler. After two days of scripting and testing, the extraction is now multi-threaded.

The story

I have 200,000 ids to check and get detailed information for. By checking users' activities in the community, e.g. posting and replying to posts, I can find more new ids, and I expect the final number could reach 1,000,000. Running 10 instances of a script for one day finishes about 1/5 of the ids at hand. Besides the heavy resource consumption on my working computer, it would take almost a month to finish the extraction. Multi-threading it is.

The solution

It would be great if a future release supported multi-threading. With the available actions, though, I have to turn to another application for help. I use the Windows version of wget, instead of the "Get Web Page" action, to fetch a webpage and save it under an informative file name (id, proxy, etc.). After starting dozens of initial wget threads with "Run shell command", the script enters an endless Loop in which a "Loop Files" action reads the downloaded webpages. Of course, the script has to check the size/date of each file before handling it, because wget creates the target file BEFORE the download is finished.
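For readers who want to try the same idea outside djuggler, here is a rough Python sketch of the two pieces described above: starting wget in the background with the id and proxy encoded in the file name, and checking size/date before touching a file. This is not my djuggler script. The function names, the "__" separator in the file name, and the 10-second "quiet period" are assumptions made for illustration, and the age check is only a heuristic (a stalled download can look finished).

```python
import os
import subprocess
import time


def output_name(record_id, proxy):
    """Encode id and proxy in the file name (':' is not allowed on Windows)."""
    return f"{record_id}__{proxy.replace(':', '_')}.html"


def build_wget_command(url, out_path, proxy, timeout_s=300):
    """Return the wget argument list for one download through one proxy."""
    return [
        "wget", "-q",
        f"--timeout={timeout_s}",           # 5 min, as in the post
        "--tries=1",                        # retries are left to the caller
        "-e", "use_proxy=on",
        "-e", f"http_proxy=http://{proxy}/",
        "-O", out_path,
        url,
    ]


def start_download(url, out_path, proxy):
    """Launch wget in the background and return without waiting for it."""
    return subprocess.Popen(build_wget_command(url, out_path, proxy))


def file_state(path, min_age_s=10, now=None):
    """Classify a file as 'pending', 'ready' or 'failed' from its size and date.

    Assumption: a file untouched for min_age_s seconds is no longer being
    written; an old but empty file is treated as a failed download.
    """
    now = time.time() if now is None else now
    stat = os.stat(path)
    if now - stat.st_mtime < min_age_s:
        return "pending"
    return "ready" if stat.st_size > 0 else "failed"
```

The polling loop would then skip "pending" files, process and delete "ready" ones, and count "failed" ones as an error for the proxy named in the file name.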
The script also needs to handle simple network errors, e.g. timeouts and server errors. After reading the content of a saved webpage, the original actions that followed "Get Web Page" can be used, maybe with minor modifications. After the read webpage is deleted, a new wget thread is started to get the webpage of another id.

403 Forbidden

You will probably get banned if you hammer a web server, and you should be. You can change your IP frequently, if you have that many resources, or use anonymizing software/networks to hide your identity. After my static IP was banned, I used the Tor network, among others, and had the relay banned too within 20 min. You don't want a reduced connection speed, and you don't want those generous people banned by your target victim as well.

The final solution I used is changing the proxy IP address every time I connect to the target website. It is easy to find proxy servers: many websites provide regularly updated lists of free working proxies. Wget can work through a proxy; just specify the proxy's ip:port in the parameters. But you need to do some proxy management to reject proxies with a high error rate and those from the CoDeeN network. I keep a use_count and an error_count for each proxy and update the values when checking/reading a downloaded webpage. I now have 270+ working proxies with different communication error rates, 120+ proxies deleted because of a high (> 0.2) error rate, and 140+ CoDeeN proxies that are useless for wget.

Let it run

The script starts 50 wget threads and then enters the "Loop Files"; if there is no finished webpage, it "Wait"s for 3 sec before looping again. The "Wait" line seldom appears in the status bar, so I think the script is running at full speed. However, only about 30 wget processes appear at the same time in the Task Manager; maybe that is the maximum the script can sustain on my system. I tried to run two instances of this script, but a "File not found" error popped up immediately.
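Since the proxy bookkeeping from the 403 Forbidden part is what keeps the whole thing running, here is a small Python sketch of it. Only use_count, error_count and the 0.2 cutoff come from what I described; the class and function names, the minimum of 10 uses before a proxy is judged, and the "least used first" rotation are assumptions for illustration.

```python
from dataclasses import dataclass

MAX_ERROR_RATE = 0.2   # cutoff mentioned in the post
MIN_USES = 10          # assumption: don't judge a proxy on very few samples


@dataclass
class ProxyStats:
    address: str            # "ip:port"
    use_count: int = 0
    error_count: int = 0

    @property
    def error_rate(self):
        """Fraction of uses that ended in an error (0.0 if never used)."""
        return self.error_count / self.use_count if self.use_count else 0.0


def record_result(stats, ok):
    """Update the counters after a downloaded page has been checked."""
    stats.use_count += 1
    if not ok:
        stats.error_count += 1


def should_drop(stats):
    """True once a proxy has enough samples and too many errors."""
    return stats.use_count >= MIN_USES and stats.error_rate > MAX_ERROR_RATE


def pick_proxy(pool):
    """Pick the least-used proxy (assumed rotation policy, not from the post)."""
    return min(pool, key=lambda s: s.use_count)
```

After each page is read, record_result is called for the proxy encoded in the file name, and any proxy for which should_drop returns True is removed from the pool before the next wget is started.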
Exclusive file access checking is necessary, which could be too complicated to implement.

Other issues

(1) In some cases wget won't terminate even after the specified timeout (5 min), and I don't know the reason.

(2) "Loop Files" seems to take one of the latest files available when the action starts, instead of taking a snapshot of all files in the folder and then looping through them. In this case, where new files are generated continuously, some old files are never read, so the number of files keeps increasing at a rate of about one file per min. This is a problem in the long run; I can't control the behavior of "Loop Files", nor can I find an automatic solution at the moment.

(3) I want to apply this method to other scripts for extracting other websites, but I can't copy/paste actions between scripts, or use Label sections as sub-scripts and share them between scripts. Now I know I can copy actions from one script, close that script, open the target one, and then paste the actions. This alleviates the pain a little, but it is not an elegant solution.

If we talk about simplicity and ease of use, frankly I don't think djuggler stands a chance against products that have a nice GUI. But djuggler is unique, and AFAIK there are no products similar to it, so djuggler still has the time and space to develop this necessary feature in a future release.

Thanks for reading. I have to turn back to watching the files and threads flashing on my screen until I get tired of it.