Multi-thread extraction
Last Post 20 May 2010 05:03 PM by Extractor. 2 Replies.
Extractor

--
17 May 2010 09:43 AM
Most of a script's running time is spent waiting for the "Get Web Page" line to finish. You can run multiple instances of the same script to speed things up, at the expense of system resources and a messy desktop (virtual desktop software may help a little with the latter). A good solution for fast extraction would be a multi-threaded "Get Web Page", but I couldn't find a way to do that with the provided actions. Several days ago, in another post, the support team mentioned the "Run shell command" action and using file reads/writes to communicate between scripts. That reminded me of two things: I had never used this action or this method in any Djuggler script, and I had once done multi-threaded downloading in Matlab, which is also not multi-threaded, by calling wget from a shell command. So maybe I could do the same thing in Djuggler. After two days of scripting and testing, it is now extracting with multiple threads.


The story

I have 200,000 ids to check and get detailed information for. By checking users' activities in the community, e.g. posting and replying to posts, I can find more new ids, and I expect the final number could reach 1,000,000. Running 10 instances of a script for one day finishes about 1/5 of the ids at hand. Besides the heavy resource consumption on my working computer, it would take almost a month to finish the extraction. Multi-threading it is.


The solution

It would be great if a future release supported multi-threading. But with the available actions, I have to turn to another application for help. Instead of the "Get Web Page" action, I use the Windows version of wget to get a webpage and save it under an informative file name (id, proxy, etc.). After starting dozens of initial wget threads with "Run shell command", the script enters an endless Loop in which a "Loop Files" action reads the downloaded webpages. Of course, the script has to check the size/date of each file before handling it, because wget creates the target file BEFORE the download is finished. The script also needs to handle simple network errors, e.g. timeouts and server errors. After reading the content of a saved webpage, the original actions that followed "Get Web Page" can be used, maybe with minor modifications. And after deleting the webpage that was just read, a new wget thread is started to get the webpage of another id.
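For readers who want to try the same idea outside Djuggler, here is a minimal Python sketch of the two key pieces: starting wget without waiting for it, and deciding whether a saved page is complete. It is not the Djuggler script itself. The function names, the 5-second "quiet period" and the `--tries=1` choice are my assumptions; the 300-second timeout mirrors the 5 min timeout mentioned further down.

```python
import os
import subprocess
import time


def wget_command(url, out_path, proxy=None, timeout_s=300):
    """Build a wget command line using standard wget options."""
    cmd = ["wget", "--quiet", "--tries=1", f"--timeout={timeout_s}", "-O", out_path]
    if proxy:
        # proxy is an "ip:port" string; -e passes a wgetrc-style setting
        cmd += ["-e", "use_proxy=yes", "-e", f"http_proxy={proxy}"]
    cmd.append(url)
    return cmd


def start_download(url, out_path, proxy=None):
    """Start wget and return immediately, like a fire-and-forget shell command."""
    return subprocess.Popen(wget_command(url, out_path, proxy))


def looks_finished(path, min_age_s=5.0, now=None):
    """Heuristic: a file is complete if it is non-empty and has not
    been modified for at least min_age_s seconds."""
    now = time.time() if now is None else now
    st = os.stat(path)
    return st.st_size > 0 and (now - st.st_mtime) >= min_age_s
```

The size/date check is only a heuristic. If the launcher keeps the handle returned by `start_download`, calling `poll()` on it is a more direct way to know that wget has exited.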


403 Forbidden

You will probably get banned if you hammer a web server, and you should be. You can change your IP frequently, if you have that many resources, or use anonymizing software/networks to hide your identity. After my static IP was banned, I used the Tor network, among others, and had the relay banned too within 20 min. You don't want a reduced connection speed, and you don't want those generous people to get banned by your target as well. The final solution I used is changing the proxy IP address every time I connect to the target website. It is easy to find proxy servers: many websites provide free working proxies with regular updates.

Wget can work through a proxy; just specify the proxy's ip:port in the parameters. But you need to do some proxy management to reject proxies that have a high error rate and those from the CoDeeN network. I keep a use_count and an error_count for each proxy, and update the values when checking/reading a downloaded webpage. Now I have 270+ working proxies with different communication error rates, 120+ proxies deleted because of a high (> 0.2) error rate, and 140+ CoDeeN proxies that are useless for wget.
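The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not the Djuggler script. The class names and the `min_uses` threshold (which stops a proxy being dropped after a single unlucky request) are my assumptions; the 0.2 error-rate limit is the one quoted above.

```python
import random
from dataclasses import dataclass


@dataclass
class Proxy:
    address: str          # "ip:port"
    use_count: int = 0
    error_count: int = 0

    def error_rate(self):
        return self.error_count / self.use_count if self.use_count else 0.0


class ProxyPool:
    def __init__(self, addresses, max_error_rate=0.2, min_uses=10):
        self.proxies = {a: Proxy(a) for a in addresses}
        self.max_error_rate = max_error_rate
        self.min_uses = min_uses  # assumed: judge a proxy only after this many uses

    def pick(self, rng=random):
        """Pick a random working proxy; raises IndexError if the pool is empty."""
        return rng.choice(list(self.proxies.values()))

    def report(self, address, ok):
        """Record the outcome of one download and drop the proxy if it is too flaky."""
        p = self.proxies.get(address)
        if p is None:
            return  # already removed
        p.use_count += 1
        if not ok:
            p.error_count += 1
        if p.use_count >= self.min_uses and p.error_rate() > self.max_error_rate:
            del self.proxies[address]
```

Calling `report(address, ok)` each time a downloaded page is checked keeps the counters current, and `pick()` then only ever returns proxies that have not been rejected.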


Let it run

The script starts 50 wget threads and then enters the "Loop Files"; if there is no finished webpage, it "Wait"s for 3 sec before looping again. The "Wait" line seldom appears in the status bar, so I think the script is running at full speed. However, only about 30 wget processes appear at the same time in the Task Manager; maybe that's the maximum the script can manage on my system.

I tried to run two instances of this script, but a "File not found" error popped up immediately. Exclusive file access checking is necessary, which could be too complicated to implement.


Other issues

(1) In some cases wget won't terminate even after the specified timeout (5 min), and I don't know the reason.
(2) "Loop Files" seems to take one of the latest files available when the action starts, instead of taking a snapshot of all files in the folder and then looping through them. In this case, where new files are generated continuously, some old files are never read, so the number of files keeps increasing at a rate of about one file per min. This is a problem in the long run; I can't control the behavior of "Loop Files", and I can't find an automatic solution at the moment.
(3) I want to apply this method to other scripts for extracting other websites, but I can't copy/paste actions between scripts, or use Label sections as sub-scripts and share them between scripts. Now I know I can copy actions from one script, close that script, open the target one, and then paste the actions. This alleviates the pain a little bit, but it is not an elegant solution. If we talk about simplicity and ease of use, frankly I don't think Djuggler stands a chance when compared with tools that have a nice GUI. But Djuggler is unique and AFAIK there is no product similar to it, so Djuggler still has the time and space to develop this necessary feature in a future release.
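For issue (2), the "snapshot" behavior I was hoping for looks roughly like this in Python. It is a sketch of the idea only, not of how "Loop Files" works internally; the function name is made up. It lists the folder once and returns the files oldest first, so an old file cannot be starved by newer arrivals.

```python
import os


def snapshot_oldest_first(folder):
    """Return the paths of all regular files in folder, oldest first.

    The listing is taken once; files created afterwards are picked up
    by the next call rather than by the current pass.
    """
    entries = []
    with os.scandir(folder) as it:
        for entry in it:
            if entry.is_file():
                entries.append((entry.stat().st_mtime, entry.path))
    entries.sort()
    return [path for _, path in entries]
```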

Thanks for reading. Now I have to go back to watching the files and threads flash across my screen until I get tired of it.
Tijn

--
20 May 2010 10:31 AM
Extractor, this is quite impressive! Wget is indeed a very nice command-line tool to combine with Djuggler. Multi-threading is indeed something you need when you want speed in web scraping. I don't think it's something to build into Djuggler, because it would make the script environment much, much more complicated.

The way to create multi-threading with Djuggler is by creating two scripts: typically a master script and multiple identical slave scripts. The master script puts URLs in a database, and the slave scripts read the database for URLs to process. The "Loop Files" problem you describe is solved by the database mechanism. Another advantage of a database is that you can run as many slave scripts as needed from other IP addresses all over the world.
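To make the database mechanism concrete, here is a minimal Python sketch of such a URL queue. It uses SQLite purely for illustration; the table layout, column names and function names are assumptions, and a setup with workers on other machines would need a database server instead of a local file.

```python
import sqlite3


def open_queue(path):
    """Open (or create) the URL queue; autocommit mode so we control transactions."""
    conn = sqlite3.connect(path, timeout=30, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    return conn


def add_urls(conn, urls):
    """Master side: queue new URLs, silently skipping ones already known."""
    conn.executemany("INSERT OR IGNORE INTO urls (url) VALUES (?)", [(u,) for u in urls])


def claim_url(conn, worker):
    """Worker side: atomically take one pending URL, or return None if none is left."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT url FROM urls WHERE status = 'pending' LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE urls SET status = 'claimed', worker = ? WHERE url = ?",
                (worker, row[0]),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[0] if row else None


def mark_done(conn, url):
    conn.execute("UPDATE urls SET status = 'done' WHERE url = ?", (url,))
```

Because each worker claims a URL inside one write transaction, two workers should not end up with the same URL, which is the property the file-based loop was missing.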

In my experience the web harvesting speed is determined by these factors:
- Your connection speed with the web
- The speed of the web server you request pages from
- The optimization in your script: DOM request or Source request. In case of DOM, set the silent option!
- A single Windows computer can do about 5-6 simultaneous web requests; more requests will dramatically slow down performance. This means 5 slave scripts on a single Windows installation is probably the max.
- A proxy will also slow down the web scraping process, but it is needed when you encounter IP request blocking after a certain request interval.
- Setting the user agent to a typical spider string may help to stay out of web analytics reports.

Regarding your third comment under "Other issues": I totally agree. However, there is one advantage in Djuggler not using the Windows clipboard for copying and pasting its own actions: you can copy and paste Djuggler actions without losing text strings in your Windows clipboard.

Tijn
Extractor

--
20 May 2010 05:03 PM
Update:

(1) The problem of exclusive file access when running multiple instances of the script can be easily solved by using a different folder for every instance. But the total number of wget processes running simultaneously on the system remains at around 30.

(2) The problem of files piling up is due to "Delete File" failing to delete the file. For unknown reasons, the files are still "being used by another program" even after their wget has finished. The delete action runs in silent mode and the script moves on even if an error occurs in deleting files, leaving garbage files behind. Sometimes dozens of garbage files are generated under the same id.

(3) The solution I finally used is essentially the same as the mentioned method of master-slave scripts and a database of URLs. One script looks for users, one script looks for folders of users, and one script looks for files in folders. Because of the problem in deleting files, additional queues are used to remember the ids being checked. An id is put into the corresponding queue at the beginning of extraction and deleted from the queue upon completion. Later, any file found is deleted directly if its id is not in the queue; of course, an error could also occur in such a deletion, but no more re-download attempts will be made for that id.

(4) Occasionally, if an error occurs in opening a file, a system error of "out of resource" or "Canvas does not allow drawing" is likely to follow.

(5) Now, the main cause of a script run terminating is "File could not be opened; it might be opened by another program." in the "Open Text File" action. The file size is checked to ensure the download is finished before trying to open the file, so the file should not be in use by any program, and yet this error pops up randomly, and I don't have a clue how to solve it.
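The queue trick from (3) can be illustrated with a short Python sketch. It assumes, hypothetically, that each saved file is named `<id>_<something>.html` and that the ids currently being extracted are kept in a set; neither detail is taken from my actual Djuggler script.

```python
import os


def id_from_filename(name):
    """Hypothetical naming scheme: '<id>_<anything>.html'."""
    return name.split("_", 1)[0]


def remove_orphans(folder, in_flight):
    """Delete files whose id is no longer in the in-flight set.

    Returns the names that were actually deleted. Files that cannot be
    removed (e.g. still locked by another process) are left for a later pass.
    """
    deleted = []
    for name in sorted(os.listdir(folder)):
        if id_from_filename(name) in in_flight:
            continue  # still being extracted, keep it
        try:
            os.remove(os.path.join(folder, name))
            deleted.append(name)
        except OSError:
            pass  # locked or not a regular file; try again next time
    return deleted
```

Unlike a silent delete, the return value shows which files really went away, so leftovers can at least be counted.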


Thanks for reading.

