multithreading - Java multithreading executing more than the loop bounds
I am creating a web scraper that pulls both links and emails from the web. The links are used to find new places to search for emails, and the emails are stored in a set. Each link is passed to a fixed thread pool and searched for more emails in its own thread. I started off small, looking for 10 emails, but for some reason the code returns 13 emails.
    while (emailSet.size() <= EMAIL_MAX_COUNT) {
        link = linksToVisit.poll();
        linksToVisit.remove(link);
        linksVisited.add(link);
        pool.execute(new Scraper(link));
    }
    pool.shutdownNow();
    emailSet.stream().forEach((s) -> {
        System.out.println(s);
    });
    System.out.println(emailSet.size());
While I understand it is possible for created threads to still be running after 10 emails are found, shouldn't pool.shutdownNow() end those threads?
Here is the thread code, if it helps:
    class Scraper implements Runnable {
        private String link;

        Scraper(String s) {
            link = s;
        }

        @Override
        public void run() {
            try {
                Document doc = Jsoup.connect(link).get();
                Elements links = doc.select("a[href]");
                for (Element href : links) {
                    String newLink = href.attr("abs:href");
                    if (!linksVisited.contains(newLink) && !linksToVisit.contains(newLink)) {
                        linksToVisit.add(newLink);
                    }
                }
                Pattern p = Pattern.compile(
                        "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
                Matcher matcher = p.matcher(doc.text());
                while (matcher.find()) {
                    emailSet.add(matcher.group());
                }
            } catch (Exception e) {
                // Catch any of the many exceptions Jsoup.connect might throw
                // and let the thread expire.
            }
        }
    }
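The email-matching step in run() can be exercised on its own. The sketch below (a hypothetical, standalone demo, not part of the original crawler; the class and method names are made up) uses the same regex against plain text instead of a Jsoup document, and shows why the set deduplicates repeated addresses:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailRegexDemo {
    // Same pattern as in the Scraper above.
    static final Pattern EMAIL = Pattern.compile(
            "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");

    public static Set<String> extract(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group()); // the set silently drops duplicates
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> s = extract("Contact a@b.com or a@b.com and also c@d.org here");
        System.out.println(s.size()); // prints 2: the duplicate a@b.com is dropped
    }
}
```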
Edit 1:
I should have included that this is my first time using a thread-safe set and queue:
    Set<String> emailSet = Collections.synchronizedSet(new HashSet<>());
    BlockingQueue<String> linksToVisit = new ArrayBlockingQueue<>(10000);
    Set<String> linksVisited = Collections.synchronizedSet(new HashSet<>());
    final int EMAIL_MAX_COUNT = 10;
    ExecutorService pool = Executors.newFixedThreadPool(25);
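A synchronized set makes each individual operation atomic, which is why no emails are lost even when many scraper threads write at once (the overcount comes from check-then-act races, not from the set itself). A minimal, self-contained demo of that property (hypothetical class and method names, not from the original code):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SyncSetDemo {
    // Runs `threads` writers, each adding `perThread` distinct strings,
    // and returns the final set size.
    public static int concurrentAdd(int threads, int perThread) throws InterruptedException {
        Set<String> set = Collections.synchronizedSet(new HashSet<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    set.add(id + "-" + i); // each add() is atomic on its own
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return set.size();
    }

    public static void main(String[] args) throws InterruptedException {
        // No adds are lost: 8 threads x 1000 distinct strings each.
        System.out.println(concurrentAdd(8, 1000)); // prints 8000
    }
}
```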
Edit 2:
I figured I should update the question with the answer. Here is what the problem was:
    while (emailSet.size() <= EMAIL_MAX_COUNT) {
        link = linksToVisit.poll();
        linksToVisit.remove(link);
        linksVisited.add(link);
        pool.execute(new Scraper(link));
    }
My list started off with one link. After that first link was removed, I had an empty list, but the loop kept creating new threads with no link to search through. Before the list was populated again, it had created hundreds of threads doing nothing, slowing down the system until it crashed.
Here is the code fix that ensures no threads are created if there is no link to search:
    while (emailSet.size() <= EMAIL_MAX_COUNT) {
        if (linksToVisit.size() > 0) {
            link = linksToVisit.poll();
            linksToVisit.remove(link);
            linksVisited.add(link);
            pool.execute(new Scraper(link));
            //System.out.println("Emails " + emailSet.size());
        } else {
            try {
                Thread.sleep(100);
            } catch (InterruptedException ex) {
                Logger.getLogger(Crawler.class.getName())
                        .log(Level.SEVERE, null, ex);
            }
        }
    }
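Since linksToVisit is already a BlockingQueue, an alternative to the size-check-plus-sleep pattern is a timed poll(), which waits for a link instead of spinning. This is only a sketch of that idea under assumed names (PollLoopSketch, drain, and the maxTasks bound are all invented for illustration); the real crawler would call pool.execute(new Scraper(link)) where the counter is incremented:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollLoopSketch {
    // Pulls links from the queue and "schedules" work for each one,
    // stopping when the queue stays empty or maxTasks is reached.
    public static int drain(BlockingQueue<String> linksToVisit, int maxTasks)
            throws InterruptedException {
        int scheduled = 0;
        while (scheduled < maxTasks) {
            // Blocks for up to 100 ms; returns null if no link arrived in time.
            String link = linksToVisit.poll(100, TimeUnit.MILLISECONDS);
            if (link == null) {
                break; // queue empty: stop instead of creating idle tasks
            }
            scheduled++; // real code: pool.execute(new Scraper(link));
        }
        return scheduled;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(100);
        q.add("http://example.com/a");
        q.add("http://example.com/b");
        System.out.println(drain(q, 10)); // prints 2: stops once the queue is empty
    }
}
```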
You start the Scraper asynchronously inside a loop that checks the emailSet size. During one cycle of the loop, a scraper can find more than one email, or the loop can start more than one scraper, and after starting, each scraper adds all the emails and links it finds on a page. Consider the following timing:
T1 loop starts -> T2 loop schedules Scraper -> T3 loop checks emailSet -> T4 Scraper finds 13 emails -> T5 loop checks emailSet
Or the following one:
T1 loop starts -> T2 loop schedules Scraper "1" -> T3 loop checks emailSet -> T4 loop schedules Scraper "2" -> T5 Scraper "1" finds 6 emails -> T6 loop checks emailSet -> T7 Scraper "1" finds 7 more emails
And so on.
If you want to stop when you find 10 emails, you have to change the following:
    while (matcher.find()) {
        emailSet.add(matcher.group());
    }
to
    while (matcher.find()) {
        if (emailSet.size() < EMAIL_MAX_COUNT) {
            emailSet.add(matcher.group());
        }
    }
And even that does not guarantee you can stop at EMAIL_MAX_COUNT, because multiple threads (3, for example) can each check the size, see 9, and then all of them insert an email.
You must synchronize the read and write operations within a single block (with synchronized(emailSet) or using a lock) if you want to ensure the exact emailSet size; something like:
    while (matcher.find()) {
        synchronized (emailSet) {
            if (emailSet.size() < EMAIL_MAX_COUNT) {
                emailSet.add(matcher.group());
            }
        }
    }
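The synchronized check-then-add above can be verified in isolation. The following self-contained demo (hypothetical class and method names, invented for this sketch) races many tasks to insert into a shared set; because the size check and the add happen inside one synchronized block, the set never exceeds the cap:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CapDemo {
    // Submits `attempts` insertion tasks against a cap and returns the
    // final set size.
    public static int fillWithCap(int attempts, int cap) throws InterruptedException {
        Set<String> emailSet = Collections.synchronizedSet(new HashSet<>());
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < attempts; i++) {
            final String email = "user" + i + "@example.com";
            pool.execute(() -> {
                synchronized (emailSet) {      // read and write in one atomic block
                    if (emailSet.size() < cap) {
                        emailSet.add(email);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return emailSet.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(fillWithCap(50, 10)); // prints 10, never 11 or more
    }
}
```

Without the synchronized block, several threads could pass the size check at 9 simultaneously and push the set past the cap, which is exactly the race described above.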