Python Asyncio - Coroutines and Crawling

foxlee 2022. 1. 18. 16:15

Fetching web pages from multiple sites, implemented asynchronously to improve speed.

Coroutines

Coroutines declared with the async/await syntax are the standard way to write asyncio applications.
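As a minimal sketch of that syntax: the names greet and main are made up for illustration, and asyncio.sleep stands in for real non-blocking I/O.

```python
import asyncio


async def greet(name, delay):
    # Calling greet(...) only creates a coroutine object;
    # the body runs once the coroutine is awaited.
    await asyncio.sleep(delay)  # stands in for non-blocking I/O
    return f"hello, {name}"


async def main():
    # Awaiting one after the other runs them sequentially.
    first = await greet("a", 0.01)
    second = await greet("b", 0.01)
    return [first, second]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

asyncio.run() creates an event loop, runs the coroutine to completion, and closes the loop again.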

Awaitable Object

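An object is awaitable if it can be used in an await expression. Coroutines, Tasks, and Futures are the three main kinds of awaitable. To await something, you either write the async function yourself or, when relying on an external library, use one that is implemented with async.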
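One way to see which objects are awaitable is inspect.isawaitable from the standard library. The sketch below uses made-up helper names (blocking_add, async_add, check_awaitables) and only illustrates the distinction.

```python
import asyncio
import inspect


def blocking_add(a, b):
    # A plain function: its return value cannot be awaited.
    return a + b


async def async_add(a, b):
    await asyncio.sleep(0)  # yield to the event loop once
    return a + b


async def check_awaitables():
    coro = async_add(1, 2)                         # coroutine object
    task = asyncio.ensure_future(async_add(3, 4))  # Task wrapping a coroutine
    future = asyncio.get_running_loop().create_future()
    future.set_result(7)                           # Future resolved by hand

    kinds = {
        "plain result": inspect.isawaitable(blocking_add(1, 2)),
        "coroutine": inspect.isawaitable(coro),
        "task": inspect.isawaitable(task),
        "future": inspect.isawaitable(future),
    }
    values = [await coro, await task, await future]
    return kinds, values


if __name__ == "__main__":
    print(asyncio.run(check_awaitables()))
```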

Future

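A Future is a special low-level awaitable object that represents the eventual result of an asynchronous operation. When a coroutine awaits a Future, it waits until that Future is resolved somewhere else. asyncio's Future objects are what allow callback-based code to be used together with async/await.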
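A rough sketch of that bridge between callbacks and async/await: legacy_fetch is a hypothetical callback-style API invented for this example, not part of any library.

```python
import asyncio


def legacy_fetch(loop, value, on_done):
    # Hypothetical callback-style API: calls on_done(value) a little later.
    loop.call_later(0.01, on_done, value)


async def fetch_with_future(value):
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    # The callback resolves the Future "somewhere else"...
    legacy_fetch(loop, value, future.set_result)
    # ...and this coroutine stays suspended here until that happens.
    return await future


if __name__ == "__main__":
    print(asyncio.run(fetch_with_future(42)))
```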

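Order of execution in the code below

loop = get_event_loop(): gets the current event loop
loop.run_until_complete(async_fun(data_list)): runs until the futures inside async_fun have completed
async_fun runs, and inside it each asynchronous call is appended to a list to build the list of futures
gather: the awaitable objects are run concurrently by the gather call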
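The same sequence can be tried without any network access. In this sketch fake_load is a made-up stand-in for a real page load, and asyncio.sleep simulates latency. It uses asyncio.new_event_loop() rather than get_event_loop(), since newer Python versions discourage calling get_event_loop() when no loop is running.

```python
import asyncio
import time


async def fake_load(name, delay):
    # Stand-in for a real page load: sleep simulates network latency.
    await asyncio.sleep(delay)
    return name


async def load_all(names, delay):
    tasks = []
    for name in names:
        # Schedule each coroutine as a task; nothing is awaited yet.
        tasks.append(asyncio.ensure_future(fake_load(name, delay)))
    # Wait for all tasks at once; results come back in the order of `tasks`.
    return await asyncio.gather(*tasks)


def run(names, delay=0.2):
    started = time.perf_counter()
    loop = asyncio.new_event_loop()
    try:
        results = loop.run_until_complete(load_all(names, delay))
    finally:
        loop.close()
    return results, time.perf_counter() - started


if __name__ == "__main__":
    print(run(["a", "b", "c"]))
```

Because the three loads overlap, the total time should be close to one delay rather than three.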

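## Only the part that loads multiple pages with async is shown; everything else was removed.
import asyncio
import aiohttp
from bs4 import BeautifulSoup  # used in load(); the "lxml" parser needs the lxml package

# ./crawler.py
class crawler:

    async def load(self, session):
        # Fetch the page without blocking the event loop.
        # _parse_url and _set_header() are not shown in this excerpt.
        async with session.get(
            self._parse_url, headers=self._set_header()
        ) as response:
            text = await response.read()
            self._doc = BeautifulSoup(text, "lxml")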

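# ./main.py
import logging  # assumption: the standard logging module is what run_crawler() passes to each crawler

async def async_load_site(session, crawl):
    # 4
    # await: used to call another async function (await is only allowed inside an async def).
    # await: instead of blocking here, control returns to the event loop so other tasks can run.
    await crawl.load(session)

async def load_all_sites(crawlers):
    async with aiohttp.ClientSession() as session:
        load_list = []
        # 2
        for crawl in crawlers:
            # ensure_future: wraps the coroutine in a Task (a Future subclass) and schedules it
            load = asyncio.ensure_future(async_load_site(session, crawl))
            load_list.append(load)
        # 3
        # gather: runs the scheduled tasks concurrently and waits until all of them finish.
        # It can also return their results; with return_exceptions=True, exceptions are returned instead of raised.
        await asyncio.gather(*load_list, return_exceptions=True)
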

def run_crawler():
    keywords = ["keyword1", "keyword2", "keyword3"]
    crawler_objects = [
        CrawlerCambridge,
        CrawlerCollins,
        CrawlerMacmillan,
        CrawlerMerriam,
        CrawlerOxford,
    ]

    # Created once, before the loop, because they collect results from every crawler
    definitions = []
    examples = []

    crawlers = []
    for keyword in keywords:
        for crawler_object in crawler_objects:
            crawler = crawler_object(logging)
            crawler.keyword = keyword
            # keyword_type is not defined in this excerpt
            crawler.keyword_type = keyword_type
            crawler.set_parse_url()
            crawlers.append(crawler)

    # 1
    # get_event_loop(): gets the event loop
    # run_until_complete: runs until the future (an instance of Future) is done / returns the future's result or raises its exception
    # On Python 3.7+, asyncio.run(load_all_sites(crawlers)) is the recommended way to do the same thing.
    asyncio.get_event_loop().run_until_complete(load_all_sites(crawlers))

    # Logic that runs after all pages have been loaded asynchronously
    for crawler in crawlers:
        crawler.parse()
        definitions.extend(crawler.definitions)
        examples.extend(crawler.examples)


if __name__ == "__main__":
    run_crawler()
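
In load_all_sites, gather is called with return_exceptions=True, so a failed page load comes back as an exception object in the result list instead of being raised. The code above discards that list. A minimal sketch of inspecting it follows, where load_one is a made-up stand-in for crawl.load that fails for names starting with "bad":

```python
import asyncio


async def load_one(name):
    # Hypothetical loader: fails for names that start with "bad".
    await asyncio.sleep(0.01)
    if name.startswith("bad"):
        raise ValueError(f"failed to load {name}")
    return name


async def load_and_report(names):
    # Exceptions are returned in place of results, in the same order as names.
    results = await asyncio.gather(
        *(load_one(name) for name in names), return_exceptions=True
    )
    loaded, failed = [], []
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            failed.append((name, str(result)))
        else:
            loaded.append(result)
    return loaded, failed


if __name__ == "__main__":
    print(asyncio.run(load_and_report(["a", "bad-b", "c"])))
```

Checking the results this way makes it possible to skip parsing for crawlers whose page never loaded.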