[python] Finding a specific keyword in a PDF file and extracting the text before it

I wanted to loop over a set of PDF files and extract the text inside each one.

Installing the PyMuPDF package

pip install PyMuPDF

Reading a PDF file

import fitz  # PyMuPDF

pdf_path = './file_name.pdf'
pdf_document = fitz.open(pdf_path)

This opens the PDF file at the given path.

To read the text in the file:

import fitz  # PyMuPDF

pdf_path = './file_name.pdf'
pdf_document = fitz.open(pdf_path)

page_index = 0  # page number (0-based)
page = pdf_document[page_index]
text = page.get_text()

Access a page by its index, then call its get_text method.
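
To read the whole file rather than a single page, the same call can be repeated for every page. Here is a minimal sketch, assuming the document object can be iterated page by page (PyMuPDF documents can). collect_text is a name made up for this example, not part of PyMuPDF.

```python
def collect_text(document):
    """Concatenate the get_text() output of every page, in page order."""
    return "".join(page.get_text() for page in document)


# Usage with PyMuPDF (assumes './file_name.pdf' exists):
# import fitz  # PyMuPDF
# with fitz.open('./file_name.pdf') as pdf_document:
#     all_text = collect_text(pdf_document)
```

Opening the document in a with block closes the file once the text has been collected.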

Extracting from the beginning up to the end keyword

What I ultimately wanted to implement was extracting the text from the beginning of the PDF file up to a specific keyword.

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path, end_keyword):
    pdf_document = fitz.open(pdf_path)

    # Find the first page that contains the end keyword
    end_page_index = None
    end_index = -1
    for page_number in range(pdf_document.page_count):
        text = pdf_document[page_number].get_text()
        end_index = text.find(end_keyword)
        if end_index != -1:
            end_page_index = page_number
            break

    # Stop if the end keyword was not found on any page
    if end_page_index is None:
        print(f"End keyword '{end_keyword}' not found.")
        pdf_document.close()
        return None

    # Collect all text from the first page up to the end keyword
    full_text = ''
    for page_index in range(end_page_index + 1):
        page = pdf_document[page_index]
        if page_index == end_page_index:
            # On the page that contains the end keyword, keep only the text before it
            full_text += page.get_text()[:end_index]
        else:
            full_text += page.get_text()

    pdf_document.close()
    return full_text

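Here is the function that implements this!

It takes a file path and an end keyword, then walks through the file page by page looking for the end keyword.

If the end keyword exists, it stores the index of that page and the index of the keyword within that page's text!

(If it does not exist, the function prints a message and returns None.)

After that, it loops from page 0 through the page containing the end keyword and collects all the text in between.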
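
The truncation logic itself does not depend on PyMuPDF, so it can be tried out on plain strings. The sketch below is a single-pass variant of the same idea, not the function above. text_before_keyword and the sample page strings are made up for illustration.

```python
def text_before_keyword(page_texts, end_keyword):
    """Return all text before the first occurrence of end_keyword, or None if absent."""
    collected = []
    for text in page_texts:
        index = text.find(end_keyword)
        if index != -1:
            # Keyword found: keep only the part of this page that precedes it
            collected.append(text[:index])
            return "".join(collected)
        collected.append(text)
    return None


# Hypothetical page texts standing in for page.get_text() results
pages = ["Order\nThe appeal is dismissed.\n", "Reasons\n1. Background\n"]
print(repr(text_before_keyword(pages, "Reasons")))
# prints 'Order\nThe appeal is dismissed.\n'
```

Because each page is searched separately, a keyword that is split across a page boundary is not found by this approach.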

import os

# Path to the target folder
folder_path = './folder'

# Get the list of all files in the folder
file_list = os.listdir(folder_path)

# Keep only the PDF files
pdf_files = [file for file in file_list if file.lower().endswith('.pdf')]

# Process each selected PDF file
for pdf_file in pdf_files:
    full_path = os.path.join(folder_path, pdf_file)
    # Extract from the beginning up to just before the text '이유' ("reasons")
    extract_text = extract_text_from_pdf(full_path, '이유')
    # save_to_csv: a function that saves the result to a CSV file (not shown in this post)
    save_to_csv(str(full_path), extract_text)

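This is how the function is actually used!

After selecting all the PDF files under a specific folder, you can loop over them and extract the text that comes before a specific keyword in each one!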
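
The save_to_csv function called in the loop is not defined anywhere in this post. The following is only a hypothetical stand-in built on the standard csv module. The column names, the default output path and the handling of None are all assumptions, not the original implementation.

```python
import csv
import os


def save_to_csv(source_path, text, csv_path="extracted.csv"):
    """Append one (source_path, text) row to csv_path.

    Hypothetical helper: a header row is written first if the file does not exist yet.
    """
    is_new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(["path", "text"])
        # A None result (keyword not found) is stored as an empty cell
        writer.writerow([source_path, text if text is not None else ""])
```

Opening the file with newline="" lets the csv module handle line endings itself, so extracted text that contains line breaks stays inside a single quoted cell.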
