A Powerful Ally for PDF Processing: Uncovering Everything About pdfplumber
Introduction: What Is pdfplumber? 🤔
In today's increasingly digital world, PDF (Portable Document Format) has become an indispensable file format for sharing and storing information. Invoices, reports, contracts, academic papers, and many other kinds of documents are exchanged as PDFs. Because of the way the format is structured, however, PDFs are hard to extract information from programmatically. Accurately pulling out not just text but also tables and shapes is particularly difficult.
This is where the Python library pdfplumber comes in. pdfplumber is a powerful tool designed to make it easy to extract text, metadata, and tables from PDF files, along with detailed layout information such as lines, rectangles, and curves. It is built on the well-known PDF parsing library pdfminer.six and adds a more user-friendly interface and convenient features on top of it.
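pdfplumber's defining feature is its emphasis on the visual layout of a PDF. Rather than simply extracting characters, it analyzes their positions, sizes, and fonts, as well as graphical elements such as lines and rectangles, with the aim of extracting information accurately even from PDFs with complex layouts. Its ability to recognize and extract tables whose data is aligned but has no ruled lines is especially well regarded.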
This blog post walks through pdfplumber broadly and in depth: basic usage, somewhat more advanced features, a comparison with other libraries, and real-world use cases. It should be a useful read for developers and data scientists who want to extract data from PDFs with Python, and for anyone who simply wants to automate PDF processing. Let's explore the world of pdfplumber and expand what you can do with PDF data!
Installation: Getting Ready 🛠️
Getting started with pdfplumber is very easy. Just run the following command from the command line using pip, Python's package manager.
pip install pdfplumber
This installs pdfplumber itself together with the libraries it depends on (such as pdfminer.six). In general, no special configuration or external software is required.
If you want to install a specific version, specify the version number.
pip install pdfplumber==0.10.3  # install a specific version
Once installation is complete, write import pdfplumber in the Python interpreter or in a script and the library is ready to use. You can now start working with PDF files. Easy, isn't it?
Basic Usage: First Steps in Reading a PDF
pdfplumber's basic operations are intuitive. Here we look at the basic flow of opening a PDF file, accessing a page, and extracting text and tables.
1. Opening a PDF File
First, open the PDF file you want to work with using the pdfplumber.open() function, passing the file path as an argument. Using a with statement closes the file automatically once processing finishes, which makes resource management easier.
import pdfplumber

pdf_path = "path/to/your/document.pdf"  # path to the target PDF file

try:
    with pdfplumber.open(pdf_path) as pdf:
        # Work with the PDF here
        print(f"Opened PDF file '{pdf_path}'.")
        print(f"Total pages: {len(pdf.pages)}")
        # Show the metadata (if present)
        print(f"Metadata: {pdf.metadata}")
except FileNotFoundError:
    print(f"Error: file '{pdf_path}' was not found.")
except Exception as e:
    print(f"An error occurred while processing the PDF: {e}")
pdfplumber.open() returns a pdfplumber.PDF object, through which you can access various information about the PDF. For example, pdf.pages holds every page in the PDF as a list of Page objects, so len(pdf.pages) gives the total page count. pdf.metadata returns metadata such as the title, author, and creation date as a dictionary (when the PDF contains metadata).
2. Accessing Pages
Use the pdf.pages list to access a specific page. List indexes start at 0, so to access the first page, use pdf.pages[0].
import pdfplumber

pdf_path = "path/to/your/document.pdf"

with pdfplumber.open(pdf_path) as pdf:
    if len(pdf.pages) > 0:
        # Get the first page (index 0)
        first_page = pdf.pages[0]
        print(f"Page number of the first page: {first_page.page_number}")
        print(f"Width of the first page: {first_page.width} points")
        print(f"Height of the first page: {first_page.height} points")
        # Access the third page (if it exists)
        if len(pdf.pages) >= 3:
            third_page = pdf.pages[2]  # index 2
            print(f"Page number of the third page: {third_page.page_number}")
    else:
        print("This PDF has no pages.")
Each Page object carries information such as the page number (page_number, starting at 1), the width (width), and the height (height). The unit is the point, the standard unit in PDF (1 point = 1/72 inch).
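Since page sizes come back in points, a quick conversion helps when comparing them with paper sizes. The following is a minimal sketch; the helper names points_to_inches and points_to_mm are made up for this example and are not part of pdfplumber.

```python
POINTS_PER_INCH = 72.0  # 1 point = 1/72 inch
MM_PER_INCH = 25.4


def points_to_inches(points: float) -> float:
    """Convert a length in PDF points to inches."""
    return points / POINTS_PER_INCH


def points_to_mm(points: float) -> float:
    """Convert a length in PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


# Example values: an A4 portrait page is commonly reported as about 595 x 842 points.
width_mm = points_to_mm(595.0)
height_mm = points_to_mm(842.0)
print(f"{width_mm:.1f} mm x {height_mm:.1f} mm")
```

With a real page you would pass first_page.width and first_page.height instead of the example values.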
3. Extracting Text from a Page
To extract all the text from a given page, use the Page object's extract_text() method. It returns the text on the page as a single string.
import pdfplumber

pdf_path = "path/to/your/document.pdf"

with pdfplumber.open(pdf_path) as pdf:
    if len(pdf.pages) > 0:
        first_page = pdf.pages[0]
        # Extract text from the first page
        text = first_page.extract_text()
        if text:
            print("--- Text of the first page ---")
            print(text[:500])  # may be long, so show only the first 500 characters
            print("...")
        else:
            print("No text was extracted from the first page.")

        # Concatenate the text of all pages
        all_text = ""
        for i, page in enumerate(pdf.pages):
            page_text = page.extract_text()
            if page_text:
                all_text += f"--- Page {i+1} ---\n"
                all_text += page_text + "\n\n"
        # print("\n--- Text of all pages ---")
        # print(all_text)  # uncomment as needed
    else:
        print("This PDF has no pages.")
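extract_text() reconstructs spaces and line breaks from the positions of the characters, so it approximates the original PDF's layout as far as it can, but the result may not be perfect for complex layouts. If you need finer control, accessing the individual character (char) objects described later is useful.
Note that for PDFs with no text layer, such as scanned, image-only documents, extract_text() may return an empty string or None. Extracting text from such PDFs requires separate OCR (optical character recognition) processing. pdfplumber itself does not include OCR.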
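One practical way to act on this is to flag the pages whose extracted text is empty so they can be routed to OCR. The sketch below works on a plain list of strings (for example, the result of calling extract_text() on every page); the function name pages_without_text is an assumption of this example, not a pdfplumber API.

```python
from typing import List, Optional, Sequence


def pages_without_text(page_texts: Sequence[Optional[str]]) -> List[int]:
    """Return 1-based page numbers whose text is None, empty, or only whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


# Made-up sample: pages 2 to 4 yielded no usable text.
sample_texts = ["Invoice 2024-01", None, "", "   \n", "Total: 100"]
print(pages_without_text(sample_texts))
```

In a real script the input would be something like [page.extract_text() for page in pdf.pages].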
4. Extracting Tables from a Page
Table extraction is one of pdfplumber's most powerful features. Use the Page object's extract_tables() method, which returns all tables detected on the page as a list. Each table is a list of rows, and each row is a list of cell text, so the overall result is a list of lists of lists.
import pdfplumber
import pandas as pd  # used to display tables nicely (install with: pip install pandas)

pdf_path = "path/to/your/document_with_table.pdf"  # a PDF that contains a table

with pdfplumber.open(pdf_path) as pdf:
    if len(pdf.pages) > 0:
        target_page = pdf.pages[0]  # choose the page that contains the table
        # Extract the tables
        # (the extraction strategy can be tuned with table_settings, described later)
        tables = target_page.extract_tables()
        if tables:
            print(f"Found {len(tables)} table(s).")
            for i, table_data in enumerate(tables):
                print(f"\n--- Table {i+1} ---")
                # print(table_data)  # show the raw list-of-lists form
                # Converting to a pandas DataFrame makes the output easier to read
                try:
                    df = pd.DataFrame(table_data[1:], columns=table_data[0])  # treat the first row as the header
                    print(df.to_markdown(index=False))  # Markdown output (needs the tabulate package)
                except Exception as e:
                    print(f"Could not display with pandas ({e}); showing the raw data instead.")
                    print(table_data)
        else:
            print("No tables were found on this page.")
    else:
        print("This PDF has no pages.")
extract_tables() infers table structure from the lines (ruled lines) and the arrangement of characters in the PDF. The default settings extract many tables well, but for complex tables or tables without ruled lines you can improve accuracy by adjusting the extraction settings (table_settings), described later.
The extracted table data (table_data) is a nested list. For example, table_data[0] is the first row and table_data[0][0] is the text of the first cell in the first row. For tables with complex structure, such as merged cells, the result may not match your expectations. In that case you may need to work with the lower-level elements (lines and characters) directly.
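If you would rather not depend on pandas, the nested list can be turned into a list of dictionaries with a few lines of plain Python. This is a rough sketch under two assumptions: the first row is the header, and empty cells may arrive as None. The name table_to_records is hypothetical.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def table_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Treat the first row as the header and map every later row to a dict."""
    if not table:
        return []
    # Replace missing header cells with a placeholder name such as "column_1".
    header = [(cell or "").strip() or f"column_{i}" for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        cells += [""] * (len(header) - len(cells))  # pad rows that are too short
        records.append(dict(zip(header, cells)))  # extra cells beyond the header are dropped
    return records


# Made-up sample in the same shape as one table from extract_tables().
sample_table = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.50"],
    ["Gadget", None, "4.00"],
]
records = table_to_records(sample_table)
print(records)
```

Duplicate header names would overwrite each other in this sketch, so real tables may need extra handling.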
Advanced Features: Getting to Know pdfplumber More Deeply 🔬
pdfplumber offers much more than basic text and table extraction, including access to detailed internal PDF elements, customization of the extraction process, and visual debugging.
1. Accessing Detailed Objects
Beyond text and tables, a Page object also gives access to lower-level building blocks.
- Characters: page.chars returns every character on the page as a list of dictionaries. Each dictionary contains the character itself ('text'), its position ('x0', 'y0', 'x1', 'y1'), the font name ('fontname'), the size ('size'), and other details. This is useful for extracting only characters in a particular font or size, or for grouping information by character position.
    # Get character information for the first page
    char_list = first_page.chars
    if char_list:
        print(f"Information about the first character: {char_list[0]}")
        # Illustrative output (each entry holds a single character):
        # {'text': 'S', 'fontname': 'Helvetica', 'size': 12.0, 'upright': True, ... , 'x0': 50.0, 'y0': 750.0, 'x1': 57.0, 'y1': 762.0}
- Lines: page.lines returns every straight line on the page as a list of dictionaries, including the start and end coordinates ('x0', 'y0', 'x1', 'y1'), the line width ('linewidth'), and the color. You can use it to analyze table rules directly or to recognize parts of a figure.
- Rectangles: page.rects returns every rectangle on the page as a list of dictionaries, including the position ('x0', 'y0', 'x1', 'y1'), the line width, and fill information (the 'fill' flag, with the color itself usually reported under 'non_stroking_color'). You can use it to find shaded regions or to recognize shapes.
- Curves: page.curves returns the curves on the page (such as Bezier curves) as a list of dictionaries. It may help when analyzing complex figures or charts.
- Images: page.images returns information about the images on the page as a list of dictionaries. You can get each image's position and size, but the ability to extract or manipulate the image data itself is limited (other libraries, such as PyMuPDF, may be better suited to that).
Using these detailed objects makes it possible to handle highly unusual layouts and extraction needs that the standard text and table extraction cannot cover. For example, you can extract only the characters within a certain coordinate range, or build your own table-recognition logic from the line information.
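To make the coordinate-range idea concrete, here is a small sketch that works on plain dictionaries shaped like entries of page.chars. It assumes each dictionary has 'text', 'x0', 'x1', 'top', and 'bottom' keys; the helper names and the sample data are invented for illustration, and the line grouping is a deliberately simple heuristic rather than pdfplumber's own algorithm.

```python
from typing import Dict, List, Tuple

Char = Dict[str, object]


def chars_in_bbox(chars: List[Char], bbox: Tuple[float, float, float, float]) -> List[Char]:
    """Keep only characters that lie entirely inside (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return [
        c for c in chars
        if c["x0"] >= x0 and c["x1"] <= x1 and c["top"] >= top and c["bottom"] <= bottom
    ]


def group_into_lines(chars: List[Char], y_tolerance: float = 3.0) -> List[str]:
    """Group characters whose 'top' values are close together into text lines."""
    lines: List[List[Char]] = []
    for c in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(c["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(c)  # same visual line as the previous characters
        else:
            lines.append([c])  # start a new line
    return ["".join(ch["text"] for ch in sorted(line, key=lambda ch: ch["x0"])) for line in lines]


# Made-up characters standing in for page.chars.
sample_chars = [
    {"text": "i", "x0": 16.0, "x1": 20.0, "top": 100.5, "bottom": 112.5},
    {"text": "H", "x0": 10.0, "x1": 16.0, "top": 100.0, "bottom": 112.0},
    {"text": "!", "x0": 10.0, "x1": 14.0, "top": 120.0, "bottom": 132.0},
]
print(group_into_lines(sample_chars))
print(len(chars_in_bbox(sample_chars, (0, 90, 50, 115))))
```

With a real page you would pass page.chars in place of sample_chars and tune y_tolerance to the document's line spacing.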
2. Visual Debugging 🖼️
One of pdfplumber's unique and very handy features is visual debugging. It draws how pdfplumber has interpreted the PDF (character positions, detected lines, table boundaries, and so on) on top of the original page and outputs the result as an image. When extraction goes wrong or you are tuning parameters, you can see what the problem is, which greatly improves debugging efficiency.
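To use this feature, call the Page object's to_image() method and then the drawing methods of the image it returns, such as debug_tablefinder(table_settings=...). Rendering may require additional libraries: Pillow, and in older releases Wand (which depends on ImageMagick). Recent releases render pages through pypdfium2, which pip installs together with pdfplumber.
pip install "pdfplumber[image]"  # also install image-related dependencies, if your release defines this extra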
import pdfplumber

pdf_path = "path/to/your/document_with_table.pdf"

with pdfplumber.open(pdf_path) as pdf:
    if len(pdf.pages) > 0:
        page = pdf.pages[0]
        # Render the whole page as an image to draw objects on
        im = page.to_image(resolution=150)  # set the resolution
        # Draw character bounding boxes (red)
        im.draw_rects(page.chars, stroke=(255, 0, 0), stroke_width=1)
        # Draw the detected lines (blue)
        im.draw_lines(page.lines, stroke=(0, 0, 255), stroke_width=2)
        # Draw the table-finder debug output on a separate image and save it.
        # Page.debug_tablefinder() returns a TableFinder rather than an image,
        # so the drawing version on the page image is used here.
        # (pass table_settings to check the result of a specific strategy)
        im_table_debug = page.to_image(resolution=150).debug_tablefinder()
        im_table_debug.save("debug_tablefinder_output.png", format="PNG")
        print("Saved the table-finder debug image as 'debug_tablefinder_output.png'.")
        # Save the image with the custom drawings
        im.save("debug_page_output.png", format="PNG")
        print("Saved the custom drawing as 'debug_page_output.png'.")
    else:
        print("This PDF has no pages.")
Opening the generated image (for example debug_tablefinder_output.png) shows which lines pdfplumber treated as table rules and how it divided the cells. This gives you hints on how to adjust the extract_tables() settings (table_settings).
3. Customizing Table Extraction (table_settings)
extract_tables() and debug_tablefinder() accept a table_settings argument that lets you fine-tune the table-detection algorithm. It is specified as a dictionary.
Main settings:
- "vertical_strategy": how to find the vertical separators (column boundaries). Options include "lines" (use explicit lines), "text" (use the vertical alignment of characters), and "explicit" (use only the lines given in "explicit_vertical_lines"). The default is "lines".
- "horizontal_strategy": how to find the horizontal separators (row boundaries). Options include "lines", "text", and "explicit". The default is "lines".
- "explicit_vertical_lines": a list of x-coordinates of lines to use as vertical separators.
- "explicit_horizontal_lines": a list of y-coordinates of lines to use as horizontal separators.
- "snap_tolerance": the tolerance, in PDF points, within which nearly aligned parallel lines are snapped to the same vertical or horizontal position.
- "join_tolerance": the tolerance within which line segments lying on the same line, with ends close to one another, are joined into a single segment.
- "intersection_tolerance": the tolerance used when deciding whether two lines intersect.
- "text_...": detailed settings that apply when a strategy is "text" (for example "text_x_tolerance" and "text_y_tolerance").
For example, to extract a table with no ruled lines (one formed only by the alignment of the characters), setting vertical_strategy and horizontal_strategy to "text" often works.
# Assumes `page` is a pdfplumber Page obtained as in the earlier examples
no_lines_table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 3,  # adjust as needed
    "join_tolerance": 3,  # adjust as needed
}
# Try extracting a table without ruled lines
tables = page.extract_tables(table_settings=no_lines_table_settings)
# Check the result in a debug image (drawn on the rendered page image)
im_debug = page.to_image().debug_tablefinder(no_lines_table_settings)
im_debug.save("debug_no_lines_table.png")
The best settings depend on the PDF's layout, so it is important to iterate by trial and error together with the visual debugging feature.
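The trial-and-error loop can be made a little more systematic by generating a handful of candidate settings and comparing the results. The sketch below only builds the candidates and scores a result; both helper names are hypothetical, and counting non-empty cells is a crude heuristic chosen for this example, not something pdfplumber provides.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence

Table = List[List[Optional[str]]]


def candidate_table_settings(
    strategies: Sequence[str] = ("lines", "text"),
    snap_tolerances: Sequence[int] = (3, 5),
) -> List[Dict[str, object]]:
    """Build every combination of vertical/horizontal strategy and snap tolerance."""
    return [
        {"vertical_strategy": v, "horizontal_strategy": h, "snap_tolerance": snap}
        for v, h, snap in product(strategies, strategies, snap_tolerances)
    ]


def count_filled_cells(tables: Sequence[Table]) -> int:
    """Crude quality score: the number of non-empty cells across all tables."""
    return sum(1 for table in tables for row in table for cell in row if cell and cell.strip())


candidates = candidate_table_settings()
print(len(candidates))

# Possible use with a real page (not executed here):
# best = max(
#     candidates,
#     key=lambda s: count_filled_cells(page.extract_tables(table_settings=s)),
# )
```

A higher cell count does not guarantee a better table, so it is still worth confirming the chosen settings in a debug image.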
4. Cropping and Filtering Pages
When you only want to process a particular region, the Page object's crop() method is handy. You specify a bounding box (the coordinates (x0, top, x1, bottom)) and get a new page object containing only that area (strictly speaking, a CroppedPage object).
# Crop only the top-left part of the page (e.g. x from 0 to 200, y from 0 to 300)
# The origin (0, 0) is the top-left corner and y increases downward;
# top and bottom are distances measured from the top edge of the page.
# bbox = (x0, top, x1, bottom)
bbox = (0, 0, 200, 300)
cropped_page = page.crop(bbox)
# Extract text from the cropped region
cropped_text = cropped_page.extract_text()
print("--- Text in the cropped region ---")
print(cropped_text)
# Extract tables from the cropped region
cropped_tables = cropped_page.extract_tables()
# ...
The filter() method creates a new page object containing only the objects (characters, lines, and so on) that satisfy a given condition. Use it, for example, when you want to extract only characters of a certain font size.
# Function that keeps only characters with a font size larger than 14 points
def filter_large_text(obj):
    return obj.get("object_type") == "char" and obj.get("size", 0) > 14

# Apply the filter
filtered_page = page.filter(filter_large_text)
# Extract text from the filtered result
large_text = filtered_page.extract_text()
print("--- Large text ---")
print(large_text)
Combining these features lets you extract just the information you need from complex PDF documents efficiently and accurately. ✨
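Hard-coded crop coordinates break as soon as the page size changes, so it can help to express the region as fractions of the page. The following sketch computes an (x0, top, x1, bottom) tuple from relative values; relative_bbox is a hypothetical helper, not a pdfplumber function.

```python
from typing import Tuple


def relative_bbox(
    page_width: float,
    page_height: float,
    left: float,
    top: float,
    right: float,
    bottom: float,
) -> Tuple[float, float, float, float]:
    """Turn fractions of the page (0.0 to 1.0) into an (x0, top, x1, bottom) box in points."""
    for name, value in (("left", left), ("top", top), ("right", right), ("bottom", bottom)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if not (left < right and top < bottom):
        raise ValueError("the box must have a positive width and height")
    return (
        float(left * page_width),
        float(top * page_height),
        float(right * page_width),
        float(bottom * page_height),
    )


# Top-left quarter-height strip of an example 600 x 800 point page.
print(relative_bbox(600, 800, 0.0, 0.0, 0.5, 0.25))

# Possible use with a real page (not executed here):
# header = page.crop(relative_bbox(page.width, page.height, 0.0, 0.0, 1.0, 0.15))
```

Validating the fractions up front keeps the box inside the page, which avoids passing crop() a region that extends beyond it.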
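Comparison with Other Libraries: Which Tool Should You Choose?
Python has several libraries for working with PDFs. Besides pdfplumber, well-known ones include PyPDF2 (whose development now continues under the name pypdf) and PyMuPDF (fitz). Each has its own characteristics, and it is important to choose according to your purpose.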
Feature | pdfplumber | PyPDF2 | PyMuPDF (fitz) |
---|---|---|---|
Main use | Extracting text, tables, and detailed layout information | Splitting, merging, rotating, and encrypting PDFs, metadata handling, basic text extraction | Fast text and image extraction, page rendering (to images), annotation handling, editing (adding text and so on) |
Text extraction | Layout-aware extraction; per-character details available | Basic text extraction (limited layout preservation) | Fast extraction in several formats (plain text, HTML, JSON, XML); per-word information available |
Table extraction | A strength. High-accuracy extraction based on lines and character positions, with tunable settings. | No direct table extraction (you parse the extracted text yourself) | Historically none (you parse text and line information yourself); recent releases add Page.find_tables() |
Access to graphical elements | Coordinates and attributes of lines, rectangles, and curves | Limited | Access to vector graphics (drawing commands) |
Image extraction | Position and size available, but data extraction is limited | Limited | Efficient image data extraction and conversion |
Page operations | Cropping, filtering | Strong at splitting, merging, rotating, and reordering pages | Inserting, deleting, rotating pages, and more |
PDF editing | Not supported | Limited (form filling and so on) | Adding and editing text, images, and annotations; redaction |
Performance | Moderate (because of the detailed analysis) | Relatively lightweight | Very fast (bindings to the C-based MuPDF) |
Visual debugging | Powerful built-in feature | None | None (though rendering pages to images for checking is easy) |
Dependencies | pdfminer.six and others | Few | MuPDF (bundled with the library) |
License | MIT | BSD | AGPL / commercial license |
䜿ãåãã®ãã€ã³ãïŒ
-
pdfplumber ãé©ããŠããã±ãŒã¹ïŒ
- PDFå ã®ããŒãã«ïŒè¡šïŒãé«ç²ŸåºŠã«æœåºãããå Žå âš
- æåã®äœçœ®ããã©ã³ããç·ãç©åœ¢ãªã©ã®è©³çŽ°ãªã¬ã€ã¢ãŠãæ å ±ã«åºã¥ããŠããŒã¿ãæœåºãããå Žå
- æœåºããžãã¯ã®ãããã°ãèŠèŠçã«è¡ãããå Žå
- pdfminer.six ããã䜿ããããå©çšãããå Žå
-
PyPDF2 ãé©ããŠããã±ãŒã¹ïŒ
- PDFãã¡ã€ã«ã®åå²ãçµåãããŒãžã®å転ãã¡ã¿ããŒã¿ã®èªã¿æžããšãã£ãåºæ¬çãªãã¡ã€ã«æäœãäž»ç®çã®å Žå
- ã·ã³ãã«ãªããã¹ãæœåºã§ååãªå Žå
- äŸåé¢ä¿ãå°ãªãä¿ã¡ããå Žå
-
PyMuPDF (fitz) ãé©ããŠããã±ãŒã¹ïŒ
- 倧éã®PDFãé«éã«åŠçããå¿ èŠãããå Žå ð
- ããã¹ããç»åãæ§ã ãªåœ¢åŒã§å¹ççã«æœåºãããå Žå
- PDFããŒãžãç»åãšããŠã¬ã³ããªã³ã°ïŒã©ã¹ã¿ã©ã€ãºïŒãããå Žå
- PDFã«æ³šéãè¿œå ããããç°¡åãªç·šéãè¡ãããå Žå
- åçšå©çšã§ãªããã°AGPLã©ã€ã»ã³ã¹ã§åé¡ãªãå ŽåïŒåçšå©çšã¯å¥éã©ã€ã»ã³ã¹ãå¿ èŠïŒ
In conclusion, pdfplumber excels at table extraction and at analyzing detailed layout information. It is well suited to data extraction tasks based on visual structure, which are difficult with other libraries. On the other hand, when you need pure file manipulation, the fastest possible text or image extraction, or PDF editing, PyPDF2 or PyMuPDF may be the better choice. Choosing the library that matches your project's requirements is the key to efficient development.
Use Cases: Where Does pdfplumber Shine? 💼
pdfplumber's powerful extraction capabilities can be applied in many fields. Let's look at some concrete use cases.
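- Invoice and receipt data extraction: Automatically extract the issue date, issuer, recipient, line items, amounts, and similar fields from invoices and receipts sent as PDFs, and feed them into an accounting system or database. pdfplumber is particularly effective at extracting line-item lists laid out as tables. 💰
- Financial reports and market analysis: Extract stock prices, earnings data, economic indicators, and so on from PDF reports issued by banks and securities firms, and use them for analysis and visualization. It helps with tables in complex layouts and with annotation text on charts.
- Academic papers and research data collection: Extract reference lists, tables of experimental results, or sections containing particular keywords from large numbers of PDF papers to streamline trend analysis and literature reviews.
- Contract and legal document analysis: Extract contract periods, party names, specific clauses, and so on from contract PDFs to register them in a contract management system or to carry out risk analysis.
- Product catalog and specification comparison: Extract spec-table information from multiple product catalog PDFs to make side-by-side comparison easier.
- Information from government and municipal documents: Extract the information you need from statistics, meeting minutes, budget documents, and other material published as PDFs, and use it for civic activities or policy analysis. 🏛️
- Legacy system data migration: Use it as a supporting tool when extracting information from report data that an old system outputs as PDFs and migrating it to a new system. 💾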
These are just a few examples. With pdfplumber, you may be able to automate and streamline PDF extraction work that was previously done by hand or simply given up on. With a bit of creativity, it can support a wide range of process improvements and data uses. 💡
Tips and Best Practices: Getting More Out of pdfplumber ✨
Here are some hints and caveats for using pdfplumber effectively.
- Dealing with garbled text: With some PDFs, problems with embedded fonts or encodings cause extract_text() to return garbled characters. This is not specific to pdfplumber; it is a challenge for PDF parsing in general. Unfortunately there is sometimes no fundamental fix, but trying another library such as PyMuPDF can improve the result.
- Improving table extraction accuracy:
  - Start with the default settings. If they do not work well, use debug_tablefinder() to see where the problem lies.
  - When the ruled lines are clear, "vertical_strategy": "lines", "horizontal_strategy": "lines" (the default) works well. If some lines are missed, adjust snap_tolerance and intersection_tolerance.
  - When there are no ruled lines, or they are incomplete, try the "text" strategy ("vertical_strategy": "text" and so on). Tuning text_x_tolerance and text_y_tolerance is often the key.
  - If the table's position is fixed, cropping just the table area with page.crop() before extraction reduces interference from other elements and can improve accuracy.
  - For complex tables that still will not extract properly, consider implementing your own parsing logic using the information in page.chars and page.lines.
- Performance: When processing large numbers of PDFs, pdfplumber may be slower than PyMuPDF. If processing speed is the top priority, consider using PyMuPDF, or parallelize the work (for example with the multiprocessing library).
- Scanned PDFs (image PDFs): As mentioned earlier, pdfplumber has no OCR capability for reading text inside images. To handle scanned PDFs, either convert them beforehand into PDFs with a text layer using an OCR tool such as Tesseract OCR, or combine pdfplumber with an OCR library (for example pytesseract).
- Error handling: PDF files may be corrupted or may have unusual structures that deviate from the standard. Wrapping pdfplumber.open() and the extraction calls in a try...except block, so that an unexpected error does not stop the whole program, is important for a robust implementation. Add handling such as skipping the problematic file or recording an error log.
    import logging
    import pdfplumber

    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

    def process_pdf(pdf_path):
        try:
            with pdfplumber.open(pdf_path) as pdf:
                logging.info(f"Processing {pdf_path}...")
                # Example: extract text from the first page
                if pdf.pages:
                    text = pdf.pages[0].extract_text()
                    # print(text[:100])  # process as needed
                else:
                    logging.warning(f"No pages found in {pdf_path}")
        except FileNotFoundError:
            logging.error(f"File not found: {pdf_path}")
        except Exception as e:
            # Catch errors such as formats pdfplumber cannot handle or corrupted files
            logging.error(f"Failed to process {pdf_path}: {e}", exc_info=True)  # exc_info=True logs the full traceback

    # Usage
    # process_pdf("valid_document.pdf")
    # process_pdf("non_existent_file.pdf")
    # process_pdf("corrupted_or_unsupported.pdf")
- Library versions: pdfplumber and its dependencies (pdfminer.six and others) are under continuous development. If something does not behave as expected, or you want to use a newer feature, upgrading to the latest version may solve the problem (pip install --upgrade pdfplumber). An upgrade can change the behavior of existing code, however, so it is important to test thoroughly.
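The parallelization tip above can be sketched with the standard library's concurrent.futures. This is a minimal outline under stated assumptions rather than a tuned implementation: first_page_text, safe_call, and run_parallel are names invented for this example, and the worker simply returns the text of the first page. Errors are captured per file so that one bad PDF does not stop the batch.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional, Tuple


def first_page_text(pdf_path: str) -> Optional[str]:
    """Example worker: return the text of the first page, or None if there are no pages."""
    import pdfplumber  # imported lazily so each worker process loads it on demand

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[0].extract_text() if pdf.pages else None


def safe_call(worker: Callable[[str], object], path: str) -> Tuple[str, object]:
    """Run the worker and report failures as data instead of raising."""
    try:
        return ("ok", worker(path))
    except Exception as exc:  # corrupted or unsupported files end up here
        return ("error", f"{type(exc).__name__}: {exc}")


def run_parallel(
    paths: Iterable[str],
    worker: Callable[[str], object] = first_page_text,
    executor_factory=ProcessPoolExecutor,
    max_workers: Optional[int] = None,
) -> Dict[str, Tuple[str, object]]:
    """Apply the worker to every path in parallel and collect results by path."""
    path_list = list(paths)
    with executor_factory(max_workers=max_workers) as pool:
        results = pool.map(safe_call, [worker] * len(path_list), path_list)
        return dict(zip(path_list, results))


if __name__ == "__main__":
    # Quick check with a stand-in worker and threads, so no PDF files are needed.
    demo = run_parallel(["a.pdf", "b.pdf"], worker=str.upper, executor_factory=ThreadPoolExecutor)
    print(demo)
```

For real PDF work the default ProcessPoolExecutor is the more useful choice, since parsing is CPU-bound; in that case the worker must be a top-level function so it can be sent to the worker processes, and the call should sit under an if __name__ == "__main__": guard.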
Summary: Opening the Door to PDF Data with pdfplumber 🚪
In this article we took a detailed look at pdfplumber, a powerful PDF parsing library for Python, covering basic usage, advanced features, a comparison with other libraries, and concrete use cases.
pdfplumber's strengths are table extraction and access to detailed layout information (character positions, lines, rectangles, and so on). Its visual debugging feature is also very helpful when analyzing complex PDFs and tuning extraction logic.
From basic text extraction to invoice processing, report analysis, and research data collection, it can be a powerful tool for automating and streamlining data extraction from PDFs in many situations. It is not a cure-all, of course: when processing speed matters or you need to edit PDFs, consider PyMuPDF, and when basic file manipulation is the main task, consider PyPDF2.
Try introducing pdfplumber into your own projects and workflows, free yourself from time-consuming struggles with PDFs, and explore new possibilities for using your data. Happy plumbing!