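Originally I wanted to extract both the text and the images from docx files.
However, some of the things we think of as images are actually represented as equations, which turned out to be too much trouble.
So this is just a note on how to use POI to parse docx.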
https://poi.apache.org/components/document/quick-guide-xwpf.html
docx files are parsed with XWPF.
Get all the docx files under the doc directory
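// List the .docx files directly under the "doc" directory.
// Needs java.nio.file.{Files, Path}, java.util.List, java.util.stream.{Collectors, Stream}.
List<Path> docxfiles;
try (Stream<Path> entries = Files.list(Path.of("doc"))) { // closes the directory handle
    docxfiles = entries
            .filter(file -> file.toString().endsWith(".docx"))
            .collect(Collectors.toList());
}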
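One thing the simple suffix filter lets through: while a document is open in Word, an owner (lock) file such as `~$report.docx` sits next to it, and that file is not a real docx. The following is a minimal sketch of a stricter listing that skips those files and also accepts an upper-case extension. The class name `DocxLister` and its method names are made up for this note; only standard `java.nio.file` calls are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for this note, not part of POI.
class DocxLister {

    /** Returns the .docx files directly inside dir, sorted by path. */
    static List<Path> listDocx(Path dir) throws IOException {
        // Files.list keeps a directory handle open, so close it with try-with-resources.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(DocxLister::isDocx)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** True for names ending in .docx (any case) that are not Word lock files. */
    static boolean isDocx(Path file) {
        String name = file.getFileName().toString();
        // "~$name.docx" is the owner file Word writes while a document is open.
        return !name.startsWith("~$")
                && name.toLowerCase(Locale.ROOT).endsWith(".docx");
    }

    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args.length > 0 ? args[0] : "doc");
        listDocx(dir).forEach(System.out::println);
    }
}
```

Run without arguments it lists the `doc` directory used in this note; pass another directory as the first argument to list that instead.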
Read the text
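// Needs java.io.InputStream and org.apache.poi.xwpf.usermodel.{XWPFDocument, XWPFParagraph}.
// docxfile is one Path out of docxfiles.
try (InputStream in = Files.newInputStream(docxfile);
     XWPFDocument document = new XWPFDocument(in)) {
    List<XWPFParagraph> paragraphs = document.getParagraphs();
    paragraphs.forEach(paragraph -> {
        String text = paragraph.getText(); // plain text of this paragraph
    });
}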
Read the images. This code reads them from each XWPFParagraph and converts them to Base64, so that we know which paragraph each image belongs to.
// Needs java.util.Base64 and org.apache.poi.xwpf.usermodel.{XWPFDocument, XWPFParagraph, XWPFRun}.
try (InputStream in = Files.newInputStream(docxfile);
     XWPFDocument document = new XWPFDocument(in)) {
    List<XWPFParagraph> paragraphs = document.getParagraphs();
    paragraphs.forEach(paragraph -> {
        // Pictures hang off the runs, and XWPFParagraph exposes them via getRuns().
        List<XWPFRun> runs = paragraph.getRuns();
        runs.forEach(run -> {
            // getEmbeddedPictures() is simply empty for runs without pictures.
            run.getEmbeddedPictures().forEach(xwpfPicture -> {
                byte[] bytes = xwpfPicture.getPictureData().getData();
                String encode = Base64.getEncoder().encodeToString(bytes);
            });
        });
    });
}
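About the equations mentioned at the top: as far as I understand, Word stores them as Office Math markup (`m:oMath` elements) inside the document XML rather than as picture data, which would explain why the picture loop never sees them. Below is a rough sketch, without POI, for checking whether a file contains such equations. It rests on assumptions: the main part is at `word/document.xml` (the usual location in files saved by Word), and the XML uses the conventional `m:` and `w:` prefixes. The class `DocxMathCheck` and its methods are made up for this note. It is a plain text scan, not a real XML parse.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for this note, not part of POI.
class DocxMathCheck {

    // Opening <m:oMath> tags only; the trailing character class excludes <m:oMathPara>.
    private static final Pattern EQUATION = Pattern.compile("<m:oMath[\\s>/]");
    // Opening <w:drawing> tags (pictures, but also charts and shapes).
    private static final Pattern DRAWING = Pattern.compile("<w:drawing[\\s>/]");

    /** Reads word/document.xml out of the docx, which is an ordinary zip archive. */
    static String readMainPart(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            // Assumption: the main part lives at the usual path.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("no word/document.xml in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    static int countEquations(String xml) {
        return countMatches(EQUATION, xml);
    }

    static int countDrawings(String xml) {
        return countMatches(DRAWING, xml);
    }

    private static int countMatches(Pattern pattern, String xml) {
        Matcher matcher = pattern.matcher(xml);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        String xml = readMainPart(Path.of(args[0]));
        System.out.println("equations: " + countEquations(xml));
        System.out.println("drawings:  " + countDrawings(xml));
    }
}
```

If the equation count is non-zero, those "images" need separate handling, since there is no picture data to Base64-encode for them.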