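Originally I wanted to extract both the text and the images from docx files.
However, some things that look like images are actually stored as equations, which turned out to be too much trouble to handle.
So this is just a note on how to use POI to parse docx.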
https://poi.apache.org/components/document/quick-guide-xwpf.html
docx files are parsed with XWPF.
Get all the docx files under the doc directory
File docx = new File("doc");
// Collect the .docx files directly under "doc"; Files.list holds a directory handle, so close it
List<Path> docxfiles;
try (Stream<Path> files = Files.list(docx.toPath())) {
    docxfiles = files
            .filter(file -> file.getFileName().toString().endsWith(".docx"))
            .collect(Collectors.toList());
}
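The listing step can also be wrapped in a small reusable helper. The following is only a sketch using plain JDK classes: the class name DocxFiles and the method listDocx are hypothetical (they are not part of POI), and it assumes you also want to match the extension case-insensitively and skip the "~$" owner files that Word creates next to a document while it is open.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Apache POI.
final class DocxFiles {
    private DocxFiles() {}

    /** Returns the regular files directly under dir whose name ends with ".docx" (any case). */
    static List<Path> listDocx(Path dir) throws IOException {
        // Files.list keeps the directory open until the stream is closed
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                        // Assumption: "~$" files are Word lock files, not real documents
                        return name.endsWith(".docx") && !name.startsWith("~$");
                    })
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

With this in place, the list would be obtained with something like DocxFiles.listDocx(Path.of("doc")).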
Read the text
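// docxfile is one Path taken from docxfiles
try (InputStream in = Files.newInputStream(docxfile);
     XWPFDocument document = new XWPFDocument(in)) {
    List<XWPFParagraph> paragraphs = document.getParagraphs();
    paragraphs.forEach(paragraph -> {
        // Plain text of the whole paragraph
        String text = paragraph.getText();
    });
}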
Read the images. This reads them from each XWPFParagraph and converts them to Base64, so that you know which paragraph each image belongs to.
try (InputStream in = Files.newInputStream(docxfile);
     XWPFDocument document = new XWPFDocument(in)) {
    List<XWPFParagraph> paragraphs = document.getParagraphs();
    paragraphs.forEach(paragraph -> {
        // XWPFParagraph exposes its runs directly through getRuns()
        List<XWPFRun> runs = paragraph.getRuns();
        runs.forEach(run -> {
            // getEmbeddedPictures() is simply empty when the run holds no picture
            run.getEmbeddedPictures().forEach(xwpfPicture -> {
                byte[] bytes = xwpfPicture.getPictureData().getData();
                String encode = Base64.getEncoder().encodeToString(bytes);
            });
        });
    });
}
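The loop above encodes each picture but does not store the result anywhere. One way to keep the link between a paragraph and its images is to collect the Base64 strings by paragraph index. This is only a sketch with plain JDK types: the class ParagraphImages and its methods add and imagesOf are hypothetical names, not part of POI, and it assumes the paragraph's position in document.getParagraphs() is a good enough key.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for illustration; not part of Apache POI.
final class ParagraphImages {
    // Paragraph index -> Base64 strings of the images found in that paragraph
    private final Map<Integer, List<String>> byParagraph = new LinkedHashMap<>();

    /** Encodes the raw image bytes and files them under the given paragraph index. */
    void add(int paragraphIndex, byte[] imageBytes) {
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        byParagraph.computeIfAbsent(paragraphIndex, k -> new ArrayList<>()).add(encoded);
    }

    /** Returns the encoded images of one paragraph, or an empty list if it has none. */
    List<String> imagesOf(int paragraphIndex) {
        return Collections.unmodifiableList(
                byParagraph.getOrDefault(paragraphIndex, Collections.emptyList()));
    }
}
```

To use it, the paragraphs would be walked with an indexed for loop instead of forEach, calling add(i, xwpfPicture.getPictureData().getData()) for each picture found in paragraph i.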
