Transform documents to text
Topic closed:
Please note this is an old forum thread. Information in this post may be out of date and/or erroneous.
Every phpdocx version includes new features and improvements. Previously unsupported features may have been added to newer releases, or past issues may have been corrected.
We encourage you to download the current phpdocx version and check the Documentation available.

Posted by kobbe  · 11-02-2020 - 07:42

I've tried the transformDocument and docx2txt functions and successfully transformed a DOCX file to plain text.

But is there any way to get more information? For example, which paragraphs are headlines, and what the font sizes, font weights, alignments, etc. are?

My goal is to get everything into a PHP array with all the data, so a .txt file is not optimal.

Thanks for any advice. :)

Posted by kobbe  · 12-02-2020 - 12:29

Thank you, I am trying it out.

Getting error:

Call to a member function getAttribute() on null in xxx/phpdocx-advanced-9.0/classes/CreateDocx.php on line 5417

Using this:

$docx = new CreateDocxFromTemplate($docx_file);
$referenceNode = [
   'type' => 'paragraph'
];
$styles = $docx->getWordStyles($referenceNode);

 

Checked the code and it looks like this:

$pStyle = $nodeXPath->query($query, $contentNode);
if ($pStyle > 0) {
    $pStyleName = $pStyle->item(0)->getAttribute('w:val');

I ran var_dump on $pStyle and this is the result:

object(DOMNodeList)#391 (1) {
  ["length"]=>
  int(0)
}
 

So it looks like the error is in this check, which passes even though the node list is empty:

if ($pStyle > 0) {

 

Is this maybe fixed in 9.5?

Posted by admin  · 12-02-2020 - 12:35

Hello,

phpdocx 9.5 includes the following changes in those lines:

$pStyle = $nodeXPath->query($query, $contentNode);
if ($pStyle->length > 0) {
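
To illustrate why the check needs ->length, here is a minimal sketch using PHP's built-in DOM classes rather than phpdocx itself. The XML fragment and the XPath expression are assumptions made up for the example; the point is only that a query with no matches returns an empty DOMNodeList whose item(0) is null.

```php
<?php
// Minimal sketch: an XPath query with no matches returns an empty DOMNodeList,
// not null or false, so the guard has to inspect ->length.
$ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

$dom = new DOMDocument();
// Hypothetical paragraph that has w:pPr but no w:pStyle child.
$dom->loadXML('<w:p xmlns:w="' . $ns . '"><w:pPr/></w:p>');

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('w', $ns);

$pStyle = $xpath->query('//w:pPr/w:pStyle');

$pStyleName = null;
if ($pStyle->length > 0) {
    // Only reached when at least one w:pStyle node was found.
    $pStyleName = $pStyle->item(0)->getAttribute('w:val');
}
```

With this guard, $pStyleName simply stays null for paragraphs without a w:pStyle, instead of getAttribute() being called on null.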

Regards.

Posted by kobbe  · 12-02-2020 - 12:35

Figured I'd just test 9.5... and it is fixed there!

Feel free to remove this comment + my bug report comment!! :)

Posted by kobbe  · 12-02-2020 - 13:07

Sorry, some follow-up questions.

I successfully get all the styles using getWordStyles, and I also got all the text content using getWordContents. But I do not understand how to match the content with the style.

If I filter the styles using "contains", the same content could appear multiple times, so I won't know which is which.

Example:

$referenceNode = [
   'type' => 'paragraph',
   'contains' => 'test text'
];
$styles = $docx->getWordStyles($referenceNode);

"test text" might match multiple paragraphs.

 

Hoping for advice :)

Posted by kobbe  · 12-02-2020 - 13:33

Sorry for answering myself! ;) Looks like I can use the same key in both arrays!

$referenceNode = [
   'type' => 'paragraph'
];
$styles = $docx->getWordStyles($referenceNode);
$contents = $docx->getWordContents($referenceNode);
foreach ($contents as $key => $item) {
   if (empty($styles[$key]['pStyle']['val'])) {
      continue;
   }
   $style = $styles[$key]['pStyle']['val'];
   echo $style.' - ';
   echo $item."\n";
}
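
If the goal is a single PHP array rather than echoed lines, the same key-based pairing could be sketched as below. This assumes, as observed above but not confirmed by the documentation, that both arrays use matching keys for the same paragraph. The function name pairContentsWithStyles and the sample data are hypothetical and only mimic the shape of the real return values.

```php
<?php
/**
 * Pair each content entry with its paragraph style name by shared key.
 * Entries without a pStyle value get null instead of being skipped.
 */
function pairContentsWithStyles(array $contents, array $styles): array
{
    $paired = [];
    foreach ($contents as $key => $text) {
        $paired[$key] = [
            'text'  => $text,
            'style' => $styles[$key]['pStyle']['val'] ?? null,
        ];
    }
    return $paired;
}

// Hypothetical sample data, shaped like the arrays used in the loop above.
$contents = ['Intro', 'test text', 'test text'];
$styles = [
    ['pStyle' => ['val' => 'Heading1']],
    [],
    ['pStyle' => ['val' => 'Normal']],
];

$rows = pairContentsWithStyles($contents, $styles);
```

Because the pairing goes by key and not by text, the two identical "test text" paragraphs stay distinct entries.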

 

Posted by kobbe  · 12-02-2020 - 13:44

So instead, a new follow-up question.

Looking at the style, how can I tell that it is a headline and what depth it has? :o

array(2) {
   ["pStyle"]=>
  array(3) {
      ["type"]=>
    string(8) "w:pStyle"
      ["val"]=>
    string(32) "FormatmallRubrik1MnsterInget15gr"
      ["styles"]=>
    array(13) {
         [0]=>
      array(4) {
            ["tag"]=>
        string(7) "w:style"
            ["type"]=>
        string(4) "open"
            ["level"]=>
        int(1)
        ["attributes"]=>
        array(3) {
               ["w:type"]=>
          string(9) "paragraph"
               ["w:customStyle"]=>
          string(1) "1"
               ["w:styleId"]=>
          string(32) "FormatmallRubrik1MnsterInget15gr"
        }
      }
      [1]=>
      array(4) {
            ["tag"]=>
        string(6) "w:name"
            ["type"]=>
        string(8) "complete"
            ["level"]=>
        int(2)
        ["attributes"]=>
        array(1) {
               ["w:val"]=>
          string(49) "Formatmall Rubrik 1 + Mönster: Inget (15 % grå)"
        }
      }
      [2]=>
      array(4) {
            ["tag"]=>
        string(9) "w:basedOn"
            ["type"]=>
        string(8) "complete"
            ["level"]=>
        int(2)
        ["attributes"]=>
        array(1) {
               ["w:val"]=>
          string(7) "Rubrik1"
        }
      }
      [3]=>
      array(3) {
            ["tag"]=>
        string(14) "w:autoRedefine"
            ["type"]=>
        string(8) "complete"
            ["level"]=>
        int(2)
      }
      [4]=>
      array(4) {
            ["tag"]=>
        string(6) "w:rsid"
            ["type"]=>
        string(8) "complete"
            ["level"]=>
        int(2)
        ["attributes"]=>
        array(1) {
               ["w:val"]=>
          string(8) "00F90FB4"
        }
      }
      [5]=>
      array(3) {
            ["tag"]=>
        string(5) "w:pPr"
            ["type"]=>
        string(4) "open"
            ["level"]=>
        int(2)
      }
      [6]=>
      array(4) {
            ["tag"]=>
        string(5) "w:shd"
            ["type"]=>
        string(8) "complete"
            ["level"]=>
        int(3)
        ["attributes"]=>
        array(3) {
               ["w:val"]=>
          string(5) "clear"
               ["w:color"]=>
          string(4) "auto"
               ["w:fill"]=>
          string(6) "D9D9D9"
        }
      }
      [7]=>
      array(3) {
            ["tag"]=>
        string(5) "w:pPr"
            ["type"]=>
        string(5) "close"
            ["level"]=>
        int(2)
      }
      [8]=>
      array(3) {
            ["tag"]=>
        string(5) "w:rPr"
            ["type"]=>
        string(4) "open"
            ["level"]=>
        int(2)
      }
      [9]=>
      array(4) {
            ["tag"]=>
        string(8) "w:rFonts"
            ["type"]=>
        string(8) "complete"
            ["level"]=>
        int(3)
        ["attributes"]=>
        array(1) {
               ["w:cs"]=>
          string(15) "Times New Roman"
        }
      }
      [10]=>
      array(4) {
            ["tag"]=>
        string(6) "w:szCs"
            ["type"]=>
        string(8) "complete"
            ["level"]=>
        int(3)
        ["attributes"]=>
        array(1) {
               ["w:val"]=>
          string(2) "20"
        }
      }
      [11]=>
      array(3) {
            ["tag"]=>
        string(5) "w:rPr"
            ["type"]=>
        string(5) "close"
            ["level"]=>
        int(2)
      }
      [12]=>
      array(3) {
            ["tag"]=>
        string(7) "w:style"
            ["type"]=>
        string(5) "close"
            ["level"]=>
        int(1)
      }
    }
  }
  ["pPr"]=>
  array(3) {
      ["type"]=>
    string(5) "w:pPr"
      ["val"]=>
    string(5) "w:pPr"
      ["styles"]=>
    array(3) {
         [0]=>
      array(3) {
            ["tag"]=>
        string(5) "w:pPr"
            ["type"]=>
        string(4) "open"
            ["level"]=>
        int(1)
      }
      [1]=>
      array(4) {
            ["tag"]=>
        string(8) "w:pStyle"
            ["type"]=>
        string(8) "complete"
            ["level"]=>
        int(2)
        ["attributes"]=>
        array(1) {
               ["w:val"]=>
          string(32) "FormatmallRubrik1MnsterInget15gr"
        }
      }
      [2]=>
      array(3) {
            ["tag"]=>
        string(5) "w:pPr"
            ["type"]=>
        string(5) "close"
            ["level"]=>
        int(1)
      }
    }
  }
}

Posted by admin  · 12-02-2020 - 14:56

Hello,

Headings are applied using w:outlineLvl tags, and the attribute w:val sets the depth (w:val value + 1):

http://officeopenxml.com/WPparagraphProperties.php

The tag is set as a w:pPr property, in custom styles and/or inline styles.
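
As a rough sketch, the depth could be read from one of the 'styles' arrays like this. The helper name headingDepth and the sample entries are hypothetical; only the tag/attributes layout is taken from the dump earlier in the thread. Note that the dumped custom style contains no w:outlineLvl of its own, only a w:basedOn pointing to "Rubrik1", so the level is presumably inherited from that parent style and would have to be looked up there; that lookup is not shown here.

```php
<?php
/**
 * Return the heading depth (w:val + 1) from a list of style tags,
 * or null when no w:outlineLvl tag is present.
 */
function headingDepth(array $styleTags): ?int
{
    foreach ($styleTags as $tag) {
        if (($tag['tag'] ?? '') === 'w:outlineLvl'
            && isset($tag['attributes']['w:val'])) {
            return (int) $tag['attributes']['w:val'] + 1;
        }
    }
    return null;
}

// Hypothetical style entries in the same shape as the dump in this thread.
$withLevel = [
    ['tag' => 'w:pPr', 'type' => 'open', 'level' => 2],
    ['tag' => 'w:outlineLvl', 'type' => 'complete', 'level' => 3,
     'attributes' => ['w:val' => '0']],
    ['tag' => 'w:pPr', 'type' => 'close', 'level' => 2],
];
$withoutLevel = [
    ['tag' => 'w:basedOn', 'type' => 'complete', 'level' => 2,
     'attributes' => ['w:val' => 'Rubrik1']],
];

$depth = headingDepth($withLevel);    // w:val "0" gives depth 1
$none  = headingDepth($withoutLevel); // no w:outlineLvl, so null
```

A null result only means the tag is absent from the array that was passed in, not necessarily that the paragraph is not a heading.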

Regards.