2019 13 Jun

Content migration in Drupal from XML Source

By Sachin Tibrewal

One of our recent posts covers the Migrate API in Drupal 8, the basics of migration, and migrating data for various field types from a CSV source. Some frameworks and CMSs, such as WordPress, allow data to be exported in XML format (or in JSON format if certain extensions are available). More often than not, another agency provides us with the XML data of the site that needs to be migrated to Drupal. In this post, we see how content can be migrated to Drupal from an XML source.

Note: Basic knowledge of XPath selectors is not a prerequisite, but it can be very helpful.

The major requirements/dependencies that need to be fulfilled are as follows:

Custom module - It contains the migration scripts, which are imported when the module is installed. The module must declare these dependencies: Migrate, Migrate Plus, Migrate Tools.
The main reason for creating a custom module is that once the content has been migrated successfully and requires no further updates, uninstalling the module removes the imported migration scripts without affecting any other workflow on the site.
Source XML file - This file can be external (it must be accessible over HTTP) or stored locally (in the private file directory).
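
As a rough sketch of the first requirement, the module's info file could declare the three dependencies as shown here. The module name ‘custom_migration’ is the one used in the migration template later in this post; the human-readable name, description and package are assumptions made for illustration.

```yaml
# custom_migration.info.yml
# The name, description and package values are illustrative assumptions.
name: 'Custom Migration'
type: module
description: 'Migration scripts for importing content from an XML source.'
package: Custom
# Drupal 8 syntax; newer core releases use core_version_requirement instead.
core: 8.x
dependencies:
  - drupal:migrate
  - migrate_plus:migrate_plus
  - migrate_tools:migrate_tools
```

Migration scripts placed in the module's config/install directory, in files named migrate_plus.migration.<id>.yml, are imported as configuration when the module is installed.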

Writing migration scripts:

Assuming we are acquainted with basic elements in a migration script such as id, label, migration group, etc., we move on to the most important components of the migration scripts:
Source ~ Process ~ Destination
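
For reference, the basic elements mentioned above usually sit at the top of the migration script. The sketch below uses hypothetical values: the ID ‘book_files’, the label and the group name ‘books’ are assumptions chosen to match the file example in this post.

```yaml
# Hypothetical header of a migration script. All three values are
# illustrative and can be changed freely.
id: book_files
label: 'Book cover image files'
migration_group: books
```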

Let’s have a look at the important parameters that need to be defined.

Source
- plugin - The source plugin to use; ‘url’ in this case.
- data_fetcher_plugin - Defines how the source data is retrieved: ‘file’ for a local file or a general URL, or ‘http’ for an HTTP connection.
- data_parser_plugin - Defines the format used to parse the source data, such as JSON, XML, or SOAP.
- urls - URL to file or storage path to file with a stream wrapper (can be multiple).
- item_selector - Data is parsed as nodes in XML using XPath selectors. This property identifies the individual item to be migrated.
- fields - Under this parameter, we map the source fields to machine names that can be used in the process part of the migration. Each field has three keys: name, label, selector.
  - ‘name’ - unique name to identify the field in other parts of the migration.
  - ‘label’ - describes the type of data.
  - ‘selector’ - the XPath selector relative to the path defined in item_selector to extract the data for the field from the source file.
- ids - Defines the unique key to be used for mapping in migration tables.
Process
This part follows the same general template for any migration, whatever the source plugin. The values mapped under the ‘fields’ parameter in the source plugin are manipulated using various process plugins and then assigned to fields of the entities being migrated. We can use the result of one process plugin as the input for another through chaining.
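
A minimal sketch of chaining, assuming a hypothetical ‘title’ source field defined under ‘fields’: the first plugin trims the value and the second receives the trimmed result as its input. The fallback text ‘Untitled’ is also an assumption.

```yaml
process:
  title:
    # Step 1: strip surrounding whitespace from the source value
    # using the PHP trim() function.
    -
      plugin: callback
      callable: trim
      source: title
    # Step 2: receives the output of step 1 and substitutes a fixed
    # value if nothing is left.
    -
      plugin: default_value
      default_value: 'Untitled'
```

Only the first plugin in a chain needs a ‘source’; each later plugin works on the value produced by the one before it.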
Destination
The destination plugin defines the target entity to be created from the data produced by the process plugins.

Here’s an illustration that depicts the process of migration of files.

To create a file entity in Drupal, the following attributes are required: filename, uri, uid, status.
We can define default values for uid and status, but filename and uri must be unique, so these values must be extracted from the source XML file while migrating files.

The sample given below contains data for files related to books. Each <item> node represents a file and has a title, location and unique ID associated with it.

<books>
  <item>
    <title>In Search of Lost Time</title>
    <fid>1</fid>
    <link><![CDATA[https://www.demo.com/images/books/cover-image/search-lost-time.png]]></link>
  </item>
  <item>
    <title>The Lost Symbol</title>
    <fid>2</fid>
    <link><![CDATA[https://www.demo.com/images/books/cover-image/lost-symbol.jpg]]></link>
  </item>
  <item>
    <title>The Alchemist</title>
    <fid>3</fid>
    <link><![CDATA[https://www.demo.com/images/books/cover-image/alchemist.png]]></link>
  </item>
</books>

Sample XML data

Migration template:

SOURCE:

source:
  # We use the 'url' source plugin together with the XML data parser plugin.
  plugin: url
  data_fetcher_plugin: http
  data_parser_plugin: xml
  urls: 'private://books/files.xml'
  # The XPath to use to query the desired elements.
  item_selector: /books/item
  # Under 'fields', we list the data items to be imported. The first level keys
  # are the source field names we want to populate (the names to be used as
  # sources in the process configuration below). For each field we're importing,
  # we provide a label (optional - this is for display in migration tools) and
  # an XPath for retrieving that value. This XPath is relative to the elements
  # retrieved by item_selector.
  fields:
    -
      name: fid
      label: 'File ID'
      selector: fid
    -
      name: url
      label: 'File Link'
      selector: link
  # Under 'ids', we identify source fields populated above which will uniquely
  # identify each imported item. The 'type' makes sure the migration map table
  # uses the proper schema type for storing the IDs.
  ids:
    fid:
      type: integer
  # Constants can be defined
  constants:
    file_dest_uri: 'public://books/images'
...

The source plugin used is ‘url’.
The ‘http’ data fetcher supports request headers and authentication, which is useful when the source file is served from a remote location. Migrate Plus also provides a ‘file’ fetcher for sources read from a local path or a general URL.
Since the data to be parsed is in XML format, the data_parser_plugin used is ‘xml’.
The source file is ‘files.xml’, which is stored locally in the ‘private’ directory, so we use the private stream wrapper to access it.
Each <item> node in the sample data maps to one file entity to be migrated, so item_selector points at these nodes with the XPath ‘/books/item’.
We choose the destination file location ourselves, so from the source we only need two values: the file location, to download and save the file, and a unique ID, to identify the mapping in the migration tables for lookups. Both are defined under the ‘fields’ parameter.
Each parent node ‘/books/item’ represents one file entity, and the source file location can be obtained from its ‘link’ node.
Since the ‘fid’ field is unique and can be used as the source ID for mapping in the migration tables, we specify it under the ‘ids’ parameter.

PROCESS:

...

process:
  # Assign 'url' value to a temporary variable.
  file_source: url
  # Using the 'explode' plugin and '/' as delimiter on file URL, we obtain 
  # an array with the file name as the last element and using 'array_pop' plugin
  # we get the file name with extension.
  temp_name:
    -
      plugin: explode
      source: '@file_source'
      delimiter: /
    -
      plugin: array_pop
  # Using 'concat' plugin with uri (defined under 'constants') and filename,
  # we get the destination file path. The 'urlencode' plugin is used to form 
  # a valid destination URL for the file.
  file_destination:
    -
      plugin: concat
      delimiter: /
      source:
        - constants/file_dest_uri
        - '@temp_name'
    -
      plugin: urlencode
  # Assign the temp_name value to the 'filename' attribute. A direct
  # reference is used because 'default_value' would store the literal
  # string '@temp_name' instead of resolving it.
  filename: '@temp_name'
  # Copy the file from the source location and add it to the destination 
  # using 'file_copy' plugin. If a file already exists, we can specify if 
  # the file should be replaced by the new file or the new file should 
  # be renamed and then copied.
  uri:
    plugin: file_copy
    source:
      - '@file_source'
      - '@file_destination'
    file_exists: replace
  # Owner of the file entity (user 1).
  uid:
    plugin: default_value
    default_value: 1
  # File status: 0 marks the file as temporary, 1 as permanent.
  status:
    plugin: default_value
    default_value: 0

...

File names are not given explicitly in the source XML, so we derive the value for ‘filename’ from the URL of the source file. The ‘explode’ and ‘array_pop’ plugins apply the PHP explode() and array_pop() functions to the source data.
The ‘concat’ plugin joins the specified URI and file name to create the complete destination path for the file. This path is then URL-encoded using the ‘urlencode’ plugin.
The ‘file_copy’ plugin copies the file from the source to the destination. The ‘file_exists’ parameter specifies what to do if the file already exists: the existing file can be replaced, or the new file can be renamed, as required.
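
As a sketch of the alternative behaviour, the same mapping can keep an existing file and save the incoming one under a new name by switching ‘file_exists’ to ‘rename’. This assumes a core version in which ‘file_copy’ supports the ‘file_exists’ option, as in the template above.

```yaml
process:
  uri:
    plugin: file_copy
    source:
      - '@file_source'
      - '@file_destination'
    # Keep the existing file and store the new file under a new name,
    # instead of overwriting as 'replace' does.
    file_exists: rename
```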

DESTINATION:

...

destination:
  plugin: entity:file
migration_dependencies: {}
# Under 'dependencies', we declare the module this configuration depends on.
# This ensures that the migration configuration is removed when the module
# is uninstalled after the migration has completed successfully.
dependencies:
  enforced:
    module:
      - custom_migration

Since our target entity to be created is of type file, we specify the value ‘entity:file’ as the destination plugin.

This template represents a basic configuration that can be used for migration of files from an XML source. We can write migration templates for other entities following the same strategy as illustrated. First, we identify the key fields for which the data is to be extracted and define XPath selectors to obtain the values in the source section. Second, we operate on these values using one or more process plugins and map the results to the entity fields. And third, we define the entity type to be created after the migration.
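
To illustrate the same three steps for another entity type, here is a hedged sketch of a node migration that reads the same XML file and attaches each migrated file to a node. Everything specific in it is an assumption: the migration IDs ‘book_nodes’ and ‘book_files’ (the latter standing for the file migration above), the content type ‘book’, and the image field ‘field_cover_image’.

```yaml
# Hypothetical node migration; IDs, bundle and field names are assumptions.
id: book_nodes
label: 'Book nodes'
source:
  plugin: url
  data_fetcher_plugin: http
  data_parser_plugin: xml
  urls: 'private://books/files.xml'
  item_selector: /books/item
  fields:
    -
      name: fid
      label: 'File ID'
      selector: fid
    -
      name: title
      label: 'Book title'
      selector: title
  ids:
    fid:
      type: integer
process:
  title: title
  # Look up the Drupal file ID that the file migration created for this
  # source 'fid', using its migration map table.
  field_cover_image/target_id:
    plugin: migration_lookup
    migration: book_files
    source: fid
destination:
  plugin: entity:node
  default_bundle: book
# Run the file migration first so that the lookup can find its results.
migration_dependencies:
  required:
    - book_files
```

The lookup is the reason the file migration keeps a unique source ID: it lets other migrations translate a source ‘fid’ into the ID of the file entity that was created for it.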

If you have any queries or suggestions related to this post, please let us know in the comments.