Matching Something Within Something Else

A common class of problems that people try to solve with regular expressions is finding all occurrences of a certain pattern, but only within the occurrences of another pattern. This example uses a block of HTML that contains paragraph tags both inside and outside div tags. We want to match all the paragraphs inside the div tags, but none of those outside them.

A problem like this is best solved using two regular expressions. Use one regular expression to match the div tags. Use a second regular expression to match the paragraph tags within the matches of the first regular expression. A bit of procedural code glues everything together.

RegexMagic can generate only one regular expression at a time. So we’ll tackle this problem in two steps. We’ll create the two regexes separately and have RegexMagic generate a source code snippet for each. We’ll combine the two code snippets ourselves.

  1. Click the New Formula button on the top toolbar to clear out all settings on the Samples, Match, and Action panels.
  2. On the Samples panel, paste in one new sample:
    <h1>Matching Something Within Something Else</h1>
    <p>Introduction</p>
    <div>
    <p>We want <i>this</i> paragraph.</p>
    <table><tr><td>We</td><td>don't</td><td>want</td><td>tables</td></tr></table>
    <p>We want this one too.</p>
    </div>
    <p>We don't want this one.</p>
    <p>Nor this one.</p>
    <div>
    <p>Another one we want.</p>
    <p>:-)</p>
    </div>
    <p>The end.</p>
  3. Set the subject scope to “whole sample”.
  4. On the Match panel, set both “begin regex match at” and “end regex match at” to “anywhere”.
  5. Select the first occurrence of <div> in the sample.
  6. Click the Mark button. RegexMagic automatically adds a literal text pattern that matches the text we marked.
  7. In the settings for the “literal text” pattern, tick the “case insensitive” checkbox.
  8. Select everything between the <div> we just marked and the first </div> that follows it, including all the line breaks. One way to do this is to put the cursor at the end of the line <div> and then press Shift+Right and then Shift+Down until the cursor is before the </div>.
  9. Click the Mark button to make the whole contents of the div tag the second field.
  10. In the “pattern to match field” drop-down list, select “match anything” for field 2.
  11. In the “match anything except” drop-down list, select “nothing”. We want field 2 to match absolutely anything.
  12. Set the left-hand “repeat this field” spinner for field 2 to zero and tick the “unlimited” checkbox to allow field 2 to match any number of characters.
  13. Set “how to repeat this field” for field 2 to “as few times as possible”. We want the regex to end its match at the first </div> tag, not the last one.
  14. In the samples, select the first </div> tag.
  15. Click the Mark button. RegexMagic automatically adds another “literal text” field to match the closing tag.
  16. In the settings for the “literal text” pattern, tick the “case insensitive” checkbox.
  17. On the Regex panel, select “PHP preg 8.0.0–8.1.24” as the application, turn off free-spacing, and turn off mode modifiers. Click the Generate button, and you’ll get this regular expression:
    <div>\p{Any}*?</div>

    Required options: Case insensitive.
    Unused options: Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Greedy quantifiers.

  18. The Samples panel now confirms our regular expression matches all div tags in the sample:
    <h1>Matching Something Within Something Else</h1>
    <p>Introduction</p>
    <div>
    <p>We want <i>this</i> paragraph.</p>
    <table><tr><td>We</td><td>don't</td><td>want</td><td>tables</td></tr></table>
    <p>We want this one too.</p>
    </div>
    <p>We don't want this one.</p>
    <p>Nor this one.</p>
    <div>
    <p>Another one we want.</p>
    <p>:-)</p>
    </div>
    <p>The end.</p>
  19. On the Use panel, in the Function drop-down list, choose “iterate over all matches in a string”.
  20. Set the “subject text” parameter to $subject and the “result array” parameter to $div.
  21. The Use panel now shows the first part of our code. Copy it into your PHP source code editor:
    preg_match_all('%<div>\p{Any}*?</div>%ui', $subject, $div, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($div[0]); $i++) {
    	# Matched text = $div[0][$i];
    }
    
  22. Now we move on to the second regular expression, which matches the paragraph tags. It is very similar to the first. Instead of matching the opening and closing tags of a div and everything between them, the second regex matches the opening and closing tags of a p and everything between them. We’ll take a shortcut and edit the RegexMagic formula that we used to generate the first regex instead of starting over.
  23. On the Match panel, use the “select field” drop-down list to select field 1.
  24. Replace <div> with <p> as the text this field should match.
  25. On the Match panel, use the “select field” drop-down list to select field 3.
  26. Replace </div> with </p> as the text this field should match.
  27. The Regex panel now generates this regular expression:
    <p>\p{Any}*?</p>

    Required options: Case insensitive.
    Unused options: Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Greedy quantifiers.

  28. Because we took a shortcut, the Samples panel looks a bit messy. The intense highlighting shows that our regex matches all paragraph tags, including those outside the div tags. That’s OK. RegexMagic handles only one regular expression at a time. We’ll combine the two regexes in our PHP code to get only the paragraph tags inside the div tags. You can ignore the faint highlighting on the Samples panel. It indicates the text that we marked for the div regex, which our modified regex no longer matches.
    <h1>Matching Something Within Something Else</h1>
    <p>Introduction</p>
    <div>
    <p>We want <i>this</i> paragraph.</p>
    <table><tr><td>We</td><td>don't</td><td>want</td><td>tables</td></tr></table>
    <p>We want this one too.</p>
    </div>
    <p>We don't want this one.</p>
    <p>Nor this one.</p>
    <div>
    <p>Another one we want.</p>
    <p>:-)</p>
    </div>
    <p>The end.</p>
  29. On the Use panel, select the function “get an array of all regex matches in a string”.
  30. Set the “subject text” parameter to $div[0][$i], the variable that holds the matched text inside the loop of the first code snippet.
  31. Set the “result array” parameter to $p.
  32. The Use panel now shows the second part of our code:
    preg_match_all('%<p>\p{Any}*?</p>%ui', $div[0][$i], $p, PREG_PATTERN_ORDER);
    $p = $p[0];
    
  33. Paste this part into your PHP code editor, inside the loop of the first part:
    preg_match_all('%<div>\p{Any}*?</div>%ui', $subject, $div, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($div[0]); $i++) {
    	preg_match_all('%<p>\p{Any}*?</p>%ui', $div[0][$i], $p, PREG_PATTERN_ORDER);
    	$p = $p[0];
    }
    
  34. Add one line and change one line of code to create a new variable $pwithindiv that will hold all of the paragraph tags within div tags in the string $subject:
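    # Collects every paragraph found inside any div
    $pwithindiv = array();
    # Outer regex: $div[0] holds the full text of each div match
    preg_match_all('%<div>\p{Any}*?</div>%ui', $subject, $div, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($div[0]); $i++) {
    	# Inner regex: search for paragraphs only within the current div match
    	preg_match_all('%<p>\p{Any}*?</p>%ui', $div[0][$i], $p, PREG_PATTERN_ORDER);
    	$pwithindiv = array_merge($pwithindiv, $p[0]);
    }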
    

Though this example requires a lot of steps, it’s all very straightforward. You simply generate two regexes independently, one to match the outer text, and one to match the inner text. In your source code you combine the two regexes to make the second regex search through only text matched by the first regex.
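
To try the result outside RegexMagic, the finished code can be wrapped into a small self-contained script that runs against the sample from step 2. This is only a sketch: the function name match_within and its parameters are made up for this illustration and are not something RegexMagic generates. The two regexes are the ones generated in steps 17 and 27.

```php
<?php
// Hypothetical helper, not RegexMagic output: returns every match of
// $innerRegex that lies inside a match of $outerRegex.
function match_within(string $outerRegex, string $innerRegex, string $subject): array
{
    $found = [];
    // preg_match_all() returns false on error and 0 when nothing matches
    if (!preg_match_all($outerRegex, $subject, $outer, PREG_PATTERN_ORDER)) {
        return $found;
    }
    foreach ($outer[0] as $section) {
        // Search only the text matched by the outer regex
        if (preg_match_all($innerRegex, $section, $inner, PREG_PATTERN_ORDER)) {
            $found = array_merge($found, $inner[0]);
        }
    }
    return $found;
}

// The sample from step 2
$subject = <<<'SAMPLE'
<h1>Matching Something Within Something Else</h1>
<p>Introduction</p>
<div>
<p>We want <i>this</i> paragraph.</p>
<table><tr><td>We</td><td>don't</td><td>want</td><td>tables</td></tr></table>
<p>We want this one too.</p>
</div>
<p>We don't want this one.</p>
<p>Nor this one.</p>
<div>
<p>Another one we want.</p>
<p>:-)</p>
</div>
<p>The end.</p>
SAMPLE;

$pwithindiv = match_within(
    '%<div>\p{Any}*?</div>%ui',  // outer regex from step 17
    '%<p>\p{Any}*?</p>%ui',      // inner regex from step 27
    $subject
);
print_r($pwithindiv);
```

With this sample, $pwithindiv should end up holding the four paragraphs inside the two div tags. Because the div regex ends its match at the first </div>, the sketch assumes that div tags are not nested.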

If you’re not developing software, you can use the same method with an advanced grep tool such as PowerGREP, which allows you to use more than one regular expression in a search. In PowerGREP, set the “action type” to “search”. Then set “file sectioning” to “search for sections”. Paste the regular expression that matches the outer text (in this example the one for the div tags) into the “section search” box. Then paste the regex for the inner text (the one for p tags) into the main part of the action in PowerGREP. When you execute this action, PowerGREP uses the file sectioning regex to find all the div tags, and then uses the main regex to find the paragraphs within the div tags only. This is no different from what our PHP code does, except that it requires no programming.
