Results 1 to 7 of 7

Thread: xml embed references. halp! :(

  1. #1
    Registered User upb's Avatar
    Join Date
    May 2003
    Posts
    50
    Blog Entries
    4

    xml embed references. halp! :(

    Hi.

    Could anyone please offer any tips/pointers/ideas on how to optimize the following:

    For each element in document, that has a href attribute and no children elements and no text children nodes,
    find the element with id = the value of that href attribute and move all children elements and attributes (except id) to the referencing element, and remove the referenced element.

    The current implementation takes 0,09 sec per root level detached element on a 10MB xml document, which amounts to about 3 minutes.
    That is bearable but the situation is hopeless with more elements ...

    Code:
    protected static void EmbedRefs(ref XmlNode n, ref XmlNode container)
    {
    	int i = 0;
    	XmlNodeList detachedChildren = n.SelectNodes("child::*[@href and not(text())]");
    	foreach (XmlNode child in detachedChildren)
    	{
    		string href = child.Attributes["href"].Value.Substring(1);
    		XmlNode referee = container.SelectSingleNode("child::*[@id='" + href + "']");
    
    		EmbedRefs(ref referee, ref container);
    
    		child.Attributes.RemoveNamedItem("href");
    
    		while (referee.FirstChild != null)
    		{
    			XmlNode fc = referee.FirstChild;
    			referee.RemoveChild(fc);
    			child.AppendChild(fc);
    		}
    
    		foreach (XmlAttribute attr in referee.Attributes)
    		{
    			if (attr.LocalName == "id")
    				continue;
    
    			child.Attributes.Append((XmlAttribute)attr.Clone());
    		}
    
    		referee.ParentNode.RemoveChild(referee);
    	}			
    }
    thx to anyone who takes a look

  2. #2
    Pii
    Guest
    strange implementation of the tree. such copying should take O(1) time with simple pointer swapping.
    I understand this piece of code takes O(tree size^2) time.
    traverse the whole tree and make a dictionary with href -> node pointer elements.
    with good dictionary implementation time should drop to O(nlogn) (O(n) for dictionary precomputation).
    I promise that I have read the FAQ and tried to use the Search to answer my question.

  3. #3
    Registered User upb's Avatar
    Join Date
    May 2003
    Posts
    50
    Blog Entries
    4

    ver3

    Big Thanks Pii for your suggestion

    I found what the bottleneck was:
    the repeated call to find the referee for each referrer

    stats for the new versions are:
    9.11.2007 17:21:36 select detached >
    9.11.2007 17:21:36 <. have 44940 items
    9.11.2007 17:21:37 build map >
    9.11.2007 17:21:37 <. took 0,7344925sec
    9.11.2007 17:21:37 embed >
    9.11.2007 17:22:11 <. took 33,724145sec

    The improvement comes from pulling the referrers and referees with a single XPath operation..


    I'll test with different amounts of data to discovere the time complexity,
    it seems nothing can be assumed of ms's dom implementation (i would think selecting a direct descendant by an attribute would be cheap.)
    and my inability to think about most basic properties of algorithms before implementing them :P
    Attached Files Attached Files
    Last edited by upb; November 9th, 2007 at 11:01.
    “The key to understanding complicated things is to know what not to look at and what not to compute and what not to think.”

  4. #4
    Registered User upb's Avatar
    Join Date
    May 2003
    Posts
    50
    Blog Entries
    4

    and now for the real WTF

    I was wondering about the 30sec for embed operation.. and

    First we refeactor a bit by extracting the embed operation for a referrer
    Code:
    foreach (DictionaryEntry reference in references)
    {
    	XmlNode referrer = (XmlNode) reference.Key;
    	XmlNode referee = (XmlNode) reference.Value;
    
    	EmbedNode(referee, referrer);
    }
    This of course doesn't change the performance.

    Move the referee finding to the Embed part:
    Code:
    foreach (XmlNode child in affectedNodes)
    {
    	if (child.Attributes["href"] != null)
    	{					
    		references.Add(child, null);
    	}
    }
    Code:
    foreach (DictionaryEntry reference in references)
    {
    	XmlNode referrer = (XmlNode) reference.Key;
    	string href = referrer.Attributes["href"].Value.Substring(1);
    	XmlNode referee = (XmlNode)IdToNode[href];
    	
    	EmbedNode(referee, referrer);
    }
    This doesn't change the performance either, why should it,
    but why use a dictionary when we only hold values and iterate it?
    Change it to an ArrayList:
    Code:
    ArrayList references = new ArrayList();
    and
    Code:
    ..
    references.Add(child);
    ..
    fix the iteration aswell:
    Code:
    foreach (XmlNode referrer in references)
    This change doesnt affect performance either, right?
    wrong:
    Code:
    9.11.2007 23:09:12 embed >
    9.11.2007 23:09:13 <. took 0,6407275sec
    I suspect it has to do with modifying objects that are used as Hashtable keys maybe..
    “The key to understanding complicated things is to know what not to look at and what not to compute and what not to think.”

  5. #5
    Pii
    Guest
    mate, there is no magic in calculating complexity.
    if I understand correctly, what you did before is:
    for each node, traverse whole tree to find node with "href" value matching href from the first node.
    if there are n nodes, then your execution time is n*n*c where c is time needed for one "visit node" operation.
    you have to visit each node anyway, so the only place to optimize is the searching. instead of searching whole tree each time, you can remember (href, [node1,..,nodeN]) pairs in some "fast" data structure like dictionary. then you can simply ask the dictionary for the node set.

    ofc maybe I'm missing something here - I admit I haven't read your code.
    I promise that I have read the FAQ and tried to use the Search to answer my question.

  6. #6
    Registered User upb's Avatar
    Join Date
    May 2003
    Posts
    50
    Blog Entries
    4
    Yes i agree, but there is a limitation i didnt mention before (which make it not O(n*n), but still your comment helped to get me thinking):

    All the elements with @id are children of a known element (container), so it's not needed to search the whole tree, just direct children of a node.

    That's why i thought the operation of finding the referee would be cheap.

    The number of @href elements equals the number of @id elements (m), so
    it's smth like O(m* // times embedRefs is called
    (m + // select referee
    n/m + // select detached children
    c // move children
    )) = O(m*m + n)

    But thx again
    “The key to understanding complicated things is to know what not to look at and what not to compute and what not to think.”

  7. #7
    I'd first parse the input XML to generate a hash table of the element names to offsets within the file, then copy each element in the hash table to the output XML, replacing the hrefs with their content as they're copied and marking the element name in the table so it doesn't get copied. This is O(N) time complexity where N is the total number of elements in the input.

Similar Threads

  1. [Q] embed exe as resource inside a win32 exe and launching from memory
    By Shub-nigurrath in forum Advanced Reversing and Programming
    Replies: 7
    Last Post: December 15th, 2013, 18:01
  2. Changing the address a dll references
    By Steve110 in forum OllyDbg Support Forums
    Replies: 4
    Last Post: June 19th, 2012, 17:46
  3. String references
    By Pompeyfan in forum OllyDbg Support Forums
    Replies: 8
    Last Post: March 3rd, 2004, 04:05
  4. help,Need all the asm references
    By highfly in forum Tools of Our Trade (TOT) Messageboard
    Replies: 3
    Last Post: January 28th, 2004, 22:47

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •