I thank the three reviewers for their valuable comments and suggestions, which greatly helped to improve this paper. In addition to the changes suggested by the reviewers (which are detailed below), this second version comes with all the required supplementary material, including a new version of the archeofrag package (v0.7).
I would also appreciate the Rmd file as an additional supplemental document for easier reproducibility, but the code can be copied from the pdf
The Rmd file has been added to the supplementary materials. I also checked and ensured that all the materials needed to reproduce the content of this paper are provided.
1) On Page 3, the first Figure and the discussion related to it are somewhat abstract. I would suggest to add a real-world archaeological example to explain to the dumb folks like me what this actually means in a concrete archaeological example. Otherwise the abstraction might scare people off.
A concrete example is given in Figure 2. Placing Figure 2 before Figure 1 would be possible, but would not be optimal for the presentation. I assume that anything unclear to the reader in Figure 1 will become clear a few paragraphs later with Figure 2.
2) It would be nice, although not absolutely necessary, to see how this method could be deployed in conjunction with other methods of stratigraphic analysis, as well as Bayesian modelling of radiocarbon dates. I would think that this method would be extremely useful in fine-tuning the modelling of C14 dates. It would be useful if the author could maybe comment on that and give an idea how this method might be used in combination with others?
This is an interesting suggestion, but it is far beyond the scope of this paper. Developments of the TSAR method (or based on this method) are ongoing and are not presented in this first paper. For example, the development version of the archeofrag package includes (from version 0.6.6, see https://github.com/sebastien-plutniak/archeofrag) new methods to weight the edges based on morphometric values of the objects and on the spatial distance between them.
3) I guess this method is still under development, but I wonder if the author might be able to comment a little more on how we could make this method something more commonplace? At the moment, it seems fairly inaccessible, so what will be the pathway to making it more accessible to a wide range of archaeologists (especially those like me who are a bit challenged by the maths)?
This paper is intended to give an in-depth presentation of a method that can nevertheless be used without a complete understanding of all the aspects addressed here. The "archeofrag" R package offers a set of functions to apply the method without going into its mathematical aspects. Given that not all archaeologists are familiar with the R language, I also developed an online interface with which any user can experiment (https://analytics.huma-num.fr/Sebastien.Plutniak/archeofrag). This online application, which offers few options for the moment, will be further developed in the future.
The “preliminary definitions” section includes some useful information, but I also think this could be reduced not just to save space but also to add to the clarity of the argument.
The reference to connections relying on “solid-state physics” and then the next sub-section with the discussion of Queen and Rook contiguity and the analogy to spatial analysis made a pretty clear distinction a bit more muddled in my reading.
I think a very simple definition of connection and similarity refits and Figure 2 is plenty and felt that the Queen/Rook example did not add clarity and was not referred to again so could be removed.
section 2.1.3 Topological properties could be basically eliminated
I added a figure and revised the text for clarity. It seems important to keep the paragraphs on contiguity, because contiguity is an important aspect of recording the data for refitting analysis. Consequently, I think it must be addressed in the main body of the text (and not in the supplementary material), because users of the TSAR method will face this problem.
section 2.1.2 could be simplified to make the two basic points that are necessary for the rest of the paper
1) archaeologists identify potential refits by either identifying pieces that fit together across substantial surfaces in 3D space or alternatively by looking for similarities in the objects themselves
2) you are only focused on the former in this paper.
I simplified this paragraph by removing the part on the epistemological aspects of the two types of relationships.
It wasn’t clear to me how the “merge graphs” step in the two initial layers works. I can think of several ways you might have chosen to implement this, but I do not think the choices you made are completely clear in this draft or the supplement. I think a short paragraph would be needed to address this.
I rephrased this part and hope that it is now clear. Regarding the way the algorithm works, I also added a description in pseudo-code, which complements the flowchart diagram. Readers interested in learning more can read the code of the function, which is included in the archeofrag package and abundantly commented.
the degree to which missing connections would hamper our ability to connect empirical patterns to generative processes. Given the nature of fragmentation and post-depositional processes, I expect that it would be common for archaeologists to miss refits due to damage to potential conjoining edges of pieces or all manner of other processes. How would this influence your results? I think this issue merits some discussion at least and perhaps some small analyses.
I had one suggestion: missing connections could potentially be evaluated with an additional global parameter such as a probability that a connection is removed once the graphs have been formed to essentially simulate “missed” or unidentifiable connections. Especially with the focus on “connections” rather than similarity, I think it’s likely that missed connections would be present in any empirical refit study and this simulation model provides a setup that could help evaluate the impacts.
Thank you for this very inspiring suggestion. I developed a new function in the package to simulate missed connections (the "frag.observer.failure" function), and used it to test the robustness of the edge count method versus the TSAR method. This gave a strong additional argument supporting the TSAR method, demonstrating that it is more robust and less sensitive to the “lack” of information. I briefly reported these results in section 3.2.2 (so as not to overload the main text) and referred to the supplementary materials for details.
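To make the principle of this test concrete, here is a minimal Python sketch of the reviewer's idea: each recorded connection is dropped independently with a given probability, as if the observer had failed to identify it. This is only an illustration under stated assumptions. It is not the code of the "frag.observer.failure" function, and the names `drop_connections`, `count_between_layers`, `miss_prob`, and the toy data are hypothetical.

```python
import random


def drop_connections(edges, miss_prob, seed=None):
    """Return the connections an imperfect observer would still record.

    Each edge is removed independently with probability `miss_prob`
    (hypothetical parameter name, not taken from archeofrag).
    """
    if not 0.0 <= miss_prob <= 1.0:
        raise ValueError("miss_prob must be between 0 and 1")
    rng = random.Random(seed)
    return [edge for edge in edges if rng.random() >= miss_prob]


def count_between_layers(edges, layer_of):
    """Count the connections whose two fragments lie in different layers."""
    return sum(1 for a, b in edges if layer_of[a] != layer_of[b])


# Hypothetical toy data: five fragments in two layers, four connections.
layer_of = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]

# Simulate an observer who misses about a quarter of the connections.
observed = drop_connections(edges, miss_prob=0.25, seed=1)
print(len(observed), "connections kept,",
      count_between_layers(observed, layer_of), "between layers")
```

Repeating such a draw many times and recomputing a statistic on each degraded graph gives a sense of how much that statistic depends on the completeness of the refitting data.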
section 3.1.1 you compare the admixture model and others to the archaeological intuition. Given the variability in the archaeological intuitive coding, I don’t know that the methods used here fitting in the same cluster is a particularly convincing argument in favor of these new methods (and that’s okay).
To me, the results you present suggest that people are generally not great at evaluating the relative order of topological patterns and that, in and of itself, argues for some sort of automated approach like you’ve developed here.
The clustering diagram and the discussion in the text is a bit confusing, however, as it stands as not every bar in the cluster diagram is labeled and it’s unclear which “secondary clusters” you are referring to in the text.
Beyond this, the Figure 8 caption says you are evaluating the four methods but it’s not entirely clear to me which is represented by which label in the existing cluster tree diagram. I think this figure will need to be modified.
I modified the figure and its caption to make them clearer, and revised the text as suggested.
The results discussed and shown in Figure 9 are helpful for thinking about how different values for these variables change with variation in balance and disturbance. I think this figure could be improved with a bit more guiding text. The x-axis is labeled Cohesion but Admixture is shown on this same axis with the same 0-1 scale so I would suggest changing that. It would also help to label the rows and columns with balance and disturbance respectively. Finally, I would suggest just increasing the border for the Admixture color as it isn’t really visible in the plots further to the left.
Also, I think this is an excellent place for you to return to your Table 1 insights and discuss how you might interpret various of these plots in light of that table.
I improved the readability of the figure as suggested (I modified the labels and increased the border of the red shades representing the admixture IQR). The caption of this figure and the text have been rewritten, with reference to Table 1 (which has been modified, limiting its use for direct interpretation in terms of post-depositional processes). The correspondence between the Figure with simulated results and the Table has also been enhanced by reordering their rows and columns in a similar fashion.
Figure 10: It would be interesting if these were also associated with a non-parametric correlation test like Spearman’s rho just to get a sense of how they differ in terms of rank order correlations.
Consider splitting Figure 11 into two figures so that the labels in the boxplot are clearer.
I modified the figure but kept the two plots side by side, to avoid too many figures and, above all, because they represent the same data and it makes sense to have them next to each other. I would have preferred to use marginal boxplots, but, as far as I can tell, ggplot2 does not allow facets and marginal plots to be used at the same time. In addition, as suggested, I computed the Spearman coefficients on the ranks of the 10 theoretical graphs computed by the four methods. A table reports the results, which are also discussed in the text.
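For readers unfamiliar with the statistic, the following Python sketch shows how a Spearman coefficient can be computed between the values that two methods assign to the same set of graphs: the values are converted to ranks (ties receive their average rank), and the Pearson correlation of the ranks is returned. This is a generic illustration, not the code used for the paper; the function names and the two lists of values are hypothetical and are not results from the manuscript.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical values given by two methods to the same five graphs.
method_a = [0.10, 0.35, 0.20, 0.80, 0.55]
method_b = [0.05, 0.40, 0.30, 0.90, 0.45]
print(round(spearman_rho(method_a, method_b), 3))
```

A coefficient close to 1 indicates that the two methods order the graphs in nearly the same way, whatever the absolute values they return.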
The empirical example from Liang Abu is informative for thinking about how this all would actually work in practice. Notably, the values for cohesion differ in a pretty obvious way but the other three measures (excluding Modularity) show pretty minor differences (and all show 0 for the 0&1 comparison which makes sense). How might you think about these results in relation to your schema in table 1? Is the difference in Cohesion substantial? What would you interpret as a meaningful difference here? Discussing this a bit in section 3.3.1 would be helpful in walking a reader through what you’ve done and provide a justification for the hypothesis testing and simulation you do in the next section.
The discussion in this paragraph has been revised and completed accordingly, with reference to Table 1 as suggested and strengthening the relation with the use of simulation in the next section.
The hypothesis testing is an important part of the overall argument and I have some questions and need a bit of clarification in a couple of places here. I followed along with your results and also replicated them in R using the code in the supplement. I found the presentation of these results to be fairly brief given that some major interpretations hinge on them. The set up makes sense as your question is basically whether you could consider this context as having one or two layers initially. You use Wilcox test to assess rank order differences in several variables, which seems like a sensible approach and find evidence for differences in the simulated distributions and not others. Given that you have a low but non-null admixture you are really conducting these simulations to assess differences in cohesion in terms of the schema you laid out in table 1. Since scenario 3 and scenario 4 have the same interpretation (and admixture doesn’t differ for either scenario) how do the results you present in this section and Figures 12 and 13 related to the setup you initially provided in table 1? I feel like this needs more discussion. The results for specific variables differ in terms of which hypothesis is more or less supported. The results seem equivocal for several variables and looking at the supplement although there are statistically significant rank order shifts the distributions are quite similar for some variables suggesting the actual effect is small. You land on an interpretation of two independent layers but I think this needs to be better justified. Given the low admixture value and the setup in table 1, is there a scenario that would have generated different results?
Thank you for pointing this out. I added a table providing a “grid” for interpretation (Table 7). The plots showing the simulated and empirical results by parameter now also include a boxplot, and the interquartile range is used to give a better reading of the relation between the empirical value and the simulated values and, ultimately, a more objective interpretation. However, as you suggest in your comment, the interpretation is (and must be) left to the archaeologist, since the final assertion is qualitative. I stress this point in the text and present a more prudent interpretation for the Liang Abu layers 1 and 2 case study.
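The reading based on the interquartile range can be sketched as follows in Python: the empirical value of a parameter is located relative to the first and third quartiles of the values simulated under a given hypothesis. This is a simplified illustration of the principle, not the procedure implemented for the paper; the function name `position_vs_iqr` and the simulated values are hypothetical, and the quartiles are those returned by the default method of `statistics.quantiles`.

```python
import statistics


def position_vs_iqr(empirical, simulated):
    """Locate an empirical value relative to the IQR of simulated values.

    Returns "below", "within", or "above".
    """
    q1, _median, q3 = statistics.quantiles(simulated, n=4)
    if empirical < q1:
        return "below"
    if empirical > q3:
        return "above"
    return "within"


# Hypothetical simulated values of one parameter under one hypothesis.
simulated = list(range(1, 100))

for value in (10, 50, 90):
    print(value, position_vs_iqr(value, simulated))
```

Such a reading only says where the empirical value falls in the simulated distribution. Deciding what this means for the layers remains a qualitative judgement left to the archaeologist.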
I would also be interested in seeing an exploration of the relationship between structural admixture and the absolute difference in cohesion for a range of scenarios.
this empirical example is only able to capture a small range of the possible scenarios that might be encountered in real settings. In general, I would like to see what kind of numbers would generate higher admixture values and how those might compare to what is shown here.
The Liang Abu case study is indeed limited. Presenting a wider range of possibilities is the role of Figure 9 (now Figure 10), unless I have misunderstood what you mean here.
In general, the set up in the beginning of the paper outlines the relationship between admixture and cohesion in table 1 but this doesn’t come in to play in the interpretation of the empirical example and I think this makes the final interpretation of the hypothesis test less impactful. I would suggest returning to the table 1 set up in the hypothesis testing section and perhaps even discussing in more detail other realistic simulated datasets that would land you somewhere else on table 1.
I think the major thing that could be improved is the connection between the initial discussion of the method (section 2.2 in particular) and the hypothesis testing and simulation in the end.
I have tried to address this general request (see my answers to the previous points).
In addition to the text and supplement, I also installed and reviewed the R package. I was able to install this but I had to roll back to and earlier (3.6) version of R as one dependency was not available for the most recent version (4.0) of R. If that is an easy fix (looking at the BiocManager packages) it would be helpful to potential future users. As for the supplement, I was not able to fully replicate those results because there are a couple of files that were called in the code that weren’t provided on the OSF link as the markdown document in pdf was the only document posted. I think it would be useful to post the supplement as an Rmd file along with all of the required files to completely replicate the results. Once I was able to get the package installed in R 3.6, I was able to run all of the functions without any problems.
The compatibility of the archeofrag package is checked with continuous integration procedures: on GitHub for macOS, Windows, and Linux (Ubuntu 20.04), see https://github.com/sebastien-plutniak/archeofrag/actions/runs/785741281 ; and on Travis CI for Linux (Ubuntu 18.04.5 LTS), see https://travis-ci.org/github/sebastien-plutniak/archeofrag . I hope that other users will not encounter problems, and I will continue to maintain the package in the future.