Natural language processing has improved substantially in the last few years due to increased computational power and the availability of large amounts of text data. Bidirectional Encoder Representations from Transformers (BERT) has further improved performance by using an auto-encoding model that incorporates larger bidirectional contexts. However, the mechanisms underlying BERT's effectiveness are not well understood. In this paper, we investigate how the BERT architecture and its pre-training protocol affect the geometry of its embeddings and the effectiveness of its features for classification tasks. As an auto-encoding model, BERT produces, during pre-training, representations that are context dependent and at the same time must be able to "reconstruct" the original input sentences. The complex interaction of these two requirements through the transformer layers leads to interesting geometric properties of the embeddings and, in turn, affects the inherent discriminability of the resulting representations. Our experimental results illustrate that BERT models do not produce "effective" contextualized representations for words, and that their improved performance may be due mainly to fine-tuning or to classifiers that model the dependencies explicitly by encoding syntactic patterns in the training data.