Photoacoustic (PA) imaging is an emerging hybrid medical imaging modality based on the optical excitation of chromophores, light-absorbing molecules such as hemoglobin and lipids, to infer underlying vascular structure. Energy delivered as pulsed laser light causes rapid, successive thermoelastic expansion and contraction in these molecules, generating ultrasound waves that can be measured with transducer arrays. The raw sensor data are represented in k-space, from which a Cartesian-space image is reconstructed using rule-based algorithms. These reconstructions tend to be noisy and artifact-laden, but the recent widespread adoption of deep learning has enabled post-processing that significantly improves them. UNet, in particular, has had a far-reaching impact on the medical imaging domain, and PA imaging has been no exception, with a myriad of solutions built upon it. In this paper, we investigate the efficacy of replacing convolution-based feature generation for post-processing PA reconstructions with a Vision Transformer (ViT) approach, motivated by its recent success in computer vision. Specifically, we examine the ability of Shifted Window (Swin) ViTs to restore an artifact-free vascular image from an artifact-heavy image reconstructed with the time-reversal algorithm.
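As background on the Swin mechanism invoked above, the core operations that distinguish it from a plain ViT are non-overlapping window partitioning and the cyclic shift applied before shifted-window attention. The following is a minimal NumPy sketch of those two steps only (function names and the toy feature-map dimensions are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping M x M windows.

    Attention in Swin is computed independently within each window,
    rather than globally over all tokens as in a plain ViT.
    """
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    # reorder so each window's pixels are contiguous: (num_windows, M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M, M, C)

def cyclic_shift(x, M):
    """Cyclically shift the map by -M//2 along both spatial axes.

    This is the displacement applied before the shifted-window attention
    layer, so that successive layers mix information across window borders.
    """
    return np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))

# toy example: an 8x8 feature map with 4 channels, window size 4
x = np.arange(8 * 8 * 4, dtype=float).reshape(8, 8, 4)
wins = window_partition(x, 4)                       # 4 windows of (4, 4, 4)
shifted_wins = window_partition(cyclic_shift(x, 4), 4)
```

Alternating these two partitionings between consecutive attention layers is what lets Swin keep per-window attention cost linear in image size while still propagating information globally.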